I am trying to remove stopwords from a string of text:

from nltk.corpus import stopwords
text="hello bye the the hi"
text=" ".join([word for word in text.split() if word not in (stopwords.words('english'))])

I am processing 6 million such strings, so speed is important. Profiling my code shows that the slowest part is the lines above. Is there a better way to do this? I’m thinking of using something like re.sub, but I don’t know how to write the pattern for a set of words. Can someone give me a hand? I’m also happy to hear about other, possibly faster methods.

Note: I tried someone’s suggestion of wrapping stopwords.words('english') with set(), but that made no difference.

Thank you.

Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.

    from nltk.corpus import stopwords

    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        text="hello bye the the hi"
        text=" ".join([word for word in text.split() if word not in stopwords.words("english")])

    def testFuncNew():
        text="hello bye the the hi"
        text=" ".join([word for word in text.split() if word not in cachedStopWords])

    if __name__ == "__main__":
        for i in range(10000):  # xrange exists only in Python 2
            testFuncOld()
            testFuncNew()

I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.

    nCalls  Cumulative Time  Function
    10000   7.723            words.py:7(testFuncOld)
    10000   0.140            words.py:11(testFuncNew)

So, caching the stopwords list gives roughly a 55x speedup (7.723 s versus 0.140 s cumulative time).
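
The caching idea above can be sketched as a small reusable function. This is a minimal sketch, not the answerer's code: STOP_WORDS is a hypothetical stand-in for stopwords.words("english") so that it runs without the NLTK corpus, and remove_stopwords is a name made up for illustration. It also builds the lookup as a frozenset once, which may explain why wrapping the call in set() inside the comprehension made no difference: the set would still be rebuilt for every string.

```python
# Hypothetical stand-in for stopwords.words("english"), so the sketch runs
# without the NLTK corpus installed.
STOP_WORDS = ["the", "a", "an", "is", "of", "and"]

# Build the lookup structure once, at import time, not on every call.
CACHED_STOP_WORDS = frozenset(STOP_WORDS)


def remove_stopwords(text, stop_words=CACHED_STOP_WORDS):
    """Return text with every whitespace-separated stopword dropped."""
    return " ".join([word for word in text.split() if word not in stop_words])


print(remove_stopwords("hello bye the the hi"))  # prints: hello bye hi
```

With the real corpus you would presumably replace STOP_WORDS with the cached result of stopwords.words("english").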

Use a regexp to remove all words which match a stopword:

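import re
from nltk.corpus import stopwords

# One alternation of all stopwords; \b prevents matches inside longer words.
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)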

This is likely to be faster than looping yourself, especially for large input strings, but it is worth measuring on your own data.

If the last word in the text gets deleted by this, you may be left with trailing whitespace. I suggest handling that separately, for example with a final rstrip().
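
A minimal sketch of the regex approach with the trailing whitespace handled, under a few assumptions: STOP_WORDS is a hypothetical stand-in for the NLTK list, remove_stopwords_regex is an illustrative name, and re.escape is added defensively in case a word ever contains a regex metacharacter.

```python
import re

# Hypothetical stand-in for stopwords.words("english").
STOP_WORDS = ["the", "a", "an", "is", "of", "and"]

# Compile once. (?:...) groups the alternation without capturing,
# and \s* swallows the whitespace that follows a removed word.
PATTERN = re.compile(r"\b(?:" + "|".join(map(re.escape, STOP_WORDS)) + r")\b\s*")


def remove_stopwords_regex(text):
    """Delete stopwords, then strip whitespace left at either end."""
    return PATTERN.sub("", text).strip()


print(remove_stopwords_regex("hello bye the the hi"))  # prints: hello bye hi
print(remove_stopwords_regex("end of the"))            # prints: end
```

Note that \b treats punctuation as a word boundary, so this variant can behave differently from a split()-based filter on text such as "the-end".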

Sorry for the late reply; this may prove useful for new users.

  • Create a dictionary of stopwords using the collections library
  • Use that dictionary for very fast lookups (O(1) on average) rather than searching a list (O(n) in the number of stopwords)

    from collections import Counter
    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')
    stopwords_dict = Counter(stop_words)
    text=" ".join([word for word in text.split() if word not in stopwords_dict])
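
To see how much the container type matters on your own machine, a small timeit comparison along these lines may help. It is only a sketch: STOP_WORDS is a hypothetical list padded with filler entries to roughly the size of a real stopword list, filter_words is an illustrative name, and the timings will vary between runs and interpreters.

```python
import timeit

# Hypothetical stopword list, padded with filler entries so that the
# linear scan of the list has a realistic amount of work to do.
STOP_WORDS = ["the", "a", "an", "is", "of", "and"] + ["filler%d" % i for i in range(200)]

as_list = list(STOP_WORDS)
as_set = set(STOP_WORDS)
as_dict = dict.fromkeys(STOP_WORDS)


def filter_words(text, container):
    """Drop every word that is a member of container."""
    return " ".join([w for w in text.split() if w not in container])


text = "hello bye the the hi"
for name, container in [("list", as_list), ("set", as_set), ("dict", as_dict)]:
    seconds = timeit.timeit(lambda: filter_words(text, container), number=20000)
    print("%-4s %.4f s" % (name, seconds))
```

Sets and dicts both use hashing for membership tests, so any gap between them should be small compared with the gap to the list.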
    

First, you’re rebuilding the stopword list for every string. Build it once. A set would indeed be a good fit here.

forbidden_words = set(stopwords.words('english'))

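Next, you can drop the [] inside join and pass a generator expression instead. Note that in CPython str.join turns a generator into a list internally, so this is mostly a readability change and is not guaranteed to be faster.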

Replace

' '.join([x for x in ['a', 'b', 'c']])

with

' '.join(x for x in ['a', 'b', 'c'])

The next thing to look at would be making .split() yield values instead of returning a list. I believe a regex would be a good replacement here. See this thread for why s.split() is actually fast.
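
For illustration, a lazily yielding alternative to .split() could look like the sketch below, which uses re.finditer. The names STOP_WORDS, WORD_RE and iter_kept_words are made up for this example, and since s.split() is already fast there is no guarantee this variant wins; treat it as something to benchmark rather than a recommendation.

```python
import re

# Hypothetical stand-in for the NLTK stopword list, stored as a set.
STOP_WORDS = frozenset(["the", "a", "an", "is", "of", "and"])

# \S+ matches each run of non-whitespace characters, like str.split() does.
WORD_RE = re.compile(r"\S+")


def iter_kept_words(text):
    """Yield non-stopwords one at a time without building a word list."""
    for match in WORD_RE.finditer(text):
        word = match.group()
        if word not in STOP_WORDS:
            yield word


print(" ".join(iter_kept_words("hello bye the the hi")))  # prints: hello bye hi
```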

Lastly, run a job like this (removing stop words from 6 million strings) in parallel. That is a whole different topic.

Try avoiding the loop and using a regex to remove the stopwords instead:

import re
from nltk.corpus import stopwords

cachedStopWords = stopwords.words("english")
pattern = re.compile(r'\b(' + r'|'.join(cachedStopWords) + r')\b\s*')
text = pattern.sub('', text)

Using just a regular dict seems to be the fastest solution, surpassing even the Counter solution by about 10%.

from nltk.corpus import stopwords
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text="hello bye the the hi"
text = " ".join([word for word in text.split() if word not in stopwords_dict])

Tested using the cProfile profiler

You can find the test code used here:
https://gist.github.com/maxandron/3c276924242e7d29d9cf980da0a8a682

EDIT:

On top of that, if we replace the list comprehension with a loop, we get another 20% increase in performance:

from nltk.corpus import stopwords
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text="hello bye the the hi"

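new = ""
for word in text.split():
    if word not in stopwords_dict:
        new += word + " "  # keep a separator so the words do not run together
text = new.rstrip()  # drop the trailing space left after the last word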
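
To check the comprehension-versus-loop difference yourself, a timeit harness like the following sketch can be used. STOP_WORDS is again a hypothetical stand-in for the NLTK list, the two function names are illustrative, and the loop variant collects words in a list rather than concatenating strings, so it is not identical to the code above. The percentages quoted in this answer came from the author's own cProfile runs and may not reproduce on other machines or Python versions.

```python
import timeit

# Hypothetical stand-in for the NLTK stopword list, stored as a dict.
STOP_WORDS = dict.fromkeys(["the", "a", "an", "is", "of", "and"])


def with_comprehension(text):
    """Filter with a list comprehension, then join."""
    return " ".join([w for w in text.split() if w not in STOP_WORDS])


def with_loop(text):
    """Filter with an explicit loop, then join."""
    kept = []
    for w in text.split():
        if w not in STOP_WORDS:
            kept.append(w)
    return " ".join(kept)


text = "hello bye the the hi"
for func in (with_comprehension, with_loop):
    seconds = timeit.timeit(lambda: func(text), number=100000)
    print("%-20s %.4f s" % (func.__name__, seconds))
```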