So I have a dataset that I would like to remove stop words from using
I'm struggling with how to use this in my code to simply take out these words. I already have a list of the words from this dataset; the part I'm struggling with is comparing against this list and removing the stop words.
Any help is appreciated.
from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
You could also do a set diff, for example:
list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
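One caveat with the set difference: it discards word order and duplicate words. If those matter, a minimal sketch is to build the stop word set once and filter with a comprehension. The small `STOP_WORDS` set and the helper name `remove_stop_words` below are stand-ins so the snippet runs without NLTK data; in practice the set would be `set(stopwords.words('english'))`.

```python
# Stand-in stop word set (assumption); with NLTK this would be
# set(stopwords.words('english')).
STOP_WORDS = {"the", "a", "is", "in", "and"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Return the words not in stop_words, keeping order and duplicates."""
    return [w for w in words if w not in stop_words]


word_list = ["the", "cat", "is", "in", "the", "hat", "and", "the", "cat"]

ordered = remove_stop_words(word_list)   # order and duplicates survive
unordered = set(word_list) - STOP_WORDS  # set difference: unique words only

print(ordered)
print(unordered)
```

Membership tests against a set are also faster than against a list, which helps on large datasets.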
To exclude all types of stop words, including the nltk stop words, you could do something like this:
from stop_words import get_stop_words
from nltk.corpus import stopwords
stop_words = list(get_stop_words('en'))  # About 900 stopwords
nltk_words = list(stopwords.words('english'))  # About 150 stopwords
stop_words.extend(nltk_words)
output = [w for w in word_list if w not in stop_words]
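Since the two sources overlap, extending one list with the other leaves duplicates in it. A possible variation is to merge them as sets instead. The two short lists below are made-up stand-ins for the results of `get_stop_words('en')` and `stopwords.words('english')`, so that the sketch runs on its own.

```python
# Stand-in lists (assumption); in practice these would come from
# get_stop_words('en') and stopwords.words('english').
package_words = ["the", "a", "an", "of"]
nltk_words = ["the", "a", "is", "in"]

# The union drops the duplicates and gives fast membership tests.
combined = set(package_words) | set(nltk_words)

word_list = ["an", "apple", "is", "in", "the", "basket"]
output = [w for w in word_list if w not in combined]

print(sorted(combined))
print(output)
```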
I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:
filtered_word_list = word_list[:]  # make a copy of the word_list
for word in word_list:  # iterate over word_list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove word from filtered_word_list if it is a stopword
There's a very simple, lightweight Python package, stop-words, made just for this purpose.
First install the package using:
pip install stop-words
Then you can remove your words in one line using a list comprehension:
from stop_words import get_stop_words
filtered_words = [word for word in dataset if word not in get_stop_words('english')]
This package is very lightweight to download (unlike nltk), works for both Python 2 and Python 3, and has stop words for many other languages, such as:
Arabic Bulgarian Catalan Czech Danish Dutch English Finnish French German Hungarian Indonesian Italian Norwegian Polish Portuguese Romanian Russian Spanish Swedish Turkish Ukrainian
Use the textcleaner library to remove stop words from your data.
Follow these steps to do so with this library:
pip install textcleaner
import textcleaner as tc
data = tc.document(<file_name>)  # you can also pass a list of sentences to the document class constructor
data.remove_stpwrds()  # inplace is set to False by default
Use the above code to remove the stop words.
Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
text = " ".join([word for word in text.split() if word not in STOPWORDS])  # delete stopwords from text
You can use this function. Note that you need to lowercase all the words:
from nltk.corpus import stopwords
def remove_stopwords(word_list):
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lowercased
        if word not in stopwords.words("english"):
            processed_word_list.append(word)
    return processed_word_list
from nltk.corpus import stopwords
# ...
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
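A function like this returns every word lowercased. If the original casing should survive (for names or acronyms, say), one option is to lowercase only for the comparison. This is a minimal sketch; `STOP_WORDS` is a small stand-in set and `remove_stopwords_keep_case` is a hypothetical helper name, not part of NLTK.

```python
# Stand-in stop word set (assumption); with NLTK this would be
# set(stopwords.words("english")).
STOP_WORDS = {"the", "is", "a"}


def remove_stopwords_keep_case(word_list, stop_words=STOP_WORDS):
    """Drop stop words case-insensitively, but keep the original casing."""
    return [w for w in word_list if w.lower() not in stop_words]


result = remove_stopwords_keep_case(
    ["The", "NLTK", "library", "is", "a", "Toolkit"]
)
print(result)
```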
Although the question is a bit old, here is a newer library worth mentioning that can do some extra tasks.
In some cases, you don't only want to remove stop words. Rather, you want to find the stop words in the text data and store them in a list, so that you can identify the noise in the data and make it more interactive.
The library is called 'textfeatures'. You can use it as follows:
!pip install textfeatures
import textfeatures as tf
import pandas as pd
For example, suppose you have the following set of strings:
texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"]
df = pd.DataFrame(texts)  # Convert to a dataframe
df.columns = ['text']  # give a name to the column
df
Now, call the stopwords() function and pass the parameters you want:
tf.stopwords(df, "text", "stopwords")  # extract stop words
df[["text", "stopwords"]].head()  # show the text and stopwords columns
The result is going to be:
                                text         stopwords
0           blue car and blue window             [and]
1           black crow in the window         [in, the]
2  i see my reflection in the window  [i, my, in, the]
As you can see, the last column contains the stop words found in that document (record).
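If you would rather not add a dependency, the same idea can be sketched in plain Python. This is not how textfeatures is implemented; it is a minimal illustration that assumes a small stand-in stop word set (`STOP_WORDS`) and a hypothetical helper `find_stop_words`, which keeps each stop word once, in order of first appearance.

```python
# Stand-in stop word set (assumption), just large enough for the example.
STOP_WORDS = {"i", "my", "in", "the", "and"}


def find_stop_words(text, stop_words=STOP_WORDS):
    """Return the stop words found in text, each once, in order of appearance."""
    found = [w for w in text.lower().split() if w in stop_words]
    return list(dict.fromkeys(found))  # de-duplicate while keeping order


texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window",
]
results = [find_stop_words(t) for t in texts]

for text, found in zip(texts, results):
    print(text, found)
```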
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
# Either filter with a list comprehension...
filtered_sentence = [w for w in word_tokens if w not in stop_words]
# ...or build the same list with an explicit loop
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
print(word_tokens)
print(filtered_sentence)
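Note that with this approach a capitalized word such as "This" and punctuation tokens such as "," and "." pass through the filter, because the comparison is case-sensitive and the NLTK stop word list is lowercase. A possible refinement, sketched here with a hand-written token list standing in for the `word_tokenize` output and a small stand-in `STOP_WORDS` set, is to compare lowercased tokens and keep only alphabetic ones.

```python
# Stand-in stop word set (assumption); with NLTK this would be
# set(stopwords.words('english')).
STOP_WORDS = {"this", "is", "a", "off", "the"}

# Hand-written tokens standing in for word_tokenize(example_sent).
word_tokens = [
    "This", "is", "a", "sample", "sentence", ",", "showing",
    "off", "the", "stop", "words", "filtration", ".",
]

# Keep alphabetic tokens whose lowercase form is not a stop word.
cleaned = [
    w for w in word_tokens
    if w.isalpha() and w.lower() not in STOP_WORDS
]
print(cleaned)
```

Be aware that `str.isalpha()` also drops numbers and tokens containing apostrophes or hyphens, so this may be too aggressive for some datasets.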
In case your data are stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stopwords list by default.
import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])
I will show you an example.
First, I extract the text data from the data frame (twitter_df) for further processing, as follows:
from nltk.tokenize import word_tokenize
tweetText = twitter_df['text']
Then, to tokenize, I use the following method:
from nltk.tokenize import word_tokenize
tweetText = tweetText.apply(word_tokenize)
Then, to remove stop words:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tweetText = tweetText.apply(lambda x: [word for word in x if word not in stop_words])
tweetText.head()
I think this will help you.
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_words = ["a", "an", "the", "in"]
another_list = []
for x in userstring:
    if x not in stop_words:  # compare against the list and keep only non-stop words
        another_list.append(x)
for x in another_list:
    print(x, end=' ')

# 2) It is also possible to use .remove; the code will be like this:
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_words = ["a", "an", "the", "in"]
for x in userstring[:]:  # loop over a copy, because userstring is modified inside the loop
    if x in stop_words:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')
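One pitfall with the `.remove` approach is worth spelling out: removing items from a list while looping over that same list makes the loop skip the element that follows each removal, so some stop words can survive. The sketch below uses a fixed sentence instead of `input()` so that it is reproducible; the variable names are only illustrative.

```python
stop_words = ["a", "an", "the", "in"]

# Buggy: removing from the list being iterated skips the next element.
words = "the a cat in the hat".split(" ")
for x in words:
    if x in stop_words:
        words.remove(x)
buggy = words

# Safer: iterate over a copy (words[:]) and remove from the original.
words = "the a cat in the hat".split(" ")
for x in words[:]:
    if x in stop_words:
        words.remove(x)
fixed = words

print(buggy)
print(fixed)
```

Building a new list with the stop words filtered out avoids the problem entirely and is usually the simpler choice.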