[Solved] Python TfidfVectorizer throwing: “empty vocabulary; perhaps the documents only contain stop words”
I’m trying to use scikit-learn’s TfidfVectorizer to transform a corpus of text.
However, when I call fit_transform on it, I get ValueError: empty vocabulary; perhaps the documents only contain stop words.
In : TfidfVectorizer().fit_transform(smallcorp)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-69-ac16344f3129> in <module>()
----> 1 TfidfVectorizer().fit_transform(smallcorp)

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
   1217         vectors : array, [n_samples, n_features]
   1218         """
-> 1219         X = super(TfidfVectorizer, self).fit_transform(raw_documents)
   1220         self._tfidf.fit(X)
   1221         # X is already a transformed view of raw_documents so

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
    778         max_features = self.max_features
    779
--> 780         vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
    781         X = X.tocsc()
    782

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    725             vocabulary = dict(vocabulary)
    726             if not vocabulary:
--> 727                 raise ValueError("empty vocabulary; perhaps the documents only"
    728                                  " contain stop words")
    729

ValueError: empty vocabulary; perhaps the documents only contain stop words
I read through the SO question “Problems using a custom vocabulary for TfidfVectorizer scikit-learn” and tried ogrisel’s suggestion of using TfidfVectorizer(**params).build_analyzer()(dataset2) to check the results of the text analysis step. That seems to be working as expected, as the snippet below shows:
In : TfidfVectorizer().build_analyzer()(smallcorp)
Out: [u'due', u'to', u'lack', u'of', u'personal', u'biggest', u'education', u'and', u'husband', u'to',
Is there something else that I am doing wrong? The corpus I am feeding it is just one giant string with newlines separating the lines.
I guess it’s because you are passing a single string, while the vectorizer expects an iterable of documents (for example, a list of strings). Try splitting it into a list of strings, e.g.:
In : smallcorp
Out: 'Ah! Now I have done Philosophy,\nI have finished Law and Medicine,\nAnd sadly even Theology:\nTaken fierce pains, from end to end.\nNow here I am, a fool for sure!\nNo wiser than I was before:'
In : tf = TfidfVectorizer()
In : tf.fit_transform(smallcorp.split('\n'))
Out: <6x28 sparse matrix of type '<type 'numpy.float64'>' with 31 stored elements in Compressed Sparse Row format>
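To see why a bare string can end up with an empty vocabulary, here is a minimal sketch that mimics the tokenization step with the standard `re` module instead of calling scikit-learn. It assumes the default `token_pattern` of `(?u)\b\w\w+\b` (tokens of two or more word characters) and assumes the vectorizer simply iterates over whatever it is given, which for a string means one character at a time. The helper name `build_vocabulary` is made up for this illustration. Newer scikit-learn releases, as far as I know, reject a bare string with a more explicit error instead.

```python
import re

# Default token pattern of TfidfVectorizer/CountVectorizer: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(raw_documents):
    """Hypothetical helper: collect lowercase tokens from an iterable of documents."""
    vocabulary = set()
    for doc in raw_documents:  # iterating over a str yields single characters
        vocabulary.update(TOKEN_PATTERN.findall(doc.lower()))
    return vocabulary


smallcorp = "Ah! Now I have done Philosophy,\nI have finished Law and Medicine,"

# Each "document" is a single character, so nothing matches the pattern.
vocab_from_string = build_vocabulary(smallcorp)
# Each "document" is one line of text.
vocab_from_lines = build_vocabulary(smallcorp.split("\n"))

print(vocab_from_string)         # set()
print(sorted(vocab_from_lines))
```

This would also explain why the build_analyzer() check in the question looked fine: the analyzer is applied to one document at a time, so handing it the whole string tokenizes it as a single document.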
In version 0.12, we set the minimum document frequency to 2, which means that only words that appear in at least two documents are considered. For your example to work, you need to set min_df=1. Since 0.13, this is the default setting.
So I guess you are using 0.12, right?
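The effect of a minimum document frequency can be illustrated without scikit-learn. The sketch below counts, for each token, the number of documents it appears in and keeps only tokens at or above a threshold. `filter_by_min_df` is a hypothetical helper, and the tokenization is a simplified stand-in for the vectorizer's default analysis.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def filter_by_min_df(documents, min_df):
    """Hypothetical helper: keep tokens that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in documents:
        # set() so a token is counted once per document, however often it repeats
        doc_freq.update(set(TOKEN_PATTERN.findall(doc.lower())))
    return {token for token, df in doc_freq.items() if df >= min_df}


docs = ["Taken fierce pains, from end to end.", "No wiser than I was before:"]

vocab_df2 = filter_by_min_df(docs, min_df=2)  # no token is shared by both lines
vocab_df1 = filter_by_min_df(docs, min_df=1)

print(vocab_df2)       # set()
print(len(vocab_df1))  # 11
```

Note that "end" occurs twice in the first line but still has a document frequency of 1, so a threshold of 2 drops it.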
Alternatively, if you insist on keeping a single string, you can wrap it in a tuple. Instead of:
smallcorp = "your text"
put it inside a one-element tuple:
In : smallcorp = ("your text",)
In : tf.fit_transform(smallcorp)
Out: <1x2 sparse matrix of type '<type 'numpy.float64'>' with 2 stored elements in Compressed Sparse Row format>
I encountered a similar error while running a TF-IDF Python 3 script over a large corpus. Some small files (apparently) lacked keywords, which triggered the error.
I tried several solutions (such as adding dummy strings to my filtered list if len(filtered) == 0, …), but they did not help. The simplest solution was to wrap the call in a try: ... except ...: continue block.
from sklearn.feature_extraction.text import CountVectorizer

# Raw string, so \b is a regex word boundary and not a backspace character.
pattern = r"(?u)\b[\w-]+\b"
cv = CountVectorizer(token_pattern=pattern)

# filtered is a list
filtered = [w for w in filtered if w not in my_stopwords and not w.isdigit()]

# ValueError:
#     cv.fit(text)
#   File "tfidf-sklearn.py", line 1675, in tfidf
#     cv.fit(filtered)
#   File "/home/victoria/venv/py37/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 1024, in fit
#     self.fit_transform(raw_documents)
#   ...
# ValueError: empty vocabulary; perhaps the documents only contain stop words

# Did not help:
# https://stackoverflow.com/a/20933883/1904943
#
# if len(filtered) == 0:
#     filtered = ['xxx', 'yyy', 'zzz']

# Solution (the continue implies this runs inside a loop, presumably over the files):
try:
    # fit_transform fits the vocabulary and returns the term matrix in one call
    doc_freq_term_matrix = cv.fit_transform(filtered)
except ValueError:
    continue
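An alternative to catching the exception is to check up front whether any token would survive filtering. This is a rough sketch of that different approach, not what the answer above does: it reuses the answer's token pattern with plain `re`, and `my_stopwords`, `has_usable_tokens`, and the sample `files` dict are placeholders invented for illustration. It only approximates the vectorizer's analysis step.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b[\w-]+\b")  # same pattern as in the answer
my_stopwords = {"the", "and", "of"}            # placeholder stop-word list


def has_usable_tokens(words):
    """Return True if at least one token survives stop-word and digit filtering."""
    for word in words:
        for token in TOKEN_PATTERN.findall(word.lower()):
            if token not in my_stopwords and not token.isdigit():
                return True
    return False


# Placeholder corpus: file name -> list of words
files = {
    "a.txt": ["the", "42", "and"],
    "b.txt": ["protein", "binding", "of", "7"],
}

# Keep only the files that would yield a non-empty vocabulary.
usable = [name for name, words in files.items() if has_usable_tokens(words)]
print(usable)  # ['b.txt']
```

The try/except version is shorter and always agrees with the vectorizer; a pre-check like this mainly helps if you want to log or count the skipped files explicitly.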
I also had the same problem.
Converting my list of ints to a list of strings with str() didn’t help.
What did work was adding a letter prefix to each number:
['d' + str(num) for num in nums]  # nums is the collection of numbers; 'd' is an arbitrary letter that makes each item a word-like string
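One possible explanation, assuming the numbers were single digits (the answer does not say): the default token pattern only keeps tokens of two or more word characters, so `'5'` produces no token while `'d5'` does. A quick check with `re`, using that default pattern and made-up sample data:

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn's default token_pattern

nums = [3, 5, 8]  # made-up sample data

plain = [str(n) for n in nums]
prefixed = ["d" + str(n) for n in nums]

# Tokens that the default pattern would extract from each variant
tokens_plain = [t for s in plain for t in TOKEN_PATTERN.findall(s)]
tokens_prefixed = [t for s in prefixed for t in TOKEN_PATTERN.findall(s)]

print(tokens_plain)     # []
print(tokens_prefixed)  # ['d3', 'd5', 'd8']
```

If that is the cause, passing a token_pattern that accepts single characters, such as r"(?u)\b\w+\b", should also work without the prefix.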