I am using the Gensim Python package to learn a neural language model, and I know that you can provide a training corpus to learn the model. However, there already exist many precomputed word vectors available in text format (e.g. http://www-nlp.stanford.edu/projects/glove/). Is there some way to initialize a Gensim Word2Vec model that just makes use of some precomputed vectors, rather than having to learn the vectors from scratch?
The GloVe dump from the Stanford site is in a format that differs only slightly from the word2vec format. You can convert the GloVe file into word2vec format using:
python -m gensim.scripts.glove2word2vec --input glove.840B.300d.txt --output glove.840B.300d.w2vformat.txt
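For a rough idea of what that conversion involves, here is a minimal pure-Python sketch. It assumes the only difference between the two text formats is the header line (vector count and dimensionality) that word2vec expects at the top. The function name `glove_to_word2vec_text` and the tiny sample file are made up for illustration; this is not the Gensim script itself.

```python
import os
import tempfile


def glove_to_word2vec_text(glove_path, output_path):
    """Copy a GloVe text file, prepending a 'count dimensions' header line."""
    num_vectors = 0
    vector_size = 0
    # First pass: count the vectors and read the dimensionality off the first line.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue
            if num_vectors == 0:
                vector_size = len(line.rstrip("\n").split(" ")) - 1
            num_vectors += 1
    # Second pass: write the header, then the vectors unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{num_vectors} {vector_size}\n")
        for line in src:
            if line.strip():
                dst.write(line.rstrip("\n") + "\n")
    return num_vectors, vector_size


# Hypothetical two-word, three-dimensional sample standing in for a real GloVe file.
workdir = tempfile.mkdtemp()
glove_path = os.path.join(workdir, "sample.glove.txt")
w2v_path = os.path.join(workdir, "sample.w2vformat.txt")
with open(glove_path, "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3\nand 0.4 0.5 0.6\n")

counts = glove_to_word2vec_text(glove_path, w2v_path)
with open(w2v_path, encoding="utf-8") as f:
    converted_lines = f.read().splitlines()
```

The two-pass approach avoids holding the whole file in memory, which matters for dumps the size of glove.840B.300d.txt. Tokens that themselves contain spaces would throw off the dimension count in this sketch.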
You can download pre-trained word vectors from here (get the file ‘GoogleNews-vectors-negative300.bin’):
Extract the file, and then you can load it in Python like this:
import os
import gensim
model = gensim.models.word2vec.Word2Vec.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)
model.most_similar('dog')
EDIT (May 2017):
As the above code is now deprecated, this is how you’d load the vectors now:
model = gensim.models.KeyedVectors.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)
As far as I know, Gensim can load two binary formats, word2vec and fastText, and a generic plain text format that most word embedding tools can produce. The generic plain text format looks like this (in this example, 20000 is the size of the vocabulary and 100 is the length of each vector):
20000 100
the 0.476841 -0.620207 -0.002157 0.359706 -0.591816 [95 more numbers...]
and 0.223408 0.231993 -0.231131 -0.900311 -0.225111 [95 more numbers...]
[19998 more lines...]
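To make the layout concrete, the sketch below parses this plain text format by hand. It assumes one header line followed by one space-separated `word value value ...` line per vector. The function name `read_word2vec_text` and the shortened three-dimensional sample are hypothetical, and in practice Gensim's own loader would be the tool to use.

```python
import io


def read_word2vec_text(stream):
    """Parse the generic text format into a dict of word -> list of floats."""
    header = stream.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.rstrip("\n").split(" ")
        if len(parts) != vector_size + 1:
            raise ValueError(f"expected {vector_size} values, got {len(parts) - 1}")
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError(f"header promised {vocab_size} vectors, found {len(vectors)}")
    return vectors


# Shortened stand-in for a real file: 2 words, 3 dimensions each.
sample = io.StringIO(
    "2 3\n"
    "the 0.476841 -0.620207 -0.002157\n"
    "and 0.223408 0.231993 -0.231131\n"
)
vectors = read_word2vec_text(sample)
```

Checking every line against the header is what catches a file whose header is missing or wrong, which is the usual symptom of loading a raw GloVe dump without converting it first.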
Chaitanya Shivade explains in his answer here how to use a script provided by Gensim to convert the GloVe format (each line: word + vector) into the generic format.
Loading the different formats is easy, but it is also easy to get them mixed up:
import gensim
model_file = 'path/to/model/file'
1) Loading binary word2vec
model = gensim.models.word2vec.Word2Vec.load_word2vec_format(model_file, binary=True)
2) Loading binary fastText
model = gensim.models.fasttext.FastText.load_fasttext_format(model_file)
3) Loading the generic plain text format (which has been introduced by word2vec)
model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file)
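Since the text formats are easy to confuse, a quick check of the first line can tell whether a text file already carries the word2vec header or is a headerless, GloVe-style dump. The helper below is a hypothetical sketch, not part of Gensim. It assumes a header is exactly two integers and does not try to distinguish the binary formats, since a binary word2vec file starts with the same kind of header line.

```python
def has_word2vec_header(first_line):
    """Return True if the line looks like a 'vocab_size vector_size' header."""
    parts = first_line.split()
    if len(parts) != 2:
        return False
    try:
        # Both fields must be integers; a vector line would contain floats.
        int(parts[0])
        int(parts[1])
    except ValueError:
        return False
    return True


with_header = has_word2vec_header("20000 100\n")
without_header = has_word2vec_header("the 0.476841 -0.620207 -0.002157\n")
```

A one-dimensional vector whose word is itself a number with an integer-valued component (for example `7 3`) would be misread as a header, so treat the result as a hint rather than a guarantee.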
If you only plan to use the word embeddings and not to continue training them in Gensim, you may want to use the KeyedVectors class. This considerably reduces the amount of memory needed to load the vectors (detailed explanation).
The following will load the binary word2vec format as KeyedVectors:
model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file, binary=True)