I’m a Literature grad student, and I’ve been going through the O’Reilly book on Natural Language Processing (nltk.org/book). It looks incredibly useful. I’ve played around with all the example texts and example tasks in Chapter 1, like concordances. I now know how many times Moby Dick uses the word “whale.” The problem is, I can’t figure out how to do these calculations on one of my own texts. I’ve found information on how to create my own corpora (Ch. 2 of the O’Reilly book), but I don’t think that’s exactly what I want to do. In other words, I want to be able to do

import nltk 

and get the places where the word ‘yellow’ is used in my text. At the moment I can do this with the example texts, but not my own.

I’m very new to Python and programming, so this stuff is very exciting, but also very confusing.

Found the answer myself. That’s embarrassing. Or awesome.

From Ch. 3:

# raw holds your whole text as a single string, e.g. the contents of a file
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)

Does the trick.
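
To make the whole round trip explicit, here is a minimal sketch that goes from a file on disk to a concordance. The file name yellow_sample.txt and its contents are made up so the example runs anywhere; point `path` at your own file instead. It uses `wordpunct_tokenize`, a simpler regex-based tokenizer, only because it needs no extra NLTK data download. `nltk.word_tokenize` should drop in once the Punkt tokenizer data has been fetched with `nltk.download`.

```python
import os
import tempfile

import nltk
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample file, created here only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "yellow_sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("The yellow wallpaper was torn. She hated the yellow smell.")

# Read the whole file into one string.
with open(path, encoding="utf-8") as f:
    raw = f.read()

tokens = wordpunct_tokenize(raw)     # split into words and punctuation marks
text = nltk.Text(tokens)             # same kind of object as the book's text1, text2, ...

text.concordance("yellow")           # prints each occurrence with its context
yellow_count = text.count("yellow")  # exact-match count of the token
print(yellow_count)
```

One thing to watch: `concordance` matches case-insensitively, while `count` only counts exact matches, so "Yellow" at the start of a sentence shows up in the former but not the latter.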

For a structured import of multiple files:

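import nltk
from nltk.corpus import PlaintextCorpusReader

# Regex (or a list of file names) selecting the files to load
files = r".*\.txt"

# "/path/" stands for the directory that holds your text files
corpus0 = PlaintextCorpusReader("/path/", files)
corpus  = nltk.Text(corpus0.words())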

See the NLTK 3 book, Chapter 2, section 1.9 (Loading your own Corpus).
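
If you want results per file rather than one merged text, the reader can also be queried file by file. A rough sketch, assuming a throwaway directory with two made-up files (a.txt and b.txt); swap in your own folder for `root`:

```python
import os
import tempfile

import nltk
from nltk.corpus import PlaintextCorpusReader

# Hypothetical corpus directory with two tiny files, for illustration only.
root = tempfile.mkdtemp()
samples = {"a.txt": "Yellow is a colour.", "b.txt": "The whale was not yellow."}
for name, content in samples.items():
    with open(os.path.join(root, name), "w", encoding="utf-8") as f:
        f.write(content)

reader = PlaintextCorpusReader(root, r".*\.txt")
file_ids = reader.fileids()          # names of the files the regex matched

# Build one nltk.Text per file and count a word in each.
counts = {}
for fid in file_ids:
    text = nltk.Text(reader.words(fid))
    counts[fid] = text.count("yellow")

print(file_ids)
print(counts)
```

Because `count` is case-sensitive, the capitalised "Yellow" in a.txt is not counted here; lowercasing the tokens first is one way around that.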

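If your text file is UTF-8 encoded, try the following variation (Python 3), which states the encoding explicitly when reading the file:

raw = open("my_text.txt", encoding="utf-8").read()  # "my_text.txt" is a placeholder file name
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)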