I have a set of documents, and I want to return a list of tuples where each tuple has the date of a given document and the number of times a given search term appears in that document. My code (below) works, but is slow, and I'm a n00b. Are there obvious ways to make this faster? Any help would be much appreciated, mostly so that I can learn better coding, but also so that I can get this project done faster!

import nltk
from nltk.corpus import PlaintextCorpusReader

def searchText(searchword):
    counts = []
    corpus_root = 'some_dir'
    wordlists = PlaintextCorpusReader(corpus_root, '.*')
    for file_id in wordlists.fileids():
        # Characters 4-11 of the file name hold the date as YYYYMMDD.
        date = file_id[4:12]
        month = date[-4:-2]
        day = date[-2:]
        year = date[:4]
        raw = wordlists.raw(file_id)
        tokens = nltk.word_tokenize(raw)
        text = nltk.Text(tokens)
        count = text.count(searchword)
        counts.append((month, day, year, count))

    return counts
A: 

Run a profiler, such as profile or cProfile from the standard library, and see where the time is actually being spent.
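
A minimal sketch of what that could look like with cProfile. The helper profile_call and the stand-in function slow_sum are made up for illustration and are not part of the original code; in practice you would pass searchText and its argument instead.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def slow_sum(n):
    # Hypothetical stand-in for searchText; replace with the real call.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    result, report = profile_call(slow_sum, 100_000)
    print(report)
```

If tokenizing dominates the report, that is the part worth optimizing; if file reading dominates, the tokenizer is not the problem.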

John
+1  A: 

If you just want word frequencies, then you don't need to create nltk.Text objects, or even use PlaintextCorpusReader. Instead, just go straight to nltk.FreqDist.

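import nltk

files = list_of_files  # placeholder: an iterable of file paths
fd = nltk.FreqDist()
for file in files:
    with open(file) as f:
        # Read the contents first; a file object has no lower() method.
        for sent in nltk.sent_tokenize(f.read().lower()):
            for word in nltk.word_tokenize(sent):
                fd[word] += 1  # NLTK 2 spelled this fd.inc(word)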

Or, if you don't need any further analysis, just use a dict.

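import nltk

files = list_of_files  # placeholder: an iterable of file paths
fd = {}
for file in files:
    with open(file) as f:
        # Read the contents first; a file object has no lower() method.
        for sent in nltk.sent_tokenize(f.read().lower()):
            for word in nltk.word_tokenize(sent):
                try:
                    fd[word] = fd[word] + 1
                except KeyError:
                    fd[word] = 1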

These could be made more compact, and perhaps somewhat faster, with generator expressions, but I've used for loops for readability.
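
As a rough sketch of the generator-expression style, applied to the original question (one count per document, plus the date taken from the file name): everything here is an assumption for illustration. simple_tokenize is a crude regex stand-in so the example runs without NLTK data, and the file ids like 'doc_20100315.txt' are invented to match the id[4:12] slicing in the question. You could pass nltk.word_tokenize as the tokenize argument instead.

```python
import re
from collections import Counter


def simple_tokenize(text):
    # Assumption: a crude regex tokenizer standing in for nltk.word_tokenize.
    return re.findall(r"[a-z0-9']+", text.lower())


def word_frequencies(texts, tokenize=simple_tokenize):
    """Count every word across all texts with one generator expression."""
    return Counter(tok.lower() for text in texts for tok in tokenize(text))


def count_term(text, searchword, tokenize=simple_tokenize):
    """Count one search term without building a full frequency table."""
    target = searchword.lower()
    return sum(1 for tok in tokenize(text) if tok.lower() == target)


def search_documents(docs, searchword, tokenize=simple_tokenize):
    """docs maps a file id (hypothetical format 'doc_YYYYMMDD.txt') to raw text.

    Returns (month, day, year, count) tuples, as in the question.
    """
    results = []
    for file_id, raw in docs.items():
        date = file_id[4:12]  # same slice as the question's code
        count = count_term(raw, searchword, tokenize)
        results.append((date[4:6], date[6:8], date[:4], count))
    return results


if __name__ == "__main__":
    docs = {
        "doc_20100315.txt": "The cat saw the other cat.",
        "doc_20091201.txt": "No felines here.",
    }
    print(search_documents(docs, "cat"))
    print(word_frequencies(docs.values()).most_common(3))
```

Counting only the search term avoids building an nltk.Text object or a full frequency table per document, which is all the question actually needs.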

Tim McNamara
Thanks, this is really helpful!
Mark Bellhorn
Am glad to help. Welcome to StackOverflow!
Tim McNamara