Hi, I have a list of approximately 300 words and a huge amount of text that I want to scan to count how many times each word appears.

I am currently using Python's re module:

import re

for word in list_word:
    # re.escape() guards against regex metacharacters in the word
    search = re.compile(r"""(\s|,)(%s).?(\s|,|\.|\))""" % re.escape(word))
    occurrences = search.subn("", text)[1]  # subn() returns (new_text, count)

Is there a more efficient or more elegant way of doing this?

A: 

Googling "python frequency" gives me this page as the first result: http://www.daniweb.com/code/snippet216747.html

That seems to be what you're looking for.

Joubert Nel
It's unpythonic with all these regexes. Splitting into separate words is best achieved with str.split() rather than a custom regex.
Daniel Kluev
You're right; if the Python string functions are sufficient, they should be used in lieu of regex.
Joubert Nel
A: 

You can also split the text into words and search the resulting list.
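For illustration, a minimal sketch of that idea (simple, though `list.count()` rescans the whole token list once per word):

tokens = text.split()
occurrences = {}
for word in list_word:
    occurrences[word] = tokens.count(word)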

Assaf Lavie
A: 

Regular expressions may not be what you want. Python has a number of built-in string operations that are much faster, and I believe .count() has what you need.

http://docs.python.org/library/stdtypes.html#string-methods
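For illustration, a minimal sketch of this approach (note that `str.count()` counts substring hits, so "the" would also match inside "there"):

occurrences = {}
for word in list_word:
    # str.count() counts non-overlapping substring occurrences
    occurrences[word] = text.count(word)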

thebackhand
+5  A: 

If you have a huge amount of text, I wouldn't use regexps in this case but would simply split the text:

words = {"this": 0, "that": 0}
for w in text.split():
  if w in words:
    words[w] += 1

`words` will then give you the frequency for each word.
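A one-pass sketch of the same idea using `collections.Counter` (assuming Python 2.7+ is available):

from collections import Counter

wanted = set(list_word)  # the ~300 words to track
counts = Counter(w for w in text.split() if w in wanted)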

Adam Schmideg
Definitely more efficient to only scan the text once. Code snippet above just seems to be missing the check that the word is one of the 300 "important" ones.
pdbartlett
@pdbartlett `if w in words` makes that check.
Wilduck
Splitting on whitespace isn't always going to lead to perfect results. If you need sophisticated splitting, you can take a look at NLTK, which has been suggested below.
Tim McNamara
@wilduck: Of course - not sure how I missed that :o
pdbartlett
+1  A: 

Try stripping all the punctuation from your text and then splitting on whitespace. Then simply do:

occurrences = {}
for word in list_word:
    occurrences[word] = strippedText.count(word)

Or if you're using Python 3.0, I think you could do:

occurrences = {word: strippedText.count(word) for word in list_word}
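
The stripping step itself isn't shown; a minimal sketch (assuming Python 2.x byte strings, since `str.translate` has a different signature in Python 3):

import string

# remove all punctuation characters, then count as above
strippedText = text.translate(None, string.punctuation)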
jacobangel
In 2.6 <= Python < 3.0 you can do `occurrences = dict((word, strippedText.count(word)) for word in list_word)`
Wilduck
A: 

If Python is not a must, you can use awk:

$ cat file
word1
word2
word3
word4

$ cat file1
blah1 blah2 word1 word4 blah3 word2
junk1 junk2 word2 word1 junk3
blah4 blah5 word3 word6 end

$ awk 'FNR==NR{w[$1];next} {for(i=1;i<=NF;i++) a[$i]++}END{for(i in w){ if(i in a) print i,a[i] } } ' file file1
word1 2
word2 2
word3 1
word4 1
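
For comparison, a rough Python sketch of the same two-pass idea (using the hypothetical `file` and `file1` from above):

# first pass: collect the words to look for
with open('file') as f:
    wanted = set(line.strip() for line in f)

# second pass: tally matches field by field, like the awk loop
counts = {}
with open('file1') as f:
    for line in f:
        for w in line.split():
            if w in wanted:
                counts[w] = counts.get(w, 0) + 1

for w in sorted(counts):
    print w, counts[w]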
ghostdog74
A: 

It sounds to me like the Natural Language Toolkit might have what you need.

http://www.nltk.org/

Glenjamin
Specifically the `nltk.FreqDist` class.
Tim McNamara
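For illustration, a minimal sketch of the `nltk.FreqDist` approach (assuming NLTK and its tokenizer data are installed):

import nltk

tokens = nltk.word_tokenize(text)   # smarter splitting than plain str.split()
freq = nltk.FreqDist(tokens)        # maps each token to its count over the text
occurrences = dict((w, freq[w]) for w in list_word)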
A: 

Maybe you could adapt my multisearch generator function.

from itertools import islice

testline = "Sentence 1.  Sentence 2?  Sentence 3!  Sentence 4.  Sentence 5."

def multis(search_sequence, text, start=0):
    """ Multisearch by the given search sequence values in text, starting from
        position start, yielding tuples of (text before the sequence item,
        the found sequence item). """
    x = ''
    for ch in text[start:]:
        if ch in search_sequence:
            yield (x, ch)  # always yield a tuple, even when x is empty
            x = ''
        else:
            x += ch
    if x:
        yield (x, '')  # trailing text with no terminating separator

# split the first two sentences on '.', '?' or '!'
two_sentences = list(islice(multis('.?!', testline), 2))  # must save the generator's output
print "result of split: ", two_sentences

print '\n'.join(sentence.strip() + sep for sentence, sep in two_sentences)
Tony Veijalainen