Hi, I have a list of approximately 300 words and a huge amount of text that I want to scan to count how many times each word appears.

I am currently using Python's re module:

import re

for word in list_word:
    # re.escape() guards against regex metacharacters in the word
    search = re.compile(r"""(\s|,)(%s).?(\s|,|\.|\))""" % re.escape(word))
    occurrences = search.subn("", text)[1]  # subn() returns (new_text, count)

Is there a more efficient or more elegant way of doing this?

A: 

Googling "python frequency" gives me this page as the first result: http://www.daniweb.com/code/snippet216747.html

That seems to be what you're looking for.

Joubert Nel
It's unpythonic with all these regexes. Splitting into separate words is best achieved with str.split() rather than a custom regex.
Daniel Kluev
You're right; if the Python string functions are sufficient, they should be used in lieu of regex.
Joubert Nel
A: 

You can also split the text into words and search the resulting list.
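For illustration, a minimal sketch of that idea (simple, though `list.count()` rescans the whole token list once per word):

tokens = text.split()
occurrences = {}
for word in list_word:
    occurrences[word] = tokens.count(word)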

Assaf Lavie
A: 

Regular expressions may not be what you want. Python has a number of built-in string operations that are much faster, and I believe .count() has what you need.

http://docs.python.org/library/stdtypes.html#string-methods
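For illustration, a minimal sketch of this approach (note that `str.count()` counts substring hits, so "the" would also match inside "there"):

occurrences = {}
for word in list_word:
    # str.count() counts non-overlapping substring occurrences
    occurrences[word] = text.count(word)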

thebackhand
+5  A: 

If you have a huge amount of text, I wouldn't use regexps in this case but would simply split the text:

words = {"this": 0, "that": 0}
for w in text.split():
  if w in words:
    words[w] += 1

`words` will then give you the frequency for each word.
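A one-pass sketch of the same idea using `collections.Counter` (assuming Python 2.7+ is available):

from collections import Counter

wanted = set(list_word)  # the ~300 words to track
counts = Counter(w for w in text.split() if w in wanted)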

Adam Schmideg
Definitely more efficient to only scan the text once. Code snippet above just seems to be missing the check that the word is one of the 300 "important" ones.
pdbartlett
@pdbartlett `if w in words` makes that check.
Wilduck
Splitting on whitespace isn't always going to lead to perfect results. If you need sophisticated splitting, you can take a look at NLTK, which has been suggested below.
Tim McNamara
@wilduck: Of course - not sure how I missed that :o
pdbartlett
+1  A: 

Try stripping all the punctuation from your text and then splitting on whitespace. Then simply do:

occurrences = {}
for word in list_word:
    occurrences[word] = strippedText.count(word)

Or if you're using Python 3.0, I think you could do:

occurrences = {word: strippedText.count(word) for word in list_word}
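
The stripping step itself isn't shown; a minimal sketch (assuming Python 2.x byte strings, since `str.translate` has a different signature in Python 3):

import string

# remove all punctuation characters, then count as above
strippedText = text.translate(None, string.punctuation)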
jacobangel
In 2.6 <= Python < 3.0 you can do `occurrences = dict((word, strippedText.count(word)) for word in list_word)`
Wilduck
A: 

If Python is not a must, you can use awk:

$ cat file
word1
word2
word3
word4

$ cat file1
blah1 blah2 word1 word4 blah3 word2
junk1 junk2 word2 word1 junk3
blah4 blah5 word3 word6 end

$ awk 'FNR==NR{w[$1];next} {for(i=1;i<=NF;i++) a[$i]++}END{for(i in w){ if(i in a) print i,a[i] } } ' file file1
word1 2
word2 2
word3 1
word4 1
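
For comparison, a rough Python sketch of the same two-pass idea (using the hypothetical `file` and `file1` from above):

# first pass: collect the words to look for
with open('file') as f:
    wanted = set(line.strip() for line in f)

# second pass: tally matches field by field, like the awk loop
counts = {}
with open('file1') as f:
    for line in f:
        for w in line.split():
            if w in wanted:
                counts[w] = counts.get(w, 0) + 1

for w in sorted(counts):
    print w, counts[w]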
ghostdog74
A: 

It sounds to me like the Natural Language Toolkit might have what you need.

http://www.nltk.org/

Glenjamin
Specifically the `nltk.FreqDist` class.
Tim McNamara
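For illustration, a minimal sketch of the `nltk.FreqDist` approach (assuming NLTK and its tokenizer data are installed):

import nltk

tokens = nltk.word_tokenize(text)   # smarter splitting than plain str.split()
freq = nltk.FreqDist(tokens)        # maps each token to its count over the text
occurrences = dict((w, freq[w]) for w in list_word)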
A: 

Maybe you could adapt my multisearch generator function.

from itertools import islice

testline = "Sentence 1.  Sentence 2?  Sentence 3!  Sentence 4.  Sentence 5."

def multis(search_sequence, text, start=0):
    """ Multisearch by the given search sequence values in text, starting from
        position start, yielding tuples of (text before the sequence item,
        the found sequence item). """
    x = ''
    for ch in text[start:]:
        if ch in search_sequence:
            yield (x, ch)  # always yield a tuple, even when x is empty
            x = ''
        else:
            x += ch
    if x:
        yield (x, '')  # trailing text with no terminating separator

# split the first two sentences on '.', '?' or '!'
two_sentences = list(islice(multis('.?!', testline), 2))  # must save the generator's output
print "result of split: ", two_sentences

print '\n'.join(sentence.strip() + sep for sentence, sep in two_sentences)
Tony Veijalainen