Basically as the question states. I am fairly new to Python and like to learn by seeing and doing.

I would like to create a script that searches through a text document (say the text copied and pasted from a news article for example) for certain words or phrases. Ideally, the list of words and phrases would be stored in a separate file.

When getting the results, it would be great to get the context of the results. So maybe it could print out the 50 characters in the text file before and after each search term that has been found. It'd be cool if it also showed what line the search term was found on.

Any pointers on how to code this, or even code examples would be much appreciated.

+1  A: 

Start with something like this. This code is not an exact solution for the specification you have, but it is a good starting point.

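import sys

words = "foo bar baz frob"

word_set = set(words.split())
# Count lines from 1 so the output matches what a text editor shows.
for line_number, line in enumerate(open(sys.argv[1]), 1):
    # A non-empty intersection means at least one sought word is on this line.
    if word_set.intersection(line.split()):
        print("%d:%s" % (line_number, line.strip()))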

Some explanations below:

  • The words being sought are stored in a string initially (in line 3). I split this wordlist along whitespaces and create a set out of it so it is easier to check whether any of the words in the current line are to be found in the wordlist. (Membership check on a set is O(1), while it is O(n) on a list).

  • In the main for loop, I open the input file (which is passed as a command line argument) and use the enumerate built-in function to get a line number counter as well as the actual line. sys.argv is a list storing the command line arguments; sys.argv[0] is always the name of the Python script.

  • In the loop itself, I take the current line and split it into individual words. Then I can quickly take the intersection of the words in the current line with the set of words I am looking for (intersection accepts any iterable, so the split line does not have to be converted to a set first). If the intersection has a logical True value (i.e. if it is not empty), I print the line number as well as the line.
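To make the intersection check concrete, here is a minimal sketch. The helper name line_has_hit and the sample lines are made up for illustration; they are not part of the script above.

```python
# Hypothetical sample data, only for illustration.
word_set = set("foo bar baz frob".split())

def line_has_hit(line, wanted):
    """Return True if any whitespace-separated token of line is in wanted."""
    # set.intersection accepts any iterable, so the list from split() is fine.
    return bool(wanted.intersection(line.split()))

print(line_has_hit("a line with foo in it", word_set))         # True
print(line_has_hit("nothing to see here", word_set))           # False
print(line_has_hit("punctuation hides foo, sadly", word_set))  # False, the token is "foo,"
```

The last call shows a limitation of splitting on whitespace: punctuation stays attached to the word, so "foo," is not the same token as "foo".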

Things that are not solved yet (and left up to you):

  • The list of words is now hard-coded in the source, but it should not be too hard to open an extra file (whose name is passed in, say, sys.argv[2]), read its words one by one and store them in a set. Note that you can extend sets by their add and update methods (instead of append and extend which work for lists).

  • Obviously the above method does not work if you have phrases instead of words (as pointed out in one of the comments). As I assume that you want to learn and you don't need an exact solution, I will only say that if you have phrases in a set, you can check whether any of the set elements is in a line by saying any(phrase in line for phrase in set_of_phrases). This can be used in place of the set intersection (and of course don't split your line into words in this case).

  • If you want to print the context of the hits, you can use two extra variables (say, prev_line and next_line) that store the previous line and the next line. In the for loop, you will actually be reading next_line instead of line, and at the end of the for loop, you should take care of copying line into prev_line and next_line into line.

  • An even more Pythonic way of keeping track of the previous and the next line as well is to create a Python generator function that yields a tuple consisting of item i-1, item i and item i+1 for each i given an iterable (like a file). This is more advanced stuff, though, and as you are fairly new to Python, I think it's best to leave it for later. However, if you are curious, a generator function doing this task might look like this:

    def context_generator(iterable):
        prev, current, next = None, None, None
        for element in iterable:
            prev, current, next = current, next, element
            if current is not None:
                yield prev, current, next
        if next is not None:
            yield current, next, None
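One possible way to use such a generator together with the any(...) phrase check mentioned earlier is sketched below. Treat it as a rough sketch, not a finished solution: find_with_context, the sample lines and the phrase set are all hypothetical, and the generator is repeated (with next renamed to following so it does not shadow the built-in) only so the snippet runs on its own.

```python
def context_generator(iterable):
    """Yield (previous, current, following) for each item; None marks a missing neighbour."""
    prev, current, following = None, None, None
    for element in iterable:
        prev, current, following = current, following, element
        if current is not None:
            yield prev, current, following
    if following is not None:
        yield current, following, None

def find_with_context(lines, phrases):
    """Return (line_number, prev, line, following) for each line containing any phrase."""
    hits = []
    for number, (prev, line, following) in enumerate(context_generator(lines), start=1):
        # Plain substring test, so this also matches inside longer words.
        if any(phrase in line for phrase in phrases):
            hits.append((number, prev, line, following))
    return hits

# Hypothetical sample text and phrases, for illustration only.
sample = ["first line", "the quick brown fox", "last line"]
for number, prev, line, following in find_with_context(sample, {"quick brown"}):
    print(number, repr(prev), repr(line), repr(following))
```

With a real file you would pass the open file object instead of the sample list; each line then still carries its trailing newline, so you may want to strip it before printing.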
    
Tamás
to *open the input file* you need to use `open`.
SilentGhost
sure, my bad, thanks.
Tamás
This also doesn't work with phrases, just individual words.
FogleBird
also you don't need to convert words in line into set, it could be done internally by `word_set.intersection(line.split())`
SilentGhost
@FogleBird @SilentGhost: thanks for the comments. I've taken some sort of an "iterative" approach and I was improving my answer after sending it. Your suggestions have been included in my answer. As for phrases, I don't want to give an exact out-of-the-box solution as I feel it's better if the original poster figures it out himself, using my answer as a guideline only. I have mentioned the case of phrases in one of the bullet points.
Tamás
Thanks for the detailed reply, exactly what I needed - will read with interest.
prupert
+1  A: 

Despite the frequently expressed antipathy for Regular Expressions on the part of many in the Python community, they're really a precious tool for the appropriate use cases -- which definitely include identifying words and phrases, thanks to the \b "word boundary" element in regular expression patterns. Alternatives based on plain string processing are much more of a problem: e.g., .split() uses whitespace as the separator and thus annoyingly leaves punctuation attached to the words adjacent to it, etc, etc.

If RE's are OK, I would recommend something like:

import re
import sys

def main():
  if len(sys.argv) != 3:
    print("Usage: %s fileofstufftofind filetofinditin" % sys.argv[0])
    sys.exit(1)

  with open(sys.argv[1]) as f:
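    # One whole-word pattern per non-blank line of the terms file.
    patterns = [r'\b%s\b' % re.escape(s.strip()) for s in f if s.strip()]
  # Join the alternatives with | and compile them into a single regex object.
  there = re.compile('|'.join(patterns))

  with open(sys.argv[2]) as f:
    # Count lines from 1 so the output matches what a text editor shows.
    for i, s in enumerate(f, 1):
      if there.search(s):
        print("Line %s: %r" % (i, s))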

main()

Here the first argument is (the path of) a text file with words or phrases to find, one per line, and the second argument is (the path of) a text file in which to find them. It's easy, if desired, to make the search case-insensitive (perhaps just optionally based on a command line option switch), etc, etc.

Some explanation for readers that are not familiar with REs...:

The \b item in the pattern items ensures that there will be no accidental matches (if you're searching for "cat" or "dog", you won't see an accidental hit with "catalog" or "underdog"; and you won't miss a hit in "The cat, smiling, ran away" by some splitting thinking that the word there is "cat," including the comma;-).

The | item means or, so e.g. from a text file with contents (two lines)

cat
dog

this will form the pattern '\bcat\b|\bdog\b' which will locate either "cat" or "dog" (as stand-alone words, ignoring punctuation, but rejecting hits within longer words).
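A quick way to convince yourself of this is to try the pattern on a few strings. The helper name matches and the sample strings below are just for illustration:

```python
import re

# The pattern formed from the two-line file with "cat" and "dog".
pattern = re.compile(r'\bcat\b|\bdog\b')

def matches(text):
    """Return True if "cat" or "dog" occurs in text as a stand-alone word."""
    return pattern.search(text) is not None

for text in ["The cat, smiling, ran away", "see the catalog",
             "an underdog story", "hot dog!"]:
    print("%r -> %s" % (text, matches(text)))
```

The first and last strings match; "catalog" and "underdog" do not, because there is no word boundary on one side of the embedded word.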

The re.escape escapes punctuation so it's matched literally, not with special meaning as it would normally have in a RE pattern.
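As for the case-insensitive search and the "50 characters before and after" from the question, one possible approach is sketched below. The names build_pattern and iter_matches, the width parameter and the sample data are my own assumptions for illustration, and the context is cut from the matching line only, not from neighbouring lines.

```python
import re

def build_pattern(terms, ignore_case=True):
    """Compile one alternation of whole-word patterns from an iterable of terms."""
    cleaned = [t.strip() for t in terms if t.strip()]
    if not cleaned:
        raise ValueError("no search terms given")
    # Longest first, so a longer phrase wins over a shorter term it starts with.
    cleaned.sort(key=len, reverse=True)
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile('|'.join(r'\b%s\b' % re.escape(t) for t in cleaned), flags)

def iter_matches(lines, pattern, width=50):
    """Yield (line_number, matched_text, snippet) for every match in lines."""
    for number, line in enumerate(lines, start=1):
        for m in pattern.finditer(line):
            start = max(0, m.start() - width)  # do not run off the start of the line
            end = m.end() + width              # slicing past the end is harmless
            yield number, m.group(0), line[start:end].strip()

# Hypothetical sample data; in a real script both would come from open files.
terms = ["cat\n", "hot dog\n", "\n"]
text = ["The Cat sat.", "No match here", "A hot dog and a cat"]
for number, found, snippet in iter_matches(text, build_pattern(terms), width=5):
    print("Line %d: %r in %r" % (number, found, snippet))
```

Using finditer instead of search reports every hit on a line, and the match object's start() and end() give the positions needed to slice out the surrounding characters.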

Alex Martelli
Again, thanks for the ace reply - some code with an explanation is very helpful. I had wondered about RE, but wasn't sure if it was relevant in this case - good to see it is!
prupert