Despite the antipathy for regular expressions frequently expressed by many in the Python community, they're really a precious tool for the appropriate use cases, which definitely include identifying words and phrases, thanks to the \b "word boundary" element in regular expression patterns. String-processing alternatives are much more of a problem: e.g., .split() uses whitespace as the separator and thus annoyingly leaves punctuation attached to the words adjacent to it.
If REs are OK, I would recommend something like:
    import re
    import sys

    def main():
        if len(sys.argv) != 3:
            print("Usage: %s fileofstufftofind filetofinditin" % sys.argv[0])
            sys.exit(1)
        # One escaped, \b-delimited alternative per non-blank line.
        with open(sys.argv[1]) as f:
            patterns = [r'\b%s\b' % re.escape(s.strip()) for s in f if s.strip()]
        # Compile the alternation so that .search() is available.
        there = re.compile('|'.join(patterns))
        with open(sys.argv[2]) as f:
            for i, s in enumerate(f):  # i counts lines from 0
                if there.search(s):
                    print("Line %s: %r" % (i, s))

    main()
The first argument is (the path of) a text file with words or phrases to find, one per line, and the second argument is (the path of) a text file in which to find them. It's easy, if desired, to make the search case-insensitive (perhaps just optionally, based on a command-line switch), etc.
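As a rough sketch of the case-insensitive variant mentioned above: the helper name build_matcher and its ignore_case parameter are made up for illustration, and only re.IGNORECASE, re.escape and re.compile come from the standard library. How the flag gets set from the command line is left open here.

```python
import re

def build_matcher(phrases, ignore_case=False):
    """Compile one alternation pattern from an iterable of words/phrases.

    build_matcher and ignore_case are hypothetical names for this sketch.
    """
    flags = re.IGNORECASE if ignore_case else 0
    # Skip blank entries; escape each phrase and wrap it in word boundaries.
    parts = [r'\b%s\b' % re.escape(p.strip()) for p in phrases if p.strip()]
    return re.compile('|'.join(parts), flags)

strict = build_matcher(["cat", "dog"])
relaxed = build_matcher(["cat", "dog"], ignore_case=True)

print(bool(strict.search("The Cat sat.")))   # False
print(bool(relaxed.search("The Cat sat.")))  # True
```

In the script above, the same effect would presumably come from passing re.IGNORECASE as the second argument to re.compile.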
Some explanation for readers who are not familiar with REs:
The \b item in each of the patterns ensures that there will be no accidental matches (if you're searching for "cat" or "dog", you won't see an accidental hit with "catalog" or "underdog"; and you won't miss a hit in "The cat, smiling, ran away" because some splitting approach thinks the word there is "cat," including the comma ;-).
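To make the contrast concrete, here is a small comparison of whitespace splitting against a \b-delimited pattern. The sample sentence is invented for the demonstration.

```python
import re

text = "The cat, smiling, ran away from the catalog."

# Whitespace splitting keeps the comma glued to the word.
tokens = text.split()
print("cat" in tokens)   # False
print("cat," in tokens)  # True

# \b matches only between a word character and a non-word character,
# so "cat," is a hit while the "cat" inside "catalog" is not.
word = re.compile(r'\bcat\b')
print([m.start() for m in word.finditer(text)])  # [4]
```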
The | item means "or", so e.g. from a text file with contents (two lines)
    cat
    dog
this will form the pattern '\bcat\b|\bdog\b', which will locate either "cat" or "dog" (as stand-alone words, ignoring punctuation, but rejecting hits within longer words).
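The pattern construction can be checked in isolation. In this sketch the two-line file is stood in for by a list of strings (an assumption made to keep the example self-contained), and the sample inputs are invented.

```python
import re

# What iterating over the two-line file would yield.
lines = ["cat\n", "dog\n"]

pattern = '|'.join(r'\b%s\b' % re.escape(s.strip()) for s in lines)
print(pattern)  # \bcat\b|\bdog\b

finder = re.compile(pattern)
for sample in ["a dog!", "the catalog", "underdog", "cat's cradle"]:
    # Only the stand-alone words ("a dog!" and "cat's cradle") are hits.
    print(sample, bool(finder.search(sample)))
```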
The re.escape call escapes punctuation so it's matched literally, not with the special meaning it would normally have in an RE pattern.
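A quick illustration of what the escaping buys you, using an invented phrase "a.b" whose dot would otherwise act as a wildcard:

```python
import re

phrase = "a.b"

unescaped = re.compile(r'\b%s\b' % phrase)
escaped = re.compile(r'\b%s\b' % re.escape(phrase))

print(bool(unescaped.search("axb")))        # True: "." matches any character
print(bool(escaped.search("axb")))          # False: the dot must be literal
print(bool(escaped.search("see a.b here"))) # True

# Caveat: \b needs a word character on one side, so a phrase that ends in
# punctuation (e.g. "C++") will not match when followed by a space.
cpp = re.compile(r'\b%s\b' % re.escape("C++"))
print(bool(cpp.search("C++ rocks")))        # False
```

If phrases that start or end with punctuation matter for your data, the \b wrapping would need to be adapted; that is outside what the script above attempts.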