Hello all! I am new to Python (and this site). I am trying to write a script that uses a regular expression to search a given file for a name. I have to print the different ways the name was capitalized and how many times the name was found. My current code just prints my first flag and then hangs. I don't know whether my for loop or my regex is wrong. Thanks for your time!

import re
import sys
if __name__ == '__main__':
    print "flag"
    for line in sys.stdin:
        print(line) 
        name_count = re.search("[snyder]", line, re.I)
        variation = set(re.search(r"([snyder])", line, re.I))
    print "flag2"
    print len(name_count), variation
+3  A: 

It's not hanging, it's waiting for input on stdin! Replace for line in sys.stdin with something like:

import fileinput

for line in fileinput.input("sample.txt"):
    print(line)  # rest of the loop body as before
ennuikiller
thanks! this worked for me.
amazinghorse24
+4  A: 

First I would say to take a look at this thread for a bit of information about reading from stdin (if that's really what you want to do).

Second, I would consider just opening the file instead of reading from sys.stdin, either using a library like fileinput or a with statement or other file handle.
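
As a rough sketch of the with-statement option (the helper name read_lines and the throwaway demo file are made up for illustration, not part of the original script), reading a file line by line could look like this:

```python
import os
import tempfile


def read_lines(path):
    """Yield each line of the file at `path` without its trailing newline."""
    # The with statement closes the file once the loop finishes or fails.
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n")


# Demo: write a small throwaway file, then read it back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Snyder was here\nno match on this line\n")

demo_lines = list(read_lines(tmp.name))
os.remove(tmp.name)
print(demo_lines)
```

In a real script you would pass your own file name to read_lines() and run the regex on each line inside the loop.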

Next, I would add that your regular expression probably isn't going to do what you expect it to. The expression [snyder] is a character class, which matches exactly one character out of those listed in the class. In other words, this will match the individual letters s, n, y, d, e, or r (in either case, since you pass re.I). If you want to match the literal string snyder then you should just use that as your expression: re.search("snyder", line, re.I). Or, if you don't want substring matches (cases where snyder might appear within another string), you can try the regex \bsnyder\b.
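
To make the difference concrete, here is a small sketch comparing the three patterns on a made-up sample line (the sample text is only an assumption for illustration):

```python
import re

line = "Snyder, McSnyderson, SNYDER"

# Character class: every match is ONE character from s, n, y, d, e, r.
char_class = re.findall(r"[snyder]", line, re.I)

# Literal string: matches the whole name, even inside a longer word.
literal = re.findall(r"snyder", line, re.I)

# Word boundaries: matches the name only where it stands as its own word.
whole_word = re.findall(r"\bsnyder\b", line, re.I)

print(char_class)  # a long list of single letters
print(literal)     # ['Snyder', 'Snyder', 'SNYDER']
print(whole_word)  # ['Snyder', 'SNYDER']
```

The middle hit in the literal case comes from inside McSnyderson, which is exactly the kind of substring match that \b filters out.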


Edit re: your comment - Two things I'll point out here:

1) While [s][n][y][d][e][r] is semantically equivalent to snyder, you might want to consider using the latter for the sake of readability. A character class of one character is equivalent to that one character alone (as long as it's properly escaped and so forth if necessary). Yours will work, so that's just a suggestion/heads-up.

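2) Try using re.findall() in place of re.search(). search() returns a single match object (or None), so it cannot be passed to set() or len(); findall() returns a list of every match on the line. I think you'll get what you want with something like: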

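import fileinput
import re

variations = []
for line in fileinput.input():
    # findall() returns every match on the line as a list (empty if none)
    found = re.findall(r"snyder", line, re.I)
    variations += found
var_set = set(variations)
print var_set       # the distinct capitalizations
print len(var_set)  # number of distinct capitalizations; len(variations) is the total number of matches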

An example of what this will do:

>>> print sl 
['blah', 'blah', 'what', 'is', 'this', 'BLAh', 'some', 'random', 'bLah', 'text', 'a longer BlaH string', 'a BLAH string with blAH two']
>>> li = []
>>> for line in sl:
...   m = re.findall("blah", line, re.I)
...   if len(m) > 0:
...     li += m
... 
>>> 
>>> print li   #Contains all matches
['blah', 'blah', 'BLAh', 'bLah', 'BlaH', 'BLAH', 'blAH']
>>> st = set(li)
>>> print st   #Contains only *unique* matches
set(['bLah', 'BLAH', 'BLAh', 'BlaH', 'blah', 'blAH'])
>>> print len(st)
6
>>> print len(li)
7    #1 greater than len(st) because st drops a non-unique match
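
Since the question asks for both the distinct capitalizations and the total number of hits, the same findall() idea can also be sketched with collections.Counter. This is a swapped-in alternative to the list-plus-set approach above, and the function name count_variations and the sample list are hypothetical:

```python
import re
from collections import Counter


def count_variations(lines, name="snyder"):
    """Map each capitalization of `name` found in `lines` to its frequency."""
    # re.escape() keeps the name literal in case it contains regex characters.
    pattern = re.compile(re.escape(name), re.I)
    counts = Counter()
    for line in lines:
        counts.update(pattern.findall(line))
    return counts


sample = ["blah", "blah", "BLAh", "a longer BlaH string",
          "a BLAH string with blAH two"]
counts = count_variations(sample, name="blah")
total = sum(counts.values())

print(sorted(counts))  # the distinct capitalizations
print(total)           # total number of matches across all lines
```

Here len(counts) plays the role of len(st) in the session above, and total plays the role of len(li).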
eldarerathis
A: 

While you're at it, change that to:

pattern = re.compile('(snyder)', re.I)
[...]
name_count = pattern.search(line)

so that you're not re-compiling the regexp for each line in the input file.
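
One detail worth checking with a precompiled pattern: flags such as re.I belong in re.compile(), because the second positional argument of a compiled pattern's search() is a start position, not a flags value. A small sketch, with a made-up sample line:

```python
import re

line = "Mr. Snyder called"

# Flags are given once, at compile time.
pattern = re.compile(r"snyder", re.I)
match = pattern.search(line)

# Passing re.I to search() is read as pos (start index 2, the value of
# re.I), so this pattern stays case-sensitive and finds nothing here.
case_sensitive = re.compile(r"snyder")
no_match = case_sensitive.search(line, re.I)

print(match.group())
print(no_match)
```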

Just Some Guy
Python caches recently-used regexes, so you are actually not re-compiling the regex each time. This might be a case of premature optimization. :-)
kindall