ansaurus

Question

Regular expression isn't found in line: execution just hangs.

Answer 1

+3 A:

It's not hanging, it's waiting for input on stdin! Replace for line in sys.stdin with something like:

import fileinput

for line in fileinput.input("sample.txt"):
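For completeness, here is a minimal sketch of that loop with a body attached. The helper name count_matching_lines and the default pattern are illustrative assumptions, not the asker's actual code.

```python
import fileinput
import re

def count_matching_lines(path, pattern="snyder"):
    """Count the lines of the file at `path` that contain `pattern`, ignoring case."""
    count = 0
    # Naming a file makes fileinput read it instead of blocking on stdin.
    with fileinput.input(files=(path,)) as lines:
        for line in lines:
            if re.search(pattern, line, re.I):
                count += 1
    return count
```

You would call it as count_matching_lines("sample.txt"). Note that fileinput.input() with no arguments reads the files named in sys.argv[1:] and only falls back to stdin when none are given.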

ennuikiller 2010-09-28 20:36:58

Thanks! This worked for me.

amazinghorse24 2010-09-29 06:31:21

Answer 2

+4 A:

First I would say to take a look at this thread for a bit of information about reading from stdin (if that's really what you want to do).

Second, I would consider just opening the file instead of reading from sys.stdin, either using a library like fileinput or a with statement or other file handle.
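As a rough sketch of the with-statement option, assuming the data lives in an ordinary text file (the function name lines_mentioning and its parameters are made up for illustration):

```python
import re

def lines_mentioning(path, word="snyder"):
    """Return the lines of the file at `path` that contain `word`, ignoring case."""
    # The with statement closes the file automatically, even if an error occurs.
    with open(path) as handle:
        return [line.rstrip("\n") for line in handle
                if re.search(word, line, re.I)]
```

Because the file is opened explicitly, nothing here ever waits on stdin.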

Next, I would add that your regular expression probably isn't going to do what you expect. The expression [snyder] is a character class, which matches exactly one character from the class. In other words, it matches any single one of the letters s, n, y, d, e, or r. If you want to match the literal string snyder, just use that as your expression: re.search("snyder", line, re.I). Or, if you don't want substring matches (cases where snyder appears inside a longer word), try the regex \bsnyder\b, written as a raw string (r"\bsnyder\b") so that Python does not turn \b into a backspace character.
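To make the difference concrete, here is a small sketch comparing the three patterns on one made-up sample line (the sample text is an assumption chosen for illustration):

```python
import re

line = "Schneider, Snyder and snyderman"

# [snyder] is a character class: every match is ONE character out of s, n, y, d, e, r.
char_class_hits = re.findall(r"[snyder]", line, re.I)

# The literal pattern matches the whole name, including inside longer words.
literal_hits = re.findall(r"snyder", line, re.I)

# \b anchors at word boundaries, so the "snyder" inside "snyderman" is skipped.
whole_word_hits = re.findall(r"\bsnyder\b", line, re.I)

print(literal_hits)     # ['Snyder', 'snyder']
print(whole_word_hits)  # ['Snyder']
```

Every element of char_class_hits is a single letter, which is why the original expression appears to match almost any line.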

Edit re: your comment - Two things I'll point out here:

1) While [s][n][y][d][e][r] is semantically equivalent to snyder, you might want to consider using the latter for the sake of readability. A character class of one character is equivalent to that one character alone (as long as it's properly escaped and so forth if necessary). Yours will work, so that's just a suggestion/heads-up.

2) Try using re.findall() in place of re.search(). I think you'll get what you want with something like:

import fileinput
import re

variations = []
for line in fileinput.input():
    # findall() returns every match on the line (an empty list if there are none)
    variations.extend(re.findall(r"snyder", line, re.I))
var_set = set(variations)
print(var_set)
print(len(var_set))

An example of what this will do:

>>> print sl 
['blah', 'blah', 'what', 'is', 'this', 'BLAh', 'some', 'random', 'bLah', 'text', 'a longer BlaH string', 'a BLAH string with blAH two']
>>> li = []
>>> for line in sl:
...   m = re.findall("blah", line, re.I)
...   if len(m) > 0:
...     li += m
... 
>>> 
>>> print li   #Contains all matches
['blah', 'blah', 'BLAh', 'bLah', 'BlaH', 'BLAH', 'blAH']
>>> st = set(li)
>>> print st   #Contains only *unique* matches
set(['bLah', 'BLAH', 'BLAh', 'BlaH', 'blah', 'blAH'])
>>> print len(st)
6
>>> print len(li)
7    #1 greater than len(st) because st drops a non-unique match

eldarerathis 2010-09-28 20:49:18

Answer 3

A:

While you're at it, change that to:

pattern = re.compile('(snyder)', re.I)  # flags go to compile(); pattern.search() takes no flags
[...]
name_count = pattern.search(line)

so that you're not re-compiling the regexp for each line in the input file.
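One thing to watch with a precompiled pattern: its search() method takes a start position as the second argument, not flags, so the case-insensitive flag has to be given to re.compile(). A minimal sketch, with first_match as a hypothetical helper name:

```python
import re

# Compile once, with the flag attached to the compiled pattern itself.
pattern = re.compile(r"(snyder)", re.I)

def first_match(line):
    """Return the matched text, or None when the line has no match."""
    match = pattern.search(line)
    return match.group(1) if match else None
```

For example, first_match("Mr. SNYDER called") returns 'SNYDER', while a line without the name returns None.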

Just Some Guy 2010-09-28 21:24:33

Python caches recently-used regexes, so you are actually not re-compiling the regex each time. This might be a case of premature optimization. :-)

kindall 2010-09-28 21:40:43
