First, I would take a look at this thread for a bit of information about reading from stdin (if that's really what you want to do).
Second, I would consider just opening the file instead of reading from sys.stdin, either using a library like fileinput, or an open() call inside a with statement, or some other file handle.
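As a rough sketch of that second suggestion, the snippet below opens a file by name inside a with statement rather than reading sys.stdin. The helper name matching_lines and the throwaway sample file are made up for illustration, and the code is written to run under Python 3.

```python
import os
import re
import tempfile


def matching_lines(path, pattern):
    """Return the lines of the file at `path` that contain `pattern` (case-insensitive)."""
    regex = re.compile(pattern, re.I)
    with open(path) as handle:  # the file is closed automatically on exit
        return [line.rstrip("\n") for line in handle if regex.search(line)]


# Build a throwaway input file so the sketch is self-contained (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Snyder was here\nnothing to see\nthe SNYDER file\n")
    sample_path = tmp.name

hits = matching_lines(sample_path, "snyder")
os.remove(sample_path)
print(hits)
```

fileinput.input() is the other route: it iterates over the files named on the command line and falls back to stdin when none are given, so the same loop body works either way.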
Next, I would add that your regular expression probably isn't going to do what you expect it to. The expression [snyder] is a character class, which matches exactly one occurrence of any single character in the class. In other words, it matches the individual letters s, n, y, d, e, or r. If you want to match the literal string snyder, then you should just use that as your expression: re.search("snyder", line, re.I). Or, if you don't want substring matches (cases where snyder might appear within another string), you can try the regex \bsnyder\b, written as a raw string (r"\bsnyder\b") so that Python doesn't interpret \b as a backspace character.
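To make the difference concrete, here is a small sketch comparing the three patterns. The sample strings are invented for illustration.

```python
import re

text = "Snyder met Mr. Snyderman and Ned"

# Character class: each match is ONE character from the set s, n, y, d, e, r.
char_class = re.findall(r"[snyder]", "Ned", re.I)

# Literal string: matches the full sequence, including inside "Snyderman".
literal = re.findall(r"snyder", text, re.I)

# Word boundaries: only the standalone word matches.
whole_word = re.findall(r"\bsnyder\b", text, re.I)

print(char_class)  # ['N', 'e', 'd']
print(literal)     # ['Snyder', 'Snyder']
print(whole_word)  # ['Snyder']
```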
Edit re: your comment - Two things I'll point out here:
1) While [s][n][y][d][e][r] is semantically equivalent to snyder, you might want to consider using the latter for the sake of readability. A character class of one character is equivalent to that one character alone (as long as it's properly escaped and so forth if necessary). Yours will work, so that's just a suggestion/heads-up.
2) Try using re.findall() in place of re.search(). I think you'll get what you want with something like:
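import fileinput
import re

variations = []
for line in fileinput.input():
    # findall returns every match on the line, not just the first one
    found = re.findall(r"snyder", line, re.I)
    variations += found  # adding an empty list is a no-op
var_set = set(variations)  # keep only the distinct spellings
print(var_set)
print(len(var_set))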
An example of what this will do:
>>> print sl
['blah', 'blah', 'what', 'is', 'this', 'BLAh', 'some', 'random', 'bLah', 'text', 'a longer BlaH string', 'a BLAH string with blAH two']
>>> li = []
>>> for line in sl:
...     m = re.findall("blah", line, re.I)
...     if len(m) > 0:
...         li += m
...
>>>
>>> print li #Contains all matches
['blah', 'blah', 'BLAh', 'bLah', 'BlaH', 'BLAH', 'blAH']
>>> st = set(li)
>>> print st #Contains only *unique* matches
set(['bLah', 'BLAH', 'BLAh', 'BlaH', 'blah', 'blAH'])
>>> print len(st)
6
>>> print len(li)
7 #1 greater than len(st) because st drops a non-unique match
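For comparison, here is a minimal sketch of why re.findall() is the better fit than re.search() for collecting variations, and a quick check that the single-character classes behave like the plain literal. The sample lines are made up, and the code is written for Python 3.

```python
import re

# Hypothetical input lines, standing in for the real file contents.
lines = ["Snyder and SNYDER", "no match here", "snyder again, Snyder"]

# re.search stops at the first match on each line (or returns None).
searches = (re.search("snyder", line, re.I) for line in lines)
first_only = [m.group(0) for m in searches if m]

# re.findall returns every non-overlapping match on each line.
all_matches = []
for line in lines:
    all_matches.extend(re.findall("snyder", line, re.I))

unique = set(all_matches)

# Single-character classes give the same matches as the plain literal.
same = (re.findall("[s][n][y][d][e][r]", lines[0], re.I)
        == re.findall("snyder", lines[0], re.I))

print(first_only)      # ['Snyder', 'snyder']
print(all_matches)     # ['Snyder', 'SNYDER', 'snyder', 'Snyder']
print(sorted(unique))  # ['SNYDER', 'Snyder', 'snyder']
print(same)            # True
```

With re.search() the second spelling on a line is never seen, which is why the count of variations comes out too low.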