I am very new to Python, and fairly new to regex. (I have no Perl experience.)

I am able to use regular expressions in a way that works, but I'm not sure that my code is particularly Pythonic or concise.

For example, if I wanted to read in a text file and print out the text that appears directly between the words 'foo' and 'bar' in each line (presuming this occurs zero or one times per line), I would write the following:

fileList = open(inFile, 'r')
pattern = re.compile(r'(foo)(.*)(bar)')
for line in fileList:
    result = pattern.search(line)
    if (result != None):
        print result.groups()[1]

Is there a better way? The if is necessary to avoid calling groups() on None. But I suspect there is a more concise way to obtain the matching string when there is one, without raising errors when there isn't.

I'm not hoping for Perl-like unreadability. I just want to accomplish this common task in the commonest and simplest way.

+3  A: 

I think it's fine.

Some minor points:

  • You can replace result.groups()[x] with result.group(x+1).
  • If you don't need to capture foo and bar, just use r'foo(.*)bar'.
  • If you're using Python 2.5+, try to use the with statement so that the file is closed properly even if an exception is raised.

BTW, as a 5-liner (not that I recommend this):

import re
pattern = re.compile(r'foo(.*)bar')
with open(inFile, 'r') as fileList:
  searchResults = (pattern.search(line) for line in fileList)
  groups = (result.group(1) for result in searchResults if result is not None)
  print '\n'.join(groups)
KennyTM
For some reason, `result.group(1)` captures `foo` for me, but `result.group(2)` works.
FarmBoy
@FarmBoy: Because you are matching with `(foo)(.*)(bar)` instead of `foo(.*)bar`.
KennyTM
Wouldn't the index of a tuple be 0-based? I was expecting `result.group(0)` would return `foo` in my code.
FarmBoy
@Farm: Yes, the index is 0-based, but the 0th group is the whole match (e.g. `fooblahblahblahbar`). See http://docs.python.org/library/re.html#re.MatchObject.group for details.
KennyTM
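To make the group numbering discussed in these comments concrete, here is a small sketch. It is written for Python 3 (so print is a function), and the sample line is made up for illustration rather than taken from the question's input file.

```python
import re

# Hypothetical sample line, not from the question's input file.
line = "xx fooblahblahblahbar yy"

match = re.search(r'(foo)(.*)(bar)', line)

whole = match.group(0)       # the entire match
first = match.group(1)       # first parenthesised group
middle = match.group(2)      # second group: the text between foo and bar
all_groups = match.groups()  # tuple of groups 1..n, itself indexed from 0

print(whole, first, middle, all_groups)
```

So groups()[1] and group(2) refer to the same capture, which is why the index shifts by one between the two spellings.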
+1  A: 

You don't need a regex. Split your string on "bar", iterate over the pieces, and for each piece that contains "foo", split it on "foo" and take what is to the right. Of course, you can also use other string operations, such as finding the indices of the two words.

>>> s="w1 w2 foo what i want bar w3 w4 foowhatiwantbar w5"
>>> for item in s.split("bar"):
...     if "foo" in item:
...         print item.split("foo")[1:]
...
[' what i want ']
['whatiwant']
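For the one-occurrence-per-line case in the question, the same idea can be sketched with str.partition. This is only an illustration, written for Python 3; the helper name between and its parameters are made up here. Note that it stops at the first "bar" after the first "foo", which differs from the greedy foo(.*)bar pattern when a line contains several "bar"s.

```python
def between(line, start="foo", end="bar"):
    """Return the text between the first `start` and the next `end`, or None."""
    # partition returns (before, separator, after); separator is '' if not found
    _, found_start, rest = line.partition(start)
    if not found_start:
        return None
    middle, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return middle


print(between("w1 w2 foo what i want bar w3 w4"))
```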
ghostdog74
+1  A: 

There are two tricks to be had: the first is the re.finditer regular expression function (and method). The second is the use of the mmap module.

From the documentation on re.DOTALL, we can note that . does not match newlines:

without this flag, '.' will match anything except a newline.
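A quick way to check this behaviour is the sketch below (Python 3; the sample text is invented for illustration). Without re.DOTALL each match stays within one line; with it, the greedy .* runs across newlines and swallows everything up to the last bar.

```python
import re

# Hypothetical multi-line input; the last "foo" and "bar" sit on separate lines.
text = "foo a bar\nfoo b bar\nfoo\nbar"

# Default: '.' stops at newlines, so matches never span lines.
per_line = re.findall(r'foo(.*)bar', text)

# With DOTALL: '.' also matches newlines, so one greedy match spans the text.
across_lines = re.findall(r'foo(.*)bar', text, re.DOTALL)

print(per_line)
print(across_lines)
```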

So if you look for all matches anywhere in the file (such as when read into a string using f.read()), you can pretend each line is an isolated substring (note: it's not quite true, though. If you want the ^ and $ assertions to work this way, use re.MULTILINE). Now, because you noted that we assume there are only zero or one occurrences per line, we don't have to worry about re.finditer() matching more than it should (because it would!). So right away, you could replace all that with iterating over finditer() instead:

import re

fileList = open(inFile, 'r')
pattern = re.compile(r'foo(.*)bar')
for result in pattern.finditer(fileList.read()):
    print result.group(1)  # group(1) is the text between foo and bar

This isn't really nice though. The problem here is that the entire file is read into memory for your convenience. It'd be nice if there was a convenient way to do it without possibly breaking on larger files. And, well, there is! Enter the mmap module.

mmap lets you treat a file as if it were a string (a mutable string, no less!), and it doesn't load the whole thing into memory. The long and short of it is, you can use the following code instead:

import mmap
import re

fileList = open(inFile, 'r+b')
fileS = mmap.mmap(fileList.fileno(), 0)  # length 0 maps the whole file
pattern = re.compile(r'foo(.*)bar')
for result in pattern.finditer(fileS):
    print result.group(1)  # group(1) is the text between foo and bar

and it will work just the same, but without consuming the whole file at once (hopefully).
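For anyone on Python 3, here is a rough sketch of the same approach. The main differences I'm aware of are that the pattern has to be a bytes pattern when searching an mmap, and that the captured groups come back as bytes. The function name find_between is made up for this example, and the read-only mapping and empty-file guard are my own additions, not part of the code above.

```python
import mmap
import os
import re


def find_between(path):
    """Return the bytes captured between foo and bar on each line of the file."""
    # mmap refuses to map an empty file, so handle that case up front.
    if os.path.getsize(path) == 0:
        return []
    pattern = re.compile(rb'foo(.*)bar')  # bytes pattern: mmap exposes bytes
    with open(path, 'rb') as f:
        # length 0 maps the whole file; ACCESS_READ keeps the mapping read-only
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return [m.group(1) for m in pattern.finditer(mapped)]
```

Call .decode() on the results if you need text, assuming you know the file's encoding.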

Devin Jeanpierre
A: 

I have a few minor suggestions:

  • Unless you're certain that foo and bar can occur no more than once per line, it's better to use .*? instead of .*
  • If you need to make sure that foo and bar should only be matched as entire words (as opposed to foonly and rebar), you should add \b anchors around them (\bfoo\b etc.)
  • You could use lookaround to match only the match itself ((?<=\bfoo\b).*?(?=\bbar\b)), so now result.group(0) will contain the match. But that's not really more readable :)
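The three suggestions can be compared side by side with a short sketch (Python 3; the sample strings are invented for illustration):

```python
import re

# Hypothetical line with two foo...bar pairs.
line = "foo a bar foo b bar"

# Greedy: .* runs to the LAST bar on the line.
greedy = re.search(r'foo(.*)bar', line).group(1)

# Non-greedy: .*? stops at the first bar after each foo.
lazy = re.findall(r'foo(.*?)bar', line)

# Word boundaries: foonly and rebar are not accepted as foo and bar.
no_word_match = re.search(r'\bfoo\b(.*?)\bbar\b', "foonly x rebar")

# Lookaround: the whole match (group 0) is just the text in between.
lookaround = re.search(r'(?<=\bfoo\b).*?(?=\bbar\b)', line).group(0)

print(greedy, lazy, no_word_match, lookaround)
```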
Tim Pietzcker