ansaurus

Question

Answer 1

+3 A:

I think it's fine.

Some minor points:

You can replace result.groups()[x] with result.group(x+1).
If you don't need to capture foo and bar, just use r'foo(.*)bar'.
If you're using Python 2.5+, use the with statement so that the file is closed properly even when an exception is raised (on 2.5 itself this needs from __future__ import with_statement).
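A minimal sketch of these three points. The sample lines and the temporary file are made up for illustration (the asker's real input file is not shown), and print() is used so that it also runs on Python 3:

```python
import os
import re
import tempfile

# Hypothetical sample data; the real input file from the question is unknown.
sample_lines = ["xx foo first bar yy\n", "no match here\n", "foo second bar\n"]

# No capturing groups around foo and bar, so the wanted text is group 1.
pattern = re.compile(r'foo(.*)bar')

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.writelines(sample_lines)
    in_file = tmp.name

found = []
# The with statement closes the file even if an exception is raised inside.
with open(in_file, 'r') as file_list:
    for line in file_list:
        result = pattern.search(line)
        if result is not None:
            # group(1) and groups()[0] refer to the same captured text.
            assert result.group(1) == result.groups()[0]
            found.append(result.group(1))

os.remove(in_file)
print(found)
```

The captured text keeps the spaces around it, because .* matches everything between foo and bar.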

BTW, as a 5-liner (not that I recommend this):

import re
pattern = re.compile(r'foo(.*)bar')
with open(inFile, 'r') as fileList:
  searchResults = (pattern.search(line) for line in fileList)
  groups = (result.group(1) for result in searchResults if result is not None)
  print '\n'.join(groups)

KennyTM 2010-03-29 08:53:38

For some reason, `result.group(1)` captures `foo` for me, but `result.group(2)` works.

FarmBoy 2010-03-29 10:12:16

@FarmBoy: Because you are matching with `(foo)(.*)(bar)` instead of `foo(.*)bar`.

KennyTM 2010-03-29 12:46:15

Wouldn't the index of a tuple be 0-based? I was expecting `result.group(0)` would return `foo` in my code.

FarmBoy 2010-03-29 16:01:35

@Farm: Yes, the index is 0-based, but the 0th group is the whole match (e.g. `fooblahblahblahbar`). See http://docs.python.org/library/re.html#re.MatchObject.group for details.

KennyTM 2010-03-29 16:11:50
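To make the group numbering discussed in these comments concrete, here is a small sketch. The sample string is the one from the comment above; the variable names are only illustrative:

```python
import re

line = "fooblahblahblahbar"

# Three capturing groups: group(1) is 'foo', group(2) is the middle part.
three = re.search(r'(foo)(.*)(bar)', line)
# One capturing group: group(1) is the middle part.
one = re.search(r'foo(.*)bar', line)

print(three.group(0))   # the whole match, not the first captured group
print(three.groups())   # tuple of captured groups only, indexed from 0
print(three.group(2))   # the middle part when foo and bar are captured too
print(one.group(1))     # the middle part when only .* is captured
```

So groups()[0] corresponds to group(1), while group(0) is always the entire matched text.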

Answer 2

+1 A:

You don't need a regex. Split your string on "bar", iterate over the pieces, check each one for "foo", then split it on "foo" and take what is to the right. Of course, you can also use other string manipulation, such as finding the indexes.

>>> s="w1 w2 foo what i want bar w3 w4 foowhatiwantbar w5"
>>> for item in s.split("bar"):
...     if "foo" in item:
...         print item.split("foo")[1:]
...
[' what i want ']
['whatiwant']
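One possible reading of "getting the index" is str.find. The following is only a sketch of that idea; the helper name between and its parameters are made up for illustration:

```python
def between(text, start="foo", end="bar"):
    """Return every substring found between start and the next end."""
    results = []
    pos = 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            break                    # no further start marker
        i += len(start)              # skip past the start marker itself
        j = text.find(end, i)
        if j == -1:
            break                    # start marker without a closing end marker
        results.append(text[i:j])
        pos = j + len(end)           # continue searching after the end marker
    return results


s = "w1 w2 foo what i want bar w3 w4 foowhatiwantbar w5"
print(between(s))
```

Unlike the split version, this collects plain strings rather than one-element lists.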

ghostdog74 2010-03-29 08:58:04

Answer 3

+1 A:

There are two tricks to be had: the first is the re.finditer regular expression function (and method). The second is the use of the mmap module.

From the documentation on re.DOTALL, we can note that . does not match newlines:

without this flag, '.' will match anything except a newline.

So if you look for all matches anywhere in the file (for example, after reading it into a string with f.read()), you can treat each line as an isolated substring. (That's not quite true, though: if you want the ^ and $ assertions to work per line, use re.MULTILINE.) Now, because you said we can assume at most one occurrence per line, we don't have to worry about the greedy .* matching more than it should within a line (with several occurrences on one line, it would!). So right away, you could replace all that with iterating over finditer() instead:

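import re

fileList = open(inFile, 'r')  # inFile is the path from the question
pattern = re.compile(r'foo(.*)bar')
for result in pattern.finditer(fileList.read()):
    print result.group(1)  # group(1) is the captured text; groups(1) would print a tuple
fileList.close()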

This isn't really nice though. The problem here is that the entire file is read into memory for your convenience. It'd be nice if there was a convenient way to do it without possibly breaking on larger files. And, well, there is! Enter the mmap module.

mmap lets you treat a file as if it were a string (a mutable string, no less!), and it doesn't load the whole thing into memory. The long and short of it is, you can use the following code instead:

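import mmap
import re

fileList = open(inFile, 'r+b')  # the default mmap access is read/write, hence 'r+b'
fileS = mmap.mmap(fileList.fileno(), 0)  # length 0 maps the whole file
pattern = re.compile(r'foo(.*)bar')
for result in pattern.finditer(fileS):
    print result.group(1)  # group(1) is the captured text; groups(1) would print a tuple
fileS.close()
fileList.close()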

and it will work just the same, but without consuming the whole file at once (hopefully).
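The code above is Python 2. On Python 3 an mmap object behaves like bytes, so the pattern has to be a bytes pattern as well. A minimal sketch under that assumption, using a made-up temporary file and a read-only mapping instead of 'r+b':

```python
import mmap
import os
import re
import tempfile

# Hypothetical sample file so the example is self-contained.
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b"a foo one bar\nnothing\nfoo two bar b\n")
    in_file = tmp.name

# Bytes pattern, because the mapped file is a bytes-like object.
pattern = re.compile(rb'foo(.*)bar')

matches = []
with open(in_file, 'rb') as f:
    # ACCESS_READ maps the file read-only, so opening with 'rb' is enough.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        for result in pattern.finditer(mapped):
            # group(1) is a bytes object; decode it for display.
            matches.append(result.group(1).decode('ascii'))

os.remove(in_file)
print(matches)
```

Note that mapping an empty file raises an error, so a real script would probably want to check the file size first.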

Devin Jeanpierre 2010-03-29 09:08:16

Answer 4

A:

I have a few minor suggestions:

Unless you're certain that foo and bar can occur no more than once per line, it's better to use .*? instead of .*
If you need to make sure that foo and bar should only be matched as entire words (as opposed to foonly and rebar), you should add \b anchors around them (\bfoo\b etc.)
You could use lookaround to match only the match itself ((?<=\bfoo\b).*?(?=\bbar\b)), so now result.group(0) will contain the match. But that's not really more readable :)
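A short sketch of how these three suggestions change the result. The sample strings are made up for illustration:

```python
import re

# 1. Greedy vs. non-greedy when foo and bar occur twice on one line.
line = "foo one bar and foo two bar"
greedy = re.findall(r'foo(.*)bar', line)
lazy = re.findall(r'foo(.*?)bar', line)
print(greedy)
print(lazy)

# 2. Word boundaries keep foonly and rebar from matching.
text = "foonly a rebar, foo real bar"
loose = re.findall(r'foo(.*?)bar', text)
strict = re.findall(r'\bfoo\b(.*?)\bbar\b', text)
print(loose)
print(strict)

# 3. Lookaround: the wanted text becomes the whole match, group(0).
m = re.search(r'(?<=\bfoo\b).*?(?=\bbar\b)', text)
print(repr(m.group(0)))
```

The lookbehind works here because \bfoo\b has a fixed width, which Python's re module requires for lookbehind assertions.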

Tim Pietzcker 2010-03-29 09:10:35


How to improve my Python regex syntax?
