tags:

views:

224

answers:

4

Disclaimer: I'm not a regex expert.

I'm using Python re module to perform regex matching on many htm files. One of the patterns is something like this:

<bla><blabla>87765.*</blabla><bla>

The problem I've encountered is that instead of finding all (say) five occurrences of the pattern, it will find only one. Because it welds all the occurrences into one, using the <bla><blabla>87765 part of the first occurrence and the </blabla><bla> part of the last occurrence in the page.

Is there any way to tell re to find the smallest match?

+5  A: 

You can use a reluctant qualifier in your pattern (for more details, reference the python documentation on the *?, +?, and ?? operators):

<bla><blabla>87765.*?</blabla><bla>

Or, exclude < from the possible matched characters:

<bla><blabla>87765[^<]*</blabla><bla>

only if there are no children tags between <blabla> and </blabla>.

iammichael
The `[^<]` suggestion will fail if there are children tags between `<blabla>` and `</blabla>`.
ΤΖΩΤΖΙΟΥ
A: 

Um... there is a way to tell re to find the smallest match, and it's precisely by using non-greedy quantifiers.

<bla><blabla>87765.*?</blabla><bla>

I can't imagine why you would want to do it while using greedy quantifiers.

David Zaslavsky
Dave Berk
+2  A: 

The Python re module supports nongreedy matching. You just add a ? to the end of the wildcard pattern, such as .*?. You can learn more at this HOWTO.

Gordon Worley
+1  A: 
I believe the regex
<bla><blabla>87765.*?</blabla><bla>
can produce catastrophic backtracking.

Instead, use:
<bla><blabla>87765[^<]*</blabla><bla>

Using atomic grouping (I'm not sure Python supports this), 
the above regex becomes 
<bla><blabla>(?>(.*?<))/blabla><bla>

Everything between (?> ... ) is treated as one single token by the regex engine, once the regex engine leaves the group. Because the entire group is one token, no backtracking can take place once the regex engine has found a match for the group. If backtracking is required, the engine has to backtrack to the regex token before the group (the caret in our example). If there is no token before the group, the regex must retry the entire regex at the next position in the string. Note that I needed to include the "<" in the group to ensure atomicity. Close enough.

Dragos Toader
With only the one quantifier and no alternation, I don't see any potential for catastrophic backtracking. The `.*?` version might be less efficient than `[^<]*` but you'd probably never notice the difference. And, no, Python does not support atomic groups.
Alan Moore