tags:

views:

146

answers:

5

The Greedy Option of Regex is really needed?

Lets say I have following texts, I like to extract texts inside [Optionx] and [/Optionx] blocks

[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]

But with Regex Greedy Option, its give me

Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20

Anybody need like that? If yes, could you let me know?

+6  A: 

Regular expressions can potentially match multiple portion of a text.

For example consider the expression (ab)*c+ and the string "abccababccc". There are many portions of the string that can match the regular expressions:

   (abc)cababccc
   (abcc)ababccc
   abcc(ababccc)
   abccab(abccc)
   ab(c)cababccc
   ab(cc)ababccc
   abcabab(c)ccc
   ....

some regular expressions implementation are actually able to return the entire set of matches but it is most common to return a single match.

There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match" which results in the greedy behaviour you observed.

This is tipical of search and replace (a la grep) when with a+ you probably mean to match the entire aaaa rather than just a single a.

Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is the most useful when you have delimiters like your case.

It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.

Remo.D
Thanks Remo.D, `abcc(ababccc)` can only match by greedy option, that make sense.
S.Mark
+7  A: 

If I understand correctly, the question is “why (when) do you need greedy matching?”

The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:

(.)\1+

(\1 is a back-reference that matches the same text as the first parenthesized expression).

Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most application s we would be interested in finding the whole substring of bs, bbbbb.

By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.

In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.

Konrad Rudolph
Thx for explanation and example about RLE compression
S.Mark
+1  A: 

just use this algorithm which you can use in your fav language. No need regex.

flag=0
open file for reading
for each line in file :
    if check "[/Option" in line:
        flag=0
    if check "[Option" in line:
        flag=1
        continue
    if flag:
        print line.strip()
        # you can store the values of each option in this part
Thx for code, but the problem is not in the parsing, problem is in my mind, why greedy function is needed in regex.
S.Mark
why greedy function is needed?? hmm.. because somebody might want to use it??
Greediness in regex is a byproduct of the "longest leftmost match" rule. There's not a real "need" for it. I'll try to rephrase better my answer to make it more clear.
Remo.D
+2  A: 

If you're looking for text between the optionx blocks, instead of searching for .+, search for anything that's not "[\".

This is really rough, but works:

\[[^\]]+]([^(\[/)]+)

The first bit searches for anything in square brackets, then the second bit searches for anything that isn't "[\". That way you don't have to care about greediness, just tell it what you don't want to see.

Skilldrick
Indeed. Any time you find yourself using `.` with a quantifier in a regex, stop and ask yourself whether you *really* mean "any character". 99% of the time, you don't and you should be using a negated character class instead. Proper use of negated character classes eliminates almost all greediness issues.
Dave Sherohman
@Dave Exactly, much better put!
Skilldrick
+1  A: 

One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:

With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.

Let's say you apply a.*c to abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.

On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.

So if you know your data, you can tailor your regex to match it as efficiently as possible.

Tim Pietzcker