tags:

views:

338

answers:

5

A little regex help please.

Why are these different?

Regex.Replace("(999) 555-0000 /x ext123", "/x.*|[^0-9]", String.Empty)
"9995550000"


Regex.Replace("(999) 555-0000 /x ext123", "[^0-9]|/x.*", String.Empty)
"9995550000123"

I thought the pipe operator did not care about order... or maybe there is something else that can explain this?

A: 

You are missing some parenthesis. :)

leppie
Where? This looks fine to me, it's a case of executing things in different order, not executing things incorrectly.
Matthew Scharley
"/x.*|[^0-9]" is equivalent to "/x(?:.*|[^0-9])" while Jeff probably wants "(?:/x.*|[^0-9])".
Guffa
Nope. | is a higher precedence than literals in regular expressions.
Billy ONeal
For the sake of argument, I took this to LinqPad and I got the exact same results as reported with and without parenthesis around everything.
Matthew Scharley
@Guffa: No, it's equivalent to `"(?:/x.*|[^0-9])"`; there's no need for parentheses. @BillyONeal: `|` has **lower** precedence than everything else.
Alan Moore
+6  A: 

If I took a wild guess, I'd say that it's running the first part of the expression first, and then the second part. So, what's happening in the second case is that it's removing all the non-numeric parts, which means that the second part will never match, and leaves you with the extension intact.

Since it has to run some part of the expression first, since it can't run both at the same time, I'd say this is a fairly natural assumption, though I can see why you might get caught... Definitely an interesting gotcha though.

EDIT: To address the wording, as Ben rightly pointed out, the expression is attempted to be matched starting with each character in the string. So, what happens in the second case is:

  • There is no "^" anchor, so we try at the start of each substring:
  • For "(999) 555-0000 /x ext123", "(" matches [^0-9], so replace that with nothing (remove it).
  • For "999) 555-0000 /x ext123", the "999" part doesn't match [^0-9], nor does it match /x.*, so we keep trying from the ")", which matches [^0-9], so we remove it.
  • And so on. When it gets to the "/", the same thing happens, it matches [^0-9] and is removed, meaning the second part of the regex can never, ever match.

In the first case, what happens is the following:

  • Again, no "^" anchor, so we try for all substrings:
  • For "(999) 555-0000 /x ext123", "(" does not match /x.*, but it does match [^0-9], so replace that with nothing (remove it).
  • For "999) 555-0000 /x ext123", the "999" part doesn't match /x.*, nor does it match [^0-9], so we keep trying from the ")", which doesn't match /x.*, but which matches [^0-9], so we remove it.
  • When we hit the "/x", this time /x.* does match, it matches "/x ext123", and the rest of the string is removed, leaving us with nothing to continue with.
Matthew Scharley
A minor quibble; "running" the expression makes it sound (IMO) like the first alternative is being run against the entire string, then the second alternative. In fact, the each alternative is tested before moving on to the next character. i.e. The order of testing is "ABABAB", not "AAABBB". It makes no difference for this particular example, but in other cases, the distinction is vital. :-)
Ben Blank
Expanded quite a bit to explain exactly what's happening and to make your point a bit clearer. Thankyou for pointing that out.
Matthew Scharley
Thanks for the explanation. To paraphrase your answer, I gather that the alternation comparison is "order" sensitive, which is the core of my problem.
Jeff Meatball Yang
A: 

The alternation operator ( | ) attempts expressions in the order provided. In the first example, it tries to match the expression /x.* first, then it tries to match [^0-9].

Because the string " /x ext" can be matched by the first expression [^0-9], as in the second example, the second part of the expression, /x.* is never invoked.

Billy3

EDIT: More information here on the alternation operator: http://www.regular-expressions.info/alternation.html

Billy ONeal
A: 

The problem in your expression is that /x.* matches greedily. In the first expression it is provided first, so the engine attempts to match it first, resulting in matching all of the rest of the string after /x too because of the .* .

If you change it to /x.*? you get the same result as the second expression. The ? after the * tells the regex engine to match non-greedy.

Check http://www.regular-expressions.info/repeat.html to learn more about greediness.

Schtibe
`.*?` at the end of a regular expression (Which is what you'd effectively be doing) is useless, because a nongreedy multiplier that is 0+ will be 0 if possible, and if it's at the end of an expression, it is ALWAYS possible (unless you have a `$` anchor, in which case greedy or non-greedy makes no difference)
Matthew Scharley
I don't quite get what you mean. If in the first case you change "/x.*|[^0-9]" to "/x.*?|[^0-9]" you get the correct result, try it out! (I did so with regex buddy)The reason ist that if the engine finds a .* it tries to match that as often as possible, which in that case is the whole string after /x .
Schtibe
@Schtibe: It leaves the "123" at the end like the other regex, but why do you assume that's the correct result? If that's what he wanted, `[^0-9]` would do the job by itself. The `/x.*` part only makes sense if it's supposed to match the rest of the string, *including* the digits. Anyway, like Matthew said, `.*?` never makes sense at the end of a regex. It will always match zero characters, because there's nothing to force it to match more.
Alan Moore
OK yes you're both right! I'm sorry.
Schtibe
+2  A: 

I think you've got the wrong idea about alternation (i.e., the pipe). In a pure DFA regex implementation, it's true that alternation favors the longest match no matter how the alternatives are ordered. In other words, the whole regex, whether it contains alternation or not, always returns the earliest and longest possible match--the "leftmost-longest" rule.

However, the regex implementations in most of today's popular programming languages, including .NET, are what Friedl calls Traditional NFA engines. One of the most important differences between them and DFA engines is that alternation is not greedy; it attempts the alternatives in the order they're listed and stops as soon as one of them matches. The only thing that will cause it to change its mind is if the match fails at a later point in the regex, forcing it to backtrack into the alternation.

Note that if you change the [^0-9] to [^0-9]+ in both regexes you'll get the same result from both--but not the one you want. (I'm assuming the /x.* alternative is supposed to match--and remove--the rest of the string, including the extension number.) I'd suggest something like this:

"[^0-9/]+|/x.*$"

That way, neither alternative can even start to match what the other one matches. Not only will that will prevent the kind of confusion you're experiencing, it avoids potential performance bottlenecks. One of the other major differences between DFA's and NFA's is that badly-written NFA's are prone to serious (even catastrophic) performance problems, and sloppy alternations are one of the easiest ways to trigger them.

Alan Moore