Hello,

I've got a regular expression that looks something like this:

a(|bc)

This expression matches the string "a" perfectly, but it doesn't match "abc". What does the expression in the parentheses mean?

Edit: Using C# with the following code:

Match m = Regex.Match(TxtTest.Text, TxtRegex.Text);
if (m.Success)
  RtfErgebnis.Text = m.Value;
else
  RtfErgebnis.Text = "Gültig, aber kein Match!";

"TxtTest" contains the string to test (in this case "abc"). "TxtRegex" contains the regular expression (in this case "a(|bc)").

"RtfErgebnis" shows "Gültig, aber kein Match!" ("Valid, but no match!"), which means the regex is valid but the given test string did not match.

On a side note:

The expression

a(|bc)d

matches "ad" as well as "abcd". So why does the previous expression not match "abc"?

I have no influence on the regular expression I will get. I just stumbled upon this special case. I need to know how to handle it for regex parsing and data generation.

Edit 2:

"RtfErgebnis" shows "Gültig, aber kein Match!" which means the regex is valid but the given test string did not match.

I had a small error in the parameters I passed, so now it shows "a", which is completely right.

+3  A: 

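The empty branch in (|bc) matches the empty string: it always succeeds but doesn't consume a character, because an empty expression does not describe any character. Since it comes first, it is tried first, and the bc branch is never needed.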

Swap the branches and you will get the “longest” match:

a(bc|)

This will match abc in abc (bc branch taken) but also a in ax (empty branch taken).
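The effect of swapping the branches can be sketched as follows. This is a minimal illustration in Python, not the C# code from the question; it assumes only that Python's `re` module, like .NET's `Regex`, tries alternatives from left to right. The helper `matched_text` is a hypothetical name introduced for this example.

```python
import re

# Alternatives are tried left to right; the first one that lets the
# overall match succeed wins, even if a later one would match more text.
empty_first = re.compile(r"a(|bc)")
bc_first = re.compile(r"a(bc|)")


def matched_text(pattern, text):
    """Return the matched substring, or None if there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None


print(matched_text(empty_first, "abc"))  # a   (empty branch taken)
print(matched_text(bc_first, "abc"))     # abc (bc branch taken)
print(matched_text(bc_first, "ax"))      # a   (empty branch taken)
```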

Gumbo
You are right as well, but the following answer explained it more clearly to me.
Aurril
@Aurril, I assume by "the following answer" you mean the one posted by David Hedlund. Answers don't appear in a fixed order here, so you need to be a little more specific. Welcome to SO!
Alan Moore
The moment I noticed this fact, it was already too late to edit my comment.
Aurril
+1  A: 

Actually a(|bc) does match abc

perl -n -e 'print "Output:$_" if /a(|bc)/; '
a
Output:a
abc
Output:abc
bc

Therefore there is no inconsistent behaviour between a(|bc) and a(|bc)d
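The difference between "the input contains a match" and "what was actually matched" can be made visible by printing both. The sketch below uses Python rather than Perl, on the assumption that both engines try the empty alternative first; `describe` is a hypothetical helper. The three tuple fields correspond roughly to Perl's `$_` (whole input), `$&` (matched text) and `$1` (first group).

```python
import re


def describe(text):
    """Return (input, matched text, group 1), or None if nothing matches."""
    m = re.search(r"a(|bc)", text)
    if m is None:
        return None
    return (text, m.group(0), m.group(1))


for line in ["a", "abc", "bc"]:
    print(describe(line))
# ('a', 'a', '')
# ('abc', 'a', '')
# None
```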

Paul
This is interesting, as it doesn't do this in C#.
Aurril
@Aurril Some caution is indicated. In the perl print statement, $_ is the supplied string; $1 would be what the first () group captured. I am printing the entire string, not the match. In the first and second cases the regex matches "a" and $1 is empty, because the empty alternative is tried first and succeeds. In the third case there is no match at all, because there is no "a" to start it.
Paul
+5  A: 

The pipe means "or". Your first expression says "a, followed by nothing or bc". Hence, "a" is a full match, and it doesn't bother to include "bc".

The second expression says "a, followed by nothing or bc, followed by d". In that version, a match is only complete when it selects everything all the way through to "d".

If you want it to prefer the "bc" option over the nothing option, you could rewrite your expression as such:

a(bc)?

which means "a, followed by zero or one occurrence of bc", in which case most engines will treat "abc", rather than "a", as the full match.
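The three cases described here can be checked side by side. This is a sketch in Python under the assumption that its `re` module behaves like .NET for these patterns (ordered alternation, greedy `?`). The last line shows one possible way to require that the whole input is consumed: `re.fullmatch` forces the engine to backtrack into the bc branch. In C# the usual equivalent would be anchoring the pattern with `^` and `$`, which is an assumption about how you might apply this, not something from the question.

```python
import re

# Trailing "d": the empty branch is tried first, "d" fails against "b",
# so the engine backtracks and takes the bc branch.
with_d_on_abcd = re.search(r"a(|bc)d", "abcd").group(0)
with_d_on_ad = re.search(r"a(|bc)d", "ad").group(0)

# Greedy optional group: bc is preferred over nothing.
optional_group = re.search(r"a(bc)?", "abc").group(0)

# Requiring the whole string to match also forces the bc branch.
whole_string = re.fullmatch(r"a(|bc)", "abc").group(0)

print(with_d_on_abcd)  # abcd
print(with_d_on_ad)    # ad
print(optional_group)  # abc
print(whole_string)    # abc
```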

David Hedlund
Thanks, this is it. Now I know how to handle this expression. I would vote you up if I had the reputation.
Aurril
you do have the reputation ;) it takes 15 rep to vote up, and you've got 23. anyhow, i'm happy it worked out for you
David Hedlund
On second thought, following your explanation, how come "abc" is not matched at all by "a(|bc)", not even the "a"? @Vote up: now I have the reputation, so I voted your answer up.
Aurril
it does match the "a". if you've got further problems, I think you'll need to show some more of what you're doing
David Hedlund
some examples, in js: http://jsbin.com/axehu/
David Hedlund
Edited my question to include a code sample of how I retrieve a match.
Aurril
most likely, it is your values that are wrong. i just tested it, and this outputs 'true': `Console.WriteLine(Regex.Match("abc", "a(|bc)").Success)`. so debug your code and have a look at what you're *really* passing to `Match`. please note that regex is case sensitive unless you pass `RegexOptions.IgnoreCase` to it.
David Hedlund
You are right, I had a little mistake (did some previous checking and replacing on the regex) which I forgot. Now it outputs "a" which totally conforms with your explanation. Thank you!
Aurril
+1  A: 

Whether the group (|bc) captures "" or "bc" depends on the order of the alternatives, and probably on the regular expression engine being used as well. For example, in grep and sed, the group only captures bc if the order is reversed, (bc|):

echo abc | sed -n 's/a\(bc\|\)/\1/p'

The above returns:

bc

And the following, with (|bc), returns nothing:

echo abc | sed -n 's/a\(\\|bc\)/\1/p'
Trey