views:

42

answers:

3

I have a situation where I might be getting one or both of a pair of characters and I want to match either.

For example:

str = 'cddd a dfsdf b sdfg ab uyeroi'

I want to match any "a" or "b" or "ab". If "ab" appears together, I want to catch it as a single match (not as two matches, "a" and "b"). If I get "ab", it will always be in that order ("a" will always precede "b").

What I have is:

/[ab]|ab/

But I'm not sure if the ab is going to be a stronger match term than the [ab].

Thanks for the assistance.

+5  A: 

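Your current expression will not do what you want in most popular regular expression engines: it will match a or b on their own, but never ab as a single match, because the first alternative [ab] already succeeds on the a. The behaviour depends on the implementation of the regex engine: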

You can easily find out whether the regex flavor you intend to use has a text-directed or regex-directed engine. If backreferences and/or lazy quantifiers are available, you can be certain the engine is regex-directed. You can do the test by applying the regex "regex|regex not" to the string "regex not". If the resulting match is only "regex", the engine is regex-directed. If the result is "regex not", then it is text-directed. The reason behind this is that the regex-directed engine is "eager".
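
As a rough illustration of that test, here is a minimal sketch in Python. Python is an assumption on my part (the question does not say which language is in use); its re module is a backtracking engine, so it should report the shorter match.

```python
import re

# Apply the pattern "regex|regex not" to the string "regex not".
# An eager, regex-directed engine stops at the first alternative that works.
match = re.search(r"regex|regex not", "regex not")
result = match.group(0)

if result == "regex":
    print("regex-directed (eager) engine")
else:
    print("text-directed engine")
```

The same two-line check can be repeated in whatever engine you actually use.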

If you are using a regex-directed engine then to fix it you could reverse the order of the terms in the alternation to ensure it attempts to match ab first:

/ab|[ab]/

Or you could rewrite the expression so that the order doesn't matter:

/ab?|b/
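
To see the difference between the three patterns, here is a small sketch using Python's re module (an assumption, since the question does not name a language) on the sample string from the question. The variable names are only for illustration.

```python
import re

text = 'cddd a dfsdf b sdfg ab uyeroi'

# Original order: [ab] wins on the "a" of "ab", so "ab" is split in two.
original = re.findall(r'[ab]|ab', text)

# Reversed order: "ab" is tried first at each position.
reordered = re.findall(r'ab|[ab]', text)

# Order-independent form: "a" optionally followed by "b", or a lone "b".
order_free = re.findall(r'ab?|b', text)

print(original)    # ['a', 'b', 'a', 'b']
print(reordered)   # ['a', 'b', 'ab']
print(order_free)  # ['a', 'b', 'ab']
```

In an eager engine, only the last two patterns return "ab" as a single match.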
Mark Byers
I'd already swapped the expression after my post (obvious error). Is it true then that matches always attempt from left to right within the regexp if there is a potential for more than one or'ed statement to match? I do like the second approach too.
James Fassett
@James Fassett: Well, it might depend on the particular regular expression engine you are using, but all the ones I've ever seen match from left to right in alternations. There is some discussion of this behaviour here: http://www.regular-expressions.info/alternation.html (see the part about eager).
Mark Byers
Great link - thanks Mark.
James Fassett
@James Fassett: I've added more info to my answer too - there are some regular expression engines that don't use left-to-right matching.
Mark Byers
@Mark Byers: I'll do some tests with the engine I am using. Either way I'll probably go with the second option.
James Fassett
A: 

This works:

((ab)|a|b)

or better

(ab|[ab])
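
A quick sketch of how the two variants compare, assuming Python's re module (the question does not name an engine). Both find the same overall matches; the first one just carries an extra inner capture group that is only set when "ab" matched.

```python
import re

text = 'cddd a dfsdf b sdfg ab uyeroi'

# ((ab)|a|b): group 0 is the whole match, group 2 is the inner (ab) group.
nested = [(m.group(0), m.group(2)) for m in re.finditer(r'((ab)|a|b)', text)]

# (ab|[ab]): same overall matches with a single capture group.
simple = [m.group(0) for m in re.finditer(r'(ab|[ab])', text)]

print(nested)  # [('a', None), ('b', None), ('ab', 'ab')]
print(simple)  # ['a', 'b', 'ab']
```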
Floyd
A: 

You have your instances of a, b, and ab as words. Do you need to find them only as whole words? If so, you should try

/\bab\b|\b[ab]\b/
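
For illustration, here is a minimal sketch of what the word boundaries change, assuming Python's re module. The sample string is made up for this example and is not from the question.

```python
import re

# Hypothetical sample: "cab" and "about" contain a/b inside longer words.
sample = 'cab a b ab about'

# With \b, only standalone "a", "b" and "ab" are matched.
whole_words = re.findall(r'\bab\b|\b[ab]\b', sample)

# Without \b, the letters inside "cab" and "about" match as well.
anywhere = re.findall(r'ab|[ab]', sample)

print(whole_words)  # ['a', 'b', 'ab']
print(anywhere)     # ['ab', 'a', 'b', 'ab', 'ab']
```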

Robusto
I don't need to match on boundaries. I just wanted to make the characters in the example stand out. Thanks.
James Fassett