On my OS X 10.5.8 machine, using the regcomp and regexec C functions to match the extended regex "(()|abc)xyz", I find a match for the string "abcxyz" but only from offset 3 to offset 6. My expectation was that the entire string would be matched and that I would see a submatch for the initial "abc" part of the string.

When I try the same pattern and text with awk on the same machine, it shows a match for the entire string as I would expect.

I expect that my limited experience with regular expressions may be the problem. Can somebody explain what is going on? Is my regular expression valid? If so, why doesn't it match the entire string?

I understand that "((abc){0,1})xyz" could be used as an alternative, but the pattern of interest is being automatically generated from another pattern format and eliminating instances of "()" is extra work I'd like to avoid if possible.

For reference, the flags I'm passing to regcomp consist only of REG_EXTENDED. I pass an empty set of flags (0) to regexec.
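For anyone who wants to reproduce this, here is a minimal sketch of the setup described above, using the same flags (REG_EXTENDED for regcomp, 0 for regexec). The names `ere_match`, `print_groups`, and `MAX_GROUPS` are made up for illustration and are not from my actual program.

```c
#include <regex.h>
#include <stdio.h>

#define MAX_GROUPS 3   /* whole match plus up to two subexpressions (illustrative) */

/* Hypothetical helper: compile `pattern` as an ERE and match it against `text`.
 * Returns 0 on a match, REG_NOMATCH when nothing matches, or the regcomp
 * error code if the pattern does not compile. */
int ere_match(const char *pattern, const char *text, regmatch_t groups[MAX_GROUPS])
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0) {
        char msg[128];
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "regcomp failed: %s\n", msg);
        return rc;
    }
    rc = regexec(&re, text, MAX_GROUPS, groups, 0);
    regfree(&re);
    return rc;
}

/* Print each entry as a half-open offset range; unused entries show -1. */
void print_groups(const regmatch_t groups[MAX_GROUPS])
{
    for (int i = 0; i < MAX_GROUPS; i++)
        printf("group %d: [%d,%d)\n", i,
               (int)groups[i].rm_so, (int)groups[i].rm_eo);
}
```

Calling `ere_match("(()|abc)xyz", "abcxyz", groups)` followed by `print_groups(groups)` shows what a given system reports for the pattern in question. Swapping in a pattern without the empty group, such as "(abc)?xyz", gives a baseline to compare against.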

A: 

If you iterate over all matches and don't get both [3,6) and [0,6), then there's a bug. I'm not sure what POSIX mandates about the order in which matches are returned.

wrang-wrang
Iterating over all matches gives me [3,6), [3,3), and [3,3). The first one is the match for the regex as a whole according to the regexec man page.
Eric
A: 

Try (abc|())xyz - I bet it'll produce the same result in both places. I can only presume that the C version tries to match xyz wherever it can and, only if that fails, tries to match abcxyz wherever it can (but, as you see, the first attempt doesn't fail, so it never bothers with the "abc" part), whereas awk must be using its own regex engine that behaves the way you expect.

Your regex is valid. I think the problem is either a) POSIX isn't very clear about how the regex should work, or b) awk isn't using 100% POSIX-compliant regexes (probably because OS X appears to ship a version closer to the original awk). Whichever it is, it probably comes up only because this is somewhat of an edge case and most people wouldn't write the regex that way.

Chris Lutz
Interesting idea! I tried using "(abc|())xyz" and it matched as I expected, returning [0,6) for the whole regex and [0,3) for the submatch. As I understand the standard, it should always use the longest match among all candidates. So if you're right about what's happening, I think it's a bug.
Eric
It sounds like a bug. All I can do is try to explain it unless we peek into the C library's regex implementation, which I don't feel like doing because I feel fairly confident that we know what's going on in there based on the unexpected output. Perhaps you should file a bug report with the library's maintainers (or test it on another computer, for example one using glibc, to see if it might just be an Apple-only or older-version problem).
Chris Lutz
It's not a bug - it is undefined behaviour behaving in an undefined manner.
Jonathan Leffler
+2  A: 

The POSIX standard says:

9.4.3 ERE Special Characters

An ERE special character has special properties in certain contexts. Outside those contexts, or when preceded by a <backslash>, such a character shall be an ERE that matches the special character itself. The extended regular expression special characters and the contexts in which they shall have their special meaning are as follows:

.[\(

The <period>, <left-square-bracket>, <backslash>, and <left-parenthesis> shall be special except when used in a bracket expression (see RE Bracket Expression). Outside a bracket expression, a <left-parenthesis> immediately followed by a <right-parenthesis> produces undefined results.

What you are seeing is the result of invoking undefined behaviour - anything goes.

If you want reliable, portable results, you will have to eliminate the empty '()' notations.
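Since the pattern is generated automatically, one option is to check the generated ERE for an empty group before handing it to regcomp. The following is only a sketch of such a check: `ere_has_empty_group` and `skip_bracket` are hypothetical names, and the scanner assumes the only contexts where "()" is harmless are a backslash escape and a bracket expression, as in the passage quoted above. It detects the construct; it does not rewrite the pattern.

```c
#include <stdbool.h>
#include <stddef.h>

/* Skip a bracket expression that starts at p[0] == '['.
 * Returns a pointer just past the closing ']' (or to the NUL if unterminated). */
static const char *skip_bracket(const char *p)
{
    p++;                              /* skip the opening '[' */
    if (*p == '^') p++;               /* optional negation */
    if (*p == ']') p++;               /* a leading ']' is a literal */
    while (*p != '\0' && *p != ']') {
        if (p[0] == '[' && (p[1] == ':' || p[1] == '.' || p[1] == '=')) {
            char kind = p[1];         /* [:class:], [.coll.], or [=equiv=] */
            p += 2;
            while (*p != '\0' && !(p[0] == kind && p[1] == ']'))
                p++;
            if (*p != '\0') p += 2;   /* skip the closing ":]" etc. */
        } else {
            p++;
        }
    }
    return (*p == ']') ? p + 1 : p;
}

/* Returns true if the ERE contains "()" outside escapes and bracket expressions. */
bool ere_has_empty_group(const char *pattern)
{
    const char *p = pattern;
    while (*p != '\0') {
        if (*p == '\\') {
            p++;                      /* skip the backslash ... */
            if (*p != '\0') p++;      /* ... and the escaped character */
        } else if (*p == '[') {
            p = skip_bracket(p);
        } else if (p[0] == '(' && p[1] == ')') {
            return true;
        } else {
            p++;
        }
    }
    return false;
}
```

With this check the generator can reject or special-case such patterns up front, instead of relying on whatever a particular regex library happens to do with them.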

Jonathan Leffler
Yeah, I think the best choice is to avoid using `()`. Although my system does define the behavior I wanted in its `re_format(7)` man page, the thing to do is stick to POSIX. Thanks for digging up the reference.
Eric