Have a regex:

.*?
(rule1|rule2)
(?:(rule1|rule2)|[^}])*

(It's designed to parse CSS files, and the 'rules' are generated by JS.)

When I try this in IE, all works as it should. Ditto when I try it in RegexBuddy or The Regex Coach.

But when I try it in Firefox or Chrome, the results are missing values.
Can anyone please explain what the real browsers are thinking, or how I can achieve results similar to IE's?

To see this in action, load up a page that gives you interactive testing, such as the W3Schools try-it-out editor.

Here's the source that can be pasted in: http://www.w3schools.com/jsref/tryit.asp?filename=tryjsref_regexp_exec

<html>
<body>

<script type="text/javascript">

var str="#rot { rule1; rule2; }";

var patt=/.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/i;

var result=patt.exec(str);
for(var i = 0; i < 3; i++) document.write(i+": " + result[i]+"<br>"); 

</script>
</body>
</html>

Here is the output in IE:

0: #rot { rule1; rule2; 
1: rule1
2: rule2

Here is the output in Firefox and Chrome:

0: #rot { rule1; rule2; 
1: rule1
2: undefined

When I try the same using string.match, I get back an array of undefined in all browsers, including IE.

var str="#rot { rule2; rule1; rule2; }";
var patt=/.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/gi;
var result=str.match(patt);
for(var i = 0; i < 5; i++) document.write(i+": "+result[i]+"<br>"); 

As far as I can tell, the issue is the last non-capturing parenthesis.
When I remove them, the results are consistent cross browser - and match() gets results.

However, it does capture from the last parenthesis, in all browsers, in the following example:

<script>
var str="#rot { rule1; rule2 }";
var patt=/.*?(rule1|rule2)(?:(rule1 |rule2 )|[^}])*/gi;
var result=patt.exec(str);
for(var i =0; i < 3; i++) document.write(i+": "+result[i]+"<br>"); 
</script>

Notice that I've added a space to the patterns in the second regex.
The same applies if I add any negative character to the strings in the second regex:

var patt=/.*?(rule1|rule2)(?:(rule1[^1]|rule2[^1])|[^}])*/gi;

What the expletive is going on?!
Every other string I've tried produces the same missing captures as the first example. Any help is greatly appreciated!

EDIT: The code has been shortened, and many hours of research put in, on Matthew's advice.
The title has been changed to make the thread easier to find.

I have marked Matthew's answer as correct, as it is well researched and described.
My answer below (written before Matthew revised his) states the logic in simpler and more direct terms.

A: 

Try removing the ?: at the front of lines 4 and 5 in your regex above. I haven't tested it, but it really looks like they don't belong there.

(?:^|})
([^{]+)
[^}]+?-moz-
((transform[^-][^;}]+)|(transform-origin[^;}]+))
(-moz-(?:(transform[^-][^;}]+)|(transform-origin[^;}]+))|[^}])*
Ben Lee
All that means is that he doesn't want to capture that.
Matthew Flaschen
I know that's what it means. And it looks like it should capture that.
Ben Lee
They were intentional, but I removed them and greatly simplified the example, as in the question, it was just noise. Please have a look-see again, I'm quite exasperated!
SamGoody
+1  A: 

IE is wrong. In ECMAScript, exactly one alternative can result in a string. All the others have to be undefined (not "" or anything else).

So for your alternatives, including (transform[^-][^;}]+)|(transform-origin[^;}]+), Firefox and Chrome are correct in setting the failed capture to undefined.

There's an example in the ECMAScript 5 standard (§15.10.2.3) specifically about this:

NOTE The | regular expression operator separates two alternatives. The pattern first tries to match the left Alternative (followed by the sequel of the regular expression); if it fails, it tries to match the right Disjunction (followed by the sequel of the regular expression). If the left Alternative, the right Disjunction, and the sequel all have choice points, all choices in the sequel are tried before moving on to the next choice in the left Alternative. If choices in the left Alternative are exhausted, the right Disjunction is tried instead of the left Alternative. Any capturing parentheses inside a portion of the pattern skipped by | produce undefined values instead of Strings.

Thus, for example, /a|ab/.exec("abc") returns the result "a" and not "ab". Moreover, /((a)|(ab))((c)|(bc))/.exec("abc") returns the array ["abc", "a", "a", undefined, "bc", undefined, "bc"] and not ["abc", "ab", undefined, "ab", "c", "c", undefined]
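The spec example above can be checked directly in any browser console or in Node. This is only a small sketch; the variable names specSimple and specGroups are mine, not from the standard.

```javascript
// The two regexes quoted from the ECMAScript 5 note, run as-is.
var specSimple = /a|ab/.exec("abc");
var specGroups = /((a)|(ab))((c)|(bc))/.exec("abc");

console.log(specSimple[0]);  // "a" (the left alternative wins, not "ab")

// Groups belonging to the skipped alternatives are undefined, not "".
for (var g = 0; g < specGroups.length; g++) {
  console.log(g + ": " + specGroups[g]);
}
// 0: abc
// 1: a
// 2: a
// 3: undefined
// 4: bc
// 5: undefined
// 6: bc
```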

EDIT: I figured the last part out. This applies to the original as well as the simplified version. In both cases, rule1 and rule2 can't match the ; (in the original because ; is in the negated character class [^;}]). Thus, when a ; is hit between declarations, the alternation chooses [^}]. Thus, it must set the last two captures to undefined.

For the * to be fully greedy, the final ; and space in the input must also be matched. For the last two * repetitions (';' and ' '), the alternation again chooses [^}], so the captures should be set undefined at the end too.

IE fails to do this in both cases, so they stay equal to "rule1" and "rule2".

Finally, the reason that the second example behaves differently is that (transform-origin[^;}]+) matches on the very last * repetition, since there's no ; before the end.

EDIT 2: I'll walk through what should be happening in both current examples. match is the match array.

var str="#rot { rule1; rule2; }";
var patt=/.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/i;

.*? - "#rot { "

(rule1|rule2) - "rule1"
match[1] = "rule1"

Star 1

[^}] - ";"
match[2] = undefined 

Star 2

[^}] - " "
match[2] = undefined 

Star 3

(rule1|rule2) - "rule2"
match[2] = "rule2"

Star 4

[^}] - ";"
match[2] = undefined 

Star 5

[^}] - " "
match[2] = undefined 

Again, IE isn't setting match[2] to undefined.
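As a quick check of the walkthrough above, here is a minimal sketch that runs the first example in a spec-compliant engine (Firefox, Chrome, Node). The names firstStr, firstPatt and firstResult are mine, chosen so they don't collide with the other examples.

```javascript
var firstStr = "#rot { rule1; rule2; }";
var firstPatt = /.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/i;
var firstResult = firstPatt.exec(firstStr);

// The last two star iterations (";" and " ") go through [^}],
// so the second capture ends up undefined in a compliant engine.
console.log(JSON.stringify(firstResult[0]));  // "#rot { rule1; rule2; "
console.log(firstResult[1]);                  // rule1
console.log(firstResult[2]);                  // undefined
```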

For the str.match example, you're using the global flag. That means it returns an array of whole matches, without captures. This applies to any use of String.match with the g flag. If you use g, you have to use exec to get captures.
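To illustrate the difference, here is a rough sketch. The first call is the match() from the question; the second part uses a simplified pattern, /(rule1|rule2)/gi, which is my own assumption about what is ultimately wanted (a list of every rule found), not something from the question.

```javascript
var ruleStr = "#rot { rule2; rule1; rule2; }";

// With the g flag, match() returns whole matches only, never capture groups.
var wholeMatches = ruleStr.match(/.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/gi);
console.log(wholeMatches.length);  // 1 (one whole match, no captures)

// Hypothetical simplified pattern: calling exec() in a loop on a global
// regex yields one result per match, each with its own captures.
var rulePattern = /(rule1|rule2)/gi;
var found = [];
var m;
while ((m = rulePattern.exec(ruleStr)) !== null) {
  found.push(m[1]);
}
console.log(found.join(","));  // rule2,rule1,rule2
```

Collecting captures one match at a time like this avoids relying on what a repeated group holds at the end, which is exactly where the browsers differ.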

var str="#rot { rule1; rule2 }";
var patt=/.*?(rule1|rule2)(?:(rule1 |rule2 )|[^}])*/gi;

.*? - "#rot { "
(rule1|rule2) - "rule1"
match[1] = "rule1"

Star 1

[^}] - ";"
match[2] = undefined 

Star 2

[^}] - " "
match[2] = undefined 

Star 3

(rule1 |rule2 ) - "rule2 "
match[2] = "rule2 "

Since this is the last *, the capture never gets set to undefined.
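The same kind of check for this second example, again only a sketch with my own variable names (secondStr, secondPatt, secondResult):

```javascript
var secondStr = "#rot { rule1; rule2 }";
var secondPatt = /.*?(rule1|rule2)(?:(rule1 |rule2 )|[^}])*/gi;
var secondResult = secondPatt.exec(secondStr);

// "rule2 " is consumed by the capturing alternative on the final star
// iteration, so nothing resets the capture afterwards.
console.log(secondResult[1]);                  // rule1
console.log(JSON.stringify(secondResult[2]));  // "rule2 "
```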

Matthew Flaschen
Good point, though I really don't care if the response is undefined or empty, as much as I care that results that should have been caught not be ignored.
SamGoody
Thank you, but I don't think this works. Although the ; is in the negated char class, the capturing parenthesis should capture through - but not including the semicolon. The same for the greedy star. If you try with the current simplified example, you will see that the latter parentheses do not capture anything, even if you remove the closing brace from the string and allow the capture to go all the way to the end.
SamGoody
@Sam, it does capture up to but not including the semicolon, but the capture gets reset to undefined later. I've walked through the first three examples above. By the way, since we're using so many examples, it would help to give them unique variable names to avoid confusion.
Matthew Flaschen
A: 

Your 4th and 5th patterns are competing. Ultimately it is up to the implementation of the browser's regex engine to determine the matches. This wouldn't be the first difference between IE and others.

(?:(transform[^-][^;}]+)|(transform-origin[^;}]+))
(?:-moz-(?:(transform[^-][^;}]+)|(transform-origin[^;}]+))|[^}])*

Both of these are prefixed by transform and suffixed by origin. You need to condense these into a more concise expression. Something like the following is an example:

((?:-moz-)?(?:transform-origin[^;}]+))
Jason McCreary
Please back up your argument that the standard allows undefined behavior.
Matthew Flaschen
@Matthew, I have removed *undefined behavior* from my answer as I agree it may have been misleading. Nonetheless, I believe this to be at least part of the OPs problem. After reviewing your answer, you seem to have the same belief.
Jason McCreary
@Jason, I also don't think it's "up to the implementation". I just think IE has a bug.
Matthew Flaschen
Those two rules were just a (poor) example of two CSS rules. They cannot be condensed. See fixes to question (where I simplified the example a lot). The two patterns are not competing though - it should catch the first pattern, and then the second. (Note the lazy quantifier at the beginning of the regex). This regex works fine in every regex tool at my disposal, including regexBuddy with the language set to JS
SamGoody
@Matthew, pretty critical of my wording. @SamGoody, sorry, but I still maintain that `(transform[^-][^;}]+)|(transform-origin[^;}]+)` both capture `transform-origin: somevalue`. Therefore they *compete*.
Jason McCreary
+2  A: 

There is a disagreement about how to handle repeated capturing parentheses.

Firefox and Webkit both make the following assumptions; IE makes only the first:

  1. If a capturing parenthesis is repeated, capturing something new each time, only the last result is stored.
  2. If the parentheses are inside a larger non-capturing repeated group, and do not capture anything on the last loop, the parentheses should capture nothing.

For example:

var str = 'abcdef';
var pat = /([a-f])+/;

pat.exec will catch an 'a', then replace it with a 'b' etc, until it returns an 'f'.
In all browsers.

var str = 'abcdefg';
var pat = /(?:([a-f])|g)+/;

pat.exec will first fill the capturing parentheses with an 'a', then a 'b', and so on through 'f'.
But the non-capturing parent will then continue and match the 'g'. During that loop there is nothing to go into the capturing parentheses, so the capture is emptied.
And the regex will return undefined for that capture.

IE considers the capturing parentheses to have caught nothing in the last loop through, and therefore sticks with the last valid response of 'f'.
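Both cases can be tried side by side. A minimal sketch, with variable names of my own (lastWins, emptied), showing what Firefox, Chrome and Node return:

```javascript
// Case 1: a repeated capturing group keeps only its last capture.
var lastWins = /([a-f])+/.exec("abcdef");
console.log(lastWins[1]);  // f

// Case 2: the final loop matches 'g' through the other alternative,
// so the capture is cleared. Per the description above, IE would keep 'f'.
var emptied = /(?:([a-f])|g)+/.exec("abcdefg");
console.log(emptied[0]);   // abcdefg
console.log(emptied[1]);   // undefined
```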

Which is useful, but not logical.

Being illogically useful is more destructive than useful. (We all hate quirksmode.)
Advantage Firefox/Chrome.

SamGoody
Yeah, Firefox and Chrome comply with "Any capturing parentheses inside a portion of the pattern skipped by | produce `undefined` values instead of Strings." (see standard quote above), IE doesn't.
Matthew Flaschen