<html><body><script>
var matches = /(\w+)(\s*(\w+))?/.exec("aaa");
alert(matches.length);
alert(typeof(matches[3]));
</script></body></html>

I'm really new to regular expressions, so this may be a very easy question.

The regular expression above /(\w+)(\s*(\w+))?/ matches patterns like "aaa", "123", "my_var" or "aaa bbb", "123 456", "my_var my_value".

For an expression like "aaa bbb", matches = ["aaa bbb", "aaa", " bbb", "bbb"], but for an expression like "aaa", matches = ["aaa", "aaa", ???, ???]

The first thing that surprised me is that matches.length = 4. I was expecting it to be 2, but I can't find any documentation explaining what it should be. How does it work?

The second thing that surprised me is that the 2 "extra" matches behave differently in the 2 browsers I've tested this in:

  • In Firefox 3.6.3, matches[2] and matches[3] are undefined.

  • In Internet Explorer 6, matches[2] and matches[3] are an empty string.

Basically, how should I check if I've got a "short" (like "aaa") or a "long" (like "aaa bbb") expression?

+3  A: 

The matches array contains two kinds of entries: the whole matched string, followed by one entry for each parenthesized (capturing) group. So in this case it has four elements: the whole match, "aaa"; the first group, "aaa" again; and the groups (\s*(\w+)) and (\w+), which both matched nothing.

The difference between Firefox and IE is trivial: one reports the unmatched groups as undefined, the other as an empty string.

The answer to how you should check the match results is simple: check the values of matches[2] and matches[3] and see whether they're undefined or empty. If the strings you parse all follow the pattern \w+\s*\w+, calling String.split() on them will do just fine. The resulting array will be short if your string is short and long if your string is like "aaaa bbbb". Be careful with cases like "aaa " though.

nil
About the 1st question, I was expecting it to have 2 matches instead of 4. But having it return an entry for every parenthesis kind of makes sense. The 2nd question is the real doubt. Since I have seen 2 different behaviors, I don't know which one is supposed to be correct, nor have I seen it documented. if (matches[2]) seems to work in both, but I'd like to see some documentation for this. I can't use String.split(), since the real regular expression is much longer and more complex than that one.
GameZelda
@GameZelda "", 0 and undefined are all false values in JavaScript. If you are afraid of this behavior, check both conditions instead, e.g. `if (typeof matches[2] === "undefined" || matches[2] === "")`
nil
+2  A: 

The standard (ECMAScript 5) is pretty clear. The length should be 4, and IE is wrong (shocking, I know).

From §15.10.2.1, "NcapturingParens is the total number of left capturing parentheses." You have 3.

"A State is an ordered pair (endIndex, captures) where endIndex is an integer and captures is an internal array of NcapturingParens values. [...] The nth element of captures is either a String that represents the value obtained by the nth set of capturing parentheses or undefined if the nth set of capturing parentheses hasn’t been reached yet."

§15.10.6.2, which describes exec, says:

9. d. i. Let r be the State result of the call to [[Match]]. [...]

12. Let n be the length of r's captures array. (This is the same value as 15.10.2.1's NcapturingParens.)

13. Let A be a new array created as if by the expression new Array() [...]

17. Call the [[DefineOwnProperty]] internal method of A with arguments "length", Property Descriptor {[[Value]]: n + 1}, and true. [...]

20. For each integer i such that i > 0 and i ≤ n

a. Let captureI be ith element of r's captures array.

b. Call the [[DefineOwnProperty]] internal method of A with arguments ToString(i), Property Descriptor {[[Value]]: captureI, [[Writable]]: true, [[Enumerable]]: true, [[Configurable]]: true}, and true.

21. Return A.

So the length should definitely be 4 (3 + 1), and captures that don't get reached (like (\s*(\w+)) in your pattern) remain undefined. Luckily, undefined and "" (empty string) are both falsy, meaning they are false when treated as a boolean. So you can work around IE's bug by testing if (matches[2]).
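
A minimal sketch of that check, using the pattern from the question. The helper name classify and its return values are hypothetical, chosen only for illustration:

```javascript
// Hypothetical helper: reports whether the question's pattern matched
// a "short" input ("aaa") or a "long" one ("aaa bbb").
function classify(input) {
  var matches = /(\w+)(\s*(\w+))?/.exec(input);
  if (!matches) {
    return "none"; // exec returns null when nothing matches
  }
  // matches[2] is undefined per the standard and "" in IE;
  // both are falsy, so one truthiness test covers both browsers.
  return matches[2] ? "long" : "short";
}
```

With this, classify("aaa") gives "short" and classify("aaa bbb") gives "long". A truthiness test is safe here because a group that did match always holds a non-empty string, and any non-empty string (even "0") is truthy.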

Matthew Flaschen
This was exactly what I wanted to see :)
GameZelda
+2  A: 

Try it with these two regexes:

var m1 = /(\w+)(\s*)/.exec("aaa");   // ["aaa", "aaa", ""]
var m2 = /(\w+)(\s+)?/.exec("aaa");  // ["aaa", "aaa", undefined]

In the first case, group #2 doesn't consume any characters, but the * means a zero-length match is okay; that group is said to have matched nothing, i.e., an empty string. In the second case, (\s+) fails, but the overall match succeeds because the group itself was optional. The undefined result indicates that the group did not participate in the match.

That's how it's supposed to work: an empty string means the group participated in the match but didn't consume any characters; undefined means it didn't participate in the match. By returning an empty string for non-participating groups, Internet Explorer erases the distinction between a group that matches nothing and a group that doesn't match.
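
To make the distinction concrete, here is a small sketch that assumes a standards-conforming engine (it would not tell the two cases apart in the IE versions discussed above). The helper name describeGroup is hypothetical:

```javascript
// Hypothetical helper: describes how one capturing group behaved.
function describeGroup(matches, index) {
  var value = matches[index];
  if (value === undefined) {
    return "did not participate";
  }
  if (value === "") {
    return "participated, matched nothing";
  }
  return "participated, matched " + JSON.stringify(value);
}

var m1 = /(\w+)(\s*)/.exec("aaa");   // group 2 matches zero characters
var m2 = /(\w+)(\s+)?/.exec("aaa");  // group 2 is skipped entirely

var d1 = describeGroup(m1, 2);
var d2 = describeGroup(m2, 2);
```

The strict comparisons (===) matter here: a plain truthiness test would lump the two cases together, which is fine for the workaround but loses the information described above.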

The situation is a lot worse than that, though, and IE is not the only bad guy; see this blog post for the gory details.

But there's one thing that all browsers agree on: the number of elements in the match array is controlled by the number of capturing groups in the regex, whether they participate in the match or not.
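
As a quick illustration with the pattern from the question (the variable names are only for this sketch), both the short and the long input produce an array with one entry for the whole match plus one per capturing group:

```javascript
var re = /(\w+)(\s*(\w+))?/;  // three capturing groups

var shortMatch = re.exec("aaa");
var longMatch = re.exec("aaa bbb");

// 1 entry for the whole match + 3 entries for the groups,
// whether or not the groups took part in the match.
var lengths = [shortMatch.length, longMatch.length];
```

So the array length cannot be used to tell a "short" input from a "long" one; only the values of the group entries can.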

Alan Moore