Some Context

From Javascript: The Definitive Guide:

When regexp is a global regular expression, however, exec() behaves in a slightly more complex way. It begins searching string at the character position specified by the lastIndex property of regexp. When it finds a match, it sets lastIndex to the position of the first character after the match.

I think anyone who works with JavaScript RegExps on a regular basis will recognize this passage. However, I have found some strange behavior in this method.

The Problem

Consider the following code:

>> rx = /^(.*)$/mg

>> tx = 'foo\n\nbar'

>> rx.exec(tx)
[foo,foo]
>> rx.lastIndex
3
>> rx.exec(tx)
[,]
>> rx.lastIndex
4
>> rx.exec(tx)
[,]
>> rx.lastIndex
4
>> rx.exec(tx)
[,]
>> rx.lastIndex
4

The RegExp seems to get stuck on the second line and doesn't increment the lastIndex property. This seems to contradict The Rhino Book. If I set it myself as follows, it continues and eventually returns null as expected, but it seems like I shouldn't have to.

>> rx.lastIndex = 5
5
>> rx.exec(tx)
[bar,bar]
>> rx.lastIndex
8
>> rx.exec(tx)
null

Conclusion

Obviously I can increment the lastIndex property any time the match is the empty string. However, being the inquisitive type, I want to know why it isn't incremented by the exec method. Why isn't it?
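For reference, the manual workaround mentioned above might look like the following. This is only a minimal sketch; `collectLines` is a hypothetical helper name, not something from the book or the browsers.

```javascript
// Hypothetical helper: collect group 1 of /^(.*)$/gm for every line,
// nudging lastIndex forward whenever the match is empty.
function collectLines(text) {
  const rx = /^(.*)$/gm;
  const lines = [];
  let m;
  while ((m = rx.exec(text)) !== null) {
    lines.push(m[1]);
    // A zero-length match leaves lastIndex equal to m.index, so the next
    // exec() would find the same empty match again. Step past it by hand.
    if (m[0].length === 0) {
      rx.lastIndex++;
    }
  }
  return lines;
}

console.log(collectLines('foo\n\nbar')); // [ 'foo', '', 'bar' ]
```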

Notes

I have observed this behavior in Chrome and Firefox. It seems to happen only when there are adjacent newlines.

[edit]

Tomalak says below that changing the pattern to /^(.+)$/gm will cause the expression not to get stuck, but the blank line is ignored. Can this be altered to still match the line? Thanks for the answer Tomalak!

[edit]

Using the following pattern and using group 1 works for all strings I can think of. Thanks again to Tomalak.

/^(.*)((\r\n|\r|\n)|$)/gm

[edit]

The previous pattern returns the blank line. However, if you don't care about the blank lines, Tomalak gives the following solution, which I think is cleaner.

/^(.*)[\r\n]*/gm

[edit]

Both of the previous two solutions get stuck on trailing newlines, so you have to either strip them or increment lastIndex manually.

[edit]

I found a great article detailing the cross-browser issues with lastIndex over at Flagrant Badassery. Besides the awesome blog name, the article gave me a much more in-depth understanding of the issue, along with a good cross-browser solution. The solution is as follows:

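var rx = /^/gm,
    tx = 'A\nB\nC',
    m;

while(m = rx.exec(tx)){
    // Some browsers already bump lastIndex past a zero-length match;
    // undo that so lastIndex is consistent everywhere.
    if(!m[0].length && rx.lastIndex > m.index){
        --rx.lastIndex;
    }

    foo(); // placeholder for the per-match work

    // Step past a zero-length match so the next exec() cannot
    // return the same empty match again.
    if(!m[0].length){
        ++rx.lastIndex;
    }
}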
+5  A: 

The problem is that the dot in

^(.*)$

does not match newline characters, but with your "m" switch you make "^" and "$" anchor at every line boundary, not just at the start and end of the string. That means the "nothing" between "\n" and "\n" can be matched successfully with "(.*)".

Since this match is of zero width, the lastIndex property cannot advance. Try:

^(.+)$
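A quick sketch with the string from the question, to show the effect: with `.+` every match consumes at least one character, so lastIndex always moves forward, and the blank line simply produces no match. The variable name `found` is only for illustration.

```javascript
const rx = /^(.+)$/gm;
const tx = 'foo\n\nbar';
const found = [];
let m;
// Every match is at least one character long, so this loop terminates.
while ((m = rx.exec(tx)) !== null) {
  found.push(m[1]);
}
console.log(found); // [ 'foo', 'bar' ]
```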

EDIT: To match the blank lines as well, do this:

^(.*)\n?     // remove all \r characters beforehand

or

^(.*)(?:\r\n|\n\r|\n|\r)?  // all possible CR/LF combinations, but *once* at most

...and just go for match group 1.
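As a rough sketch, the second pattern could be used like this. `linesViaGroup1` is a hypothetical helper name, and the empty-match guard inside the loop is an extra safety net added here (it matters when the string ends in a newline), not part of the pattern itself.

```javascript
// Hypothetical helper built around the second pattern above.
function linesViaGroup1(text) {
  const rx = /^(.*)(?:\r\n|\n\r|\n|\r)?/gm;
  const lines = [];
  let m;
  while ((m = rx.exec(text)) !== null) {
    lines.push(m[1]);
    // Guard (an addition for this sketch): an empty match, e.g. right
    // after a trailing newline, would otherwise repeat forever.
    if (m[0].length === 0) {
      rx.lastIndex++;
    }
  }
  return lines;
}

console.log(linesViaGroup1('a\r\nb\n\nc')); // [ 'a', 'b', '', 'c' ]
```

Because the line terminator is consumed as part of each match, blank lines in the middle of the string are no longer zero-length matches.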

Tomalak
This works, but effectively ignores the blank line. I want to match it, just not get stuck.
brad
Changed my answer accordingly.
Tomalak
Edited regex #3 again to account for the last line in the string which was previously not matched.
Tomalak
Nice! I don't think \n\r is ever used though. Wikipedia, Newline: http://en.wikipedia.org/wiki/Newline
brad
Yes, I don't either. But with custom-generated strings/documents you never know. Might well be that someone got it wrong in their app. I have to look it up often enough when using Chr(13) and Chr(10), somehow it just doesn't stick.
Tomalak
+1  A: 

The problem with lastIndex is that a JavaScript implementation that follows the standard to the letter sets it to the offset of the next character after the match. For regular expressions, like yours, that allow zero-length matches, exec() will thus get stuck in an infinite loop when a zero-length match is found. The next match attempt will begin at the same position, where the same zero-length match is found.
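To make the stuck state visible without actually looping forever, here is a small sketch using the question's pattern and string. It assumes an engine that follows the standard here; the `seen` array is just for illustration.

```javascript
const rx = /^(.*)$/gm;
const tx = 'foo\n\nbar';

rx.exec(tx); // first match: "foo", lastIndex becomes 3

// Call exec() a fixed number of times instead of looping until null,
// since on a standard-following engine that loop would not end.
const seen = [];
for (let i = 0; i < 3; i++) {
  const m = rx.exec(tx);
  // Record where the match starts, how long it is, and lastIndex afterwards.
  seen.push([m.index, m[0].length, rx.lastIndex]);
}
console.log(seen); // [ [ 4, 0, 4 ], [ 4, 0, 4 ], [ 4, 0, 4 ] ]
```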

Traditionally, regex engines deal with this by skipping one character when a zero-length match is found. Incidentally, Internet Explorer does this as well.
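For what it's worth, newer JavaScript has this skip built into String.prototype.matchAll (ES2020), which moves lastIndex forward by itself after an empty match. A minimal sketch, assuming a runtime that supports matchAll:

```javascript
const tx = 'foo\n\nbar';
// matchAll requires the g flag and steps past zero-length matches itself.
const lines = Array.from(tx.matchAll(/^(.*)$/gm), (m) => m[1]);
console.log(lines); // [ 'foo', '', 'bar' ]
```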

I've blogged about this in detail in the past: Watch Out for Zero-Length Matches

Jan Goyvaerts