ansaurus

Question

Why does this regular expression not match adjacent occurrences of newline?

Answer 1

+1 A:

You could simply replace \r?\n with \r\n:

s = s.gsub(/\r?\n/, "\r\n")

That way, all \r\n's and \n's are replaced by \r\n.
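A minimal sketch of that replacement in action. The helper name `normalize_newlines` and the sample strings are made up for illustration; the regex and replacement are the ones shown above.

```ruby
# Hypothetical helper wrapping the gsub call from the answer.
def normalize_newlines(s)
  # \r?\n consumes an optional CR together with the LF, so an existing
  # CRLF pair is rewritten to itself instead of gaining a second CR.
  s.gsub(/\r?\n/, "\r\n")
end

p normalize_newlines("One\n\nTwo\r\nThree")  # => "One\r\n\r\nTwo\r\nThree"
p normalize_newlines("\nOne")                # => "\r\nOne"
```

Because every match starts at the linefeed (or the CR directly before it), adjacent linefeeds and a linefeed at the very start of the string are all handled.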

Bart Kiers 2009-11-13 14:17:37

Ooh thanks. I will replace the code with that as it removes the dependency on the Oniguruma gem but am still interested in why my other regexp didn't work as I thought it would.

mikej 2009-11-13 14:20:52

The reason is that in `One\n\nTwo`, after the 1st `\n` is replaced by `\r\n`, there is no character left before the 2nd `\n` to satisfy the `[^\r]` in your expression.

rsp 2009-11-13 14:31:28

You could make your original approach work by matching the linefeed in a lookahead: `s.gsub!(/([^\r])(?=\n)/, "\\1\r")`. But it still fails if `\n` is the very first character, so Bart's way is more correct as well as clearer.
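A short sketch of the lookahead variant, using the non-destructive `gsub` instead of `gsub!`. The helper name `add_cr_with_lookahead` and the sample strings are assumptions for illustration only.

```ruby
# Hypothetical helper; the pattern and replacement come from the comment above.
def add_cr_with_lookahead(s)
  # The lookahead checks for \n without consuming it, so that same \n
  # is still available as the [^\r] character of the next match.
  s.gsub(/([^\r])(?=\n)/, "\\1\r")
end

p add_cr_with_lookahead("One\n\nTwo\r\nThree")  # => "One\r\n\r\nTwo\r\nThree"

# A linefeed at the very start has no preceding character, so it is missed.
p add_cr_with_lookahead("\nOne")                # => "\nOne"
```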

Alan Moore 2009-11-13 15:55:50

@rsp if you post your comment in an answer I can vote it up, @Alan Moore thanks for additional thoughts

mikej 2009-11-13 16:16:59

Answer 2

+1 A:

Just writing to explain (rsp's comment says the same thing) why your original regex didn't work. The regex engine first matches ([^\r])\n at the ^ characters:

One\r\n\nTwo\r\nThree
   ^^^^

After the first replacement, the regex engine is at the ^:

One\r\n\nTwo\r\nThree
       ^

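It's now trying to match ([^\r])\n, but the character before this \n was already consumed by the first match, so nothing is left in front of it to satisfy [^\r]. The \n at the caret can itself match [^\r], but it is followed by T rather than \n, so the attempt fails and the second linefeed is left unconverted.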
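The behaviour can be reproduced with a small sketch. The question's exact code is not quoted here, so the replacement string `"\\1\r\n"`, the helper name `add_cr_original`, and the input string are assumptions.

```ruby
# Hypothetical reconstruction of the original approach: require a
# non-CR character in front of the LF and put it back with a CR added.
def add_cr_original(s)
  s.gsub(/([^\r])\n/, "\\1\r\n")
end

input = "One\n\nTwo\r\nThree"

# Only one match is found: "e" plus the first LF. The second LF has
# no unconsumed character in front of it.
p input.scan(/([^\r])\n/)  # => [["e"]]

p add_cr_original(input)   # => "One\r\n\nTwo\r\nThree"
```

Note that `gsub` keeps scanning the original string from the end of the previous match; it does not rescan the text it has just inserted.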

Andomar 2009-11-13 15:48:02

my conclusion is it should be ([^\r])?\n to cover the case of no characters there?

Ape-inago 2009-11-13 15:53:00

@Ape-inago: `([^\r])?\n` would match any `\n`, even `\r\n`. Bart already posted the best solution I'd say.
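A quick sketch of why the optional group goes wrong, assuming the same `"\\1\r\n"` replacement as the original approach (the helper name `add_cr_optional_group` is made up):

```ruby
# Hypothetical helper showing the optional-group variant.
def add_cr_optional_group(s)
  # When the group does not participate, \1 expands to an empty
  # string, so a bare \n matches even when a \r precedes it.
  s.gsub(/([^\r])?\n/, "\\1\r\n")
end

# The existing CRLF gains a second CR.
p add_cr_optional_group("Two\r\nThree")  # => "Two\r\r\nThree"
```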

Andomar 2009-11-13 16:00:01

So are you saying that after the first replacement the regexp cursor is at the 2nd \n as you showed, which gets matched by the [^\r], and then the T of Two gets compared with \n, which doesn't match? In effect, the problem is caused because the character positions have been changed by adding an extra character in the substitution part way through the matching?

mikej 2009-11-13 16:05:13

@mikej: Yes, `[^\r]` is a positive assertion: it says that there must be a character that's not `\r`. A negative assertion would say there may not be a character that's `\r`. In regex terms that's `(?<!\r)\n`, but not all regex flavors support negative lookbehind.
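A minimal sketch of the negative lookbehind version, assuming Ruby 1.9 or later, whose built-in regex engine supports lookbehind. The helper name `add_cr_with_lookbehind` and the sample strings are illustrative.

```ruby
# Hypothetical helper using the negative lookbehind from the comment above.
def add_cr_with_lookbehind(s)
  # (?<!\r) only asserts that the previous character is not a CR.
  # It consumes nothing, so it also succeeds at the start of the string.
  s.gsub(/(?<!\r)\n/, "\r\n")
end

p add_cr_with_lookbehind("One\n\nTwo\r\nThree")  # => "One\r\n\r\nTwo\r\nThree"
p add_cr_with_lookbehind("\nOne")                # => "\r\nOne"
```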

Andomar 2009-11-13 16:30:32
