I was trying to write a regexp to replace all occurrences of \n with \r\n unless the \n is already immediately preceded by a \r. I'm doing this in Ruby 1.8.6, which doesn't support lookbehind in regexps, so I tried:

# try to replace \n preceded by anything other than \r with \r\n
str.gsub(/([^\r])\n/, "\\1\r\n") # \\1 is the captured character to be kept

Given a string One\n\nTwo\r\nThree, the intention was for \n\n to be replaced with \r\n\r\n and for the existing \r\n between Two and Three to be left unmodified. However, only the first of the two \n characters matches, i.e. the result is:

 "One\r\n\nTwo\r\nThree"

I tried this in a couple of other regexp engines with the same result.

In the end I was able to solve this by using Oniguruma (which does support positive and negative lookbehind) instead of Ruby's built-in regexps, but I am still interested in why my alternative approach didn't work as I expected.

Thanks for any answers.

+1  A: 

You could simply replace \r?\n with \r\n:

s = s.gsub(/\r?\n/, "\r\n")

That way, all \r\n's and \n's are replaced by \r\n.
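
A quick way to check this against the string from the question, as a minimal sketch (the helper name `to_crlf` is made up here for illustration and is not from the thread):

```ruby
# Hypothetical helper: normalize every line ending to CRLF.
def to_crlf(str)
  # \r?\n matches a bare \n as well as an existing \r\n,
  # so both forms end up as exactly one \r\n.
  str.gsub(/\r?\n/, "\r\n")
end

input  = "One\n\nTwo\r\nThree"
output = to_crlf(input)

p output                     # "One\r\n\r\nTwo\r\nThree"
p to_crlf(output) == output  # true: a second pass changes nothing
```

Because an existing \r\n is matched and rewritten as itself, running the substitution on already converted text should be harmless.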

Bart Kiers
Ooh thanks. I will replace the code with that as it removes the dependency on the Oniguruma gem but am still interested in why my other regexp didn't work as I thought it would.
mikej
The reason is that in `One\n\nTwo` the first match consumes `e\n`, so once the 1st `\n` has been replaced by `\r\n` there is no unconsumed character left before the 2nd `\n` to satisfy the `[^\r]` in your expression.
rsp
You could make your original approach work by matching the linefeed in a lookahead: `s.gsub!(/([^\r])(?=\n)/, "\\1\r")`. But it still fails if `\n` is the very first character, so Bart's way is more correct as well as clearer.
Alan Moore
@rsp if you post your comment in an answer I can vote it up, @Alan Moore thanks for additional thoughts
mikej
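
The lookahead variant from Alan Moore's comment can be tried out with a short sketch like the one below. The variable names are only illustrative; the regex and replacement string are the ones from the comment, used with `gsub` instead of `gsub!`.

```ruby
str = "One\n\nTwo\r\nThree"

# The lookahead (?=\n) checks for a following \n without consuming it,
# so the first \n is still available to act as [^\r] for the second \n.
lookahead = str.gsub(/([^\r])(?=\n)/, "\\1\r")
p lookahead  # "One\r\n\r\nTwo\r\nThree"

# A \n at the very start has no preceding character for [^\r] to match,
# so it is left untouched.
leading = "\nOne".gsub(/([^\r])(?=\n)/, "\\1\r")
p leading    # "\nOne"
```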
+1  A: 

Just writing to explain (rsp's comment says the same thing) why your original regex didn't work. The regex engine first matches ([^\r])\n at the ^ characters:

One\n\nTwo\r\nThree
  ^^^

After the first replacement, the regex engine is at the ^:

One\r\n\nTwo\r\nThree
       ^

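It's now trying to match ([^\r])\n. The character before the caret was consumed by the first match and can't be reused. The [^\r] can match the \n at the caret, but the next character is T, not \n, so that attempt fails. Nothing later matches either, because the only remaining \n (between Two and Three) comes right after a \r.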
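
To see which parts of the input the original pattern actually consumes, here is a small sketch that records each match's start offset and text. The variable names are illustrative only.

```ruby
str = "One\n\nTwo\r\nThree"

# Collect [start offset, matched text] for every match of the original pattern.
matches = []
str.scan(/([^\r])\n/) { matches << [$~.begin(0), $~[0]] }
p matches  # [[2, "e\n"]]

# The single match covers "e" plus the first \n, so only that \n gets a \r.
result = str.gsub(/([^\r])\n/, "\\1\r\n")
p result   # "One\r\n\nTwo\r\nThree"
```

Matches found by `scan` and `gsub` never overlap, which is why the first \n cannot also serve as the "not \r" character in front of the second one.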

Andomar
My conclusion is that it should be ([^\r])?\n to cover the case where there is no character there?
Ape-inago
@Ape-inago: `([^\r])?\n` would match any `\n`, even `\r\n`. Bart already posted the best solution I'd say.
Andomar
So are you saying that after the first replacement the regexp cursor is at the 2nd \n, as you showed, which gets matched by the [^\r], and then the T of Two gets compared with \n, which doesn't match? In effect, is the problem caused because the character positions have shifted after adding an extra character in the substitution part way through the matching?
mikej
@mikej: Yes, `[^\r]` is a positive assertion: it says that there must be a character that's not `\r`. A negative assertion would say there may not be a character that's `\r`. In regex terms that's `(?<!\r)\n`, but not all regex flavors support negative lookbehind.
Andomar
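
For completeness, a sketch of the negative lookbehind form mentioned above. It assumes Ruby 1.9 or later, where the built-in regexp engine supports lookbehind; it will not work on the 1.8.6 interpreter from the question without Oniguruma.

```ruby
# Assumption: Ruby 1.9+ (lookbehind is not available in stock Ruby 1.8.6).
str = "\nOne\n\nTwo\r\nThree"

# (?<!\r) asserts that the character before \n is not \r, without consuming
# anything, so consecutive \n characters and a leading \n are all matched.
fixed = str.gsub(/(?<!\r)\n/, "\r\n")
p fixed  # "\r\nOne\r\n\r\nTwo\r\nThree"
```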