I have a data stream that may contain \r, \n, \r\n, \n\r or any combination of them. Is there a simple way to normalize the data so that each of them becomes a \r\n pair, to make display more consistent?

So something that would yield this kind of translation table:

\r     --> \r\n
\n     --> \r\n
\n\n   --> \r\n\r\n
\n\r   --> \r\n
\r\n   --> \r\n
\r\n\n --> \r\n\r\n
+3  A: 

A regex would help. You could do something roughly like this:

(\r\n|\n\n|\n\r|\r|\n) replace with \r\n

This regex produced these results from the table posted (just testing left side) so a replace should normalize.

\r   => \r 
\n   => \n 
\n\n => \n\n 
\n\r => \n\r 
\r\n => \r\n 
\r\n => \r\n 
\n   => \n
Quintin Robinson
Except if it contained \r\n already, the replacement would expand that to \r\n\r\n. Same for \n\r. I believe the answer is in the arcane language of regex, but it's a black art to me.
ctacke
CQ, that doesn't do what he asked for. A regex might work, but not as you've posted it.
Derek Park
Agreed, I did not account for existing \r\n.
Quintin Robinson
That is why I said "roughly", though; a little tweaking, like handling a preceding \r\n, might resolve this.
Quintin Robinson
+1  A: 

You're overcomplicating this. Ignore every \r and turn every \n into an \r\n.

In Pseudo-C#:

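char[] chunk = new char[X];
StringBuilder output = new StringBuilder();   // from System.Text

// buffer is a TextReader; Read returns how many chars were actually filled in
int count = buffer.Read(chunk, 0, chunk.Length);
for (int i = 0; i < count; i++)
{
   char c = chunk[i];
   switch (c)
   {
      case '\r' : break;                          // ignore
      case '\n' : output.Append("\r\n"); break;   // expand to a full pair
      default   : output.Append(c); break;        // copy everything else
   }
}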

EDIT: \r alone is not a line terminator, so I doubt you really want to expand \r to \r\n.

VVS
He wants standalone \r to turn into \r\n as well.
Derek Park
Hm. Can't believe he really wants that :)
VVS
Macs used CR for linebreaks up to MacOS 9. It's \n\r that surprises me.
Steve Jessop
Macs use CR for linebreaks? Didn't know that..
VVS
+2  A: 

I believe this will do what you need:

using System.Text.RegularExpressions;
// ...
string normalized = Regex.Replace(originalString, @"\r\n|\n\r|\n|\r", "\r\n");

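I'm not 100% sure of the exact syntax, and I don't have a .NET compiler handy to check. I wrote it in Perl and converted it into (hopefully correct) C#. The only real trick is to list "\r\n" and "\n\r" before the single characters: alternation tries the alternatives left to right, so the two-character sequences have to come first or they would be matched as two separate terminators.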

To apply it to an entire stream, just run it on chunks of input. (You could do this with a stream wrapper if you want.) Note that a \r\n pair split across two chunks would be seen as two separate terminators.
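
A minimal sketch of what such a chunk-by-chunk wrapper might look like, assuming the chunks arrive as strings. The class name NewlineNormalizer and its Process method are hypothetical, not from any library. It avoids the regex and instead carries one character of state between calls, so a pair that straddles a chunk boundary still counts as a single terminator:

```csharp
using System.Text;

// Hypothetical helper: normalizes line endings to "\r\n", one chunk at a time.
public sealed class NewlineNormalizer
{
    // First half of a possible two-character terminator, or '\0' if none is pending.
    private char pending = '\0';

    public string Process(string chunk)
    {
        var output = new StringBuilder(chunk.Length + 16);
        foreach (char c in chunk)
        {
            if (c == '\r' || c == '\n')
            {
                if (pending != '\0' && pending != c)
                {
                    // Second half of "\r\n" or "\n\r": the pair was already
                    // emitted when its first half was seen, so swallow it.
                    pending = '\0';
                }
                else
                {
                    // Lone terminator, or the start of a new pair.
                    output.Append("\r\n");
                    pending = c;
                }
            }
            else
            {
                pending = '\0';
                output.Append(c);
            }
        }
        return output.ToString();
    }
}
```

Feeding it "a\r" and then "\nb" yields "a\r\n" followed by "b", the same as running the regex over the unsplit input.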


The original Perl:

$str =~ s/\r\n|\n\r|\n|\r/\r\n/g;

The test results:

[bash$] ./test.pl
\r -> \r\n
\n -> \r\n
\n\n -> \r\n\r\n
\n\r -> \r\n
\r\n -> \r\n
\r\n\n -> \r\n\r\n


Update: Now converts \n\r to \r\n, though I wouldn't call that normalization.

Derek Park
This did not meet the requirements of the above example in the table.. Look at the regex I modified, you need to account for \n\n.
Quintin Robinson
This one is close, but \n\r should simply swap the elements to be a \r\n (saw this input from a VB developer's code)
ctacke
Ok, made that change. I wouldn't consider that normalization, but it's easy enough to add to the regex.
Derek Park
You will need to remove the '@' from the replacement string. If you don't it will replace '\r\n' with '\\r\\n' because you are asking for the literal string "\r\n". Even better would be to replace with the Environment.NewLine constant.
NerdFury
Thanks for catching that, NerdFury. I removed the @ from the replacement string. I would change it to the NewLine constant, but since he specifically asked for "\r\n", I figure I should leave that alone.
Derek Park
A: 

I'm with Jamie Zawinski on RegEx:

'Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems.'

For those of us who prefer readability:

  • Step 1

    Replace \r\n by \n
    Replace \n\r by \n (if you really want this; some posters seem to think not)
    Replace \r by \n

  • Step 2

    Replace \n by Environment.NewLine or \r\n or whatever.
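
The two steps above can be sketched in C# as a chain of plain string replacements. The class and method names are hypothetical, and "\r\n" is used for Step 2 because the question asks for it specifically:

```csharp
// Hypothetical helper: the Replace calls mirror the steps above, in the same order.
public static class LineEndings
{
    public static string Normalize(string input)
    {
        return input
            .Replace("\r\n", "\n")   // Step 1: collapse \r\n
            .Replace("\n\r", "\n")   // Step 1: collapse \n\r (optional)
            .Replace("\r", "\n")     // Step 1: lone \r
            .Replace("\n", "\r\n");  // Step 2: expand to the target terminator
    }
}
```

Because each pass sees the output of the previous one, mixed runs can come out differently from a single-pass regex: "\r\n\r" becomes one line break here, whereas a left-to-right regex match would produce two.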
Joe
A: 

I agree regex is the answer; however, everyone else fails to mention Unicode line separators. Those (and their variations with \n) should be included.
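
A rough sketch of how the regex approach might be extended, assuming the separators meant here are NEL (U+0085), LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029). That set is an assumption, and the class name is hypothetical:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: treats CR, LF and three Unicode separators as line breaks.
public static class UnicodeLineEndings
{
    // Two-character sequences come first so they match as a single terminator.
    private static readonly Regex AnyLineBreak =
        new Regex(@"\r\n|\n\r|[\n\r\u0085\u2028\u2029]");

    public static string Normalize(string input)
    {
        // The replacement is a regular (non-verbatim) string, so it is a real CR LF.
        return AnyLineBreak.Replace(input, "\r\n");
    }
}
```

Which combinations of these separators with \n should count as one break is not specified above, so this sketch treats each Unicode separator as a terminator on its own.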

leppie