How would you normalize all new-line sequences in a string to one type?

I'm looking to make them all CRLF for the purpose of email (MIME documents). Ideally this would be wrapped in a static method that executes very quickly and does not use regular expressions (since the possible variations of line breaks, carriage returns, etc. are limited). Perhaps there's even a BCL method I've overlooked?

ASSUMPTION: After giving this a bit more thought, I think it's safe to assume that CRs are either stand-alone or part of a CRLF sequence. That is, if you see CRLF then you know all CRs can be removed. Otherwise it's difficult to tell how many lines should come out of something like "\r\n\n\r".
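
One possible reading of that assumption, as a rough sketch only: if the input contains any CRLF, treat every CR as redundant and drop it; otherwise treat each lone CR or LF as one break. The class name `AssumedNormalizer` and method name `ToCrLf` are made up for illustration, and other readings of the assumption are possible.

```csharp
using System;

public static class AssumedNormalizer
{
    // Hypothetical helper: one interpretation of the assumption, not a BCL method.
    public static string ToCrLf(string input)
    {
        if (input.Contains("\r\n"))
        {
            // CRLF is present, so every CR is assumed redundant:
            // drop them all, then expand each remaining LF to CRLF.
            return input.Replace("\r", "").Replace("\n", "\r\n");
        }

        // No CRLF anywhere: each lone CR or lone LF counts as one break.
        return input.Replace("\r", "\n").Replace("\n", "\r\n");
    }

    public static void Main()
    {
        string result = ToCrLf("\r\n\n\r");
        // Escape the control characters so the result is visible on the console.
        Console.WriteLine(result.Replace("\r", "\\r").Replace("\n", "\\n")); // prints \r\n\r\n
    }
}
```

Under this reading, "\r\n\n\r" comes out as two line breaks, because the trailing CR is discarded rather than counted.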

+2  A: 
string nonNormalized = "\r\n\n\r";

string normalized = nonNormalized.Replace("\r", "\n").Replace("\n", "\r\n");
Nathan
This example produces four line breaks, whereas the nonNormalized string contains two.
John Feminella
True, it brings up a good question as to when a sequence is used and when it is merely removed (ignored).
Neil C. Obremski
+9  A: 
input.Replace("\r\n", "\n").Replace("\r", "\n").Replace("\n", "\r\n")

This will work if the input contains only one type of line break: CR, LF, or CR+LF.
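
For the static method the question asks for, the chain could be wrapped roughly as follows. The names `ChainedReplace` and `ToCrLf` are placeholders, not existing BCL members. On the mixed sample input in `Main`, each CR, LF, or CRLF comes out as exactly one CRLF; whether that is the right reading for odd sequences such as "\n\r" is a separate question.

```csharp
using System;

public static class ChainedReplace
{
    // Hypothetical wrapper around the three chained Replace calls.
    public static string ToCrLf(string input)
    {
        return input.Replace("\r\n", "\n")   // collapse existing CRLF pairs to LF first
                    .Replace("\r", "\n")     // any CR left over is stand-alone
                    .Replace("\n", "\r\n");  // expand every LF to CRLF
    }

    public static void Main()
    {
        string result = ToCrLf("a\rb\nc\r\nd");
        // Escape the control characters so the result is visible on the console.
        Console.WriteLine(result.Replace("\r", "\\r").Replace("\n", "\\n")); // prints a\r\nb\r\nc\r\nd
    }
}
```

The order matters: collapsing CRLF before touching lone CRs keeps an existing CRLF from being counted twice.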

Daniel Brückner
+6  A: 

It depends on exactly what the requirements are. In particular, how do you want to handle "\r" on its own? Should that count as a line break or not? As an example, how should "a\n\rb" be treated? Is that one very odd line break, one "\n" break and then a rogue "\r", or two separate linebreaks? If "\r" and "\n" can both be linebreaks on their own, why should "\r\n" not be treated as two linebreaks?

Here's some code which I suspect is reasonably efficient.

using System;
using System.Text;

class LineBreaks
{    
    static void Main()
    {
        Test("a\nb");
        Test("a\nb\r\nc");
        Test("a\r\nb\r\nc");
        Test("a\rb\nc");
        Test("a\r");
        Test("a\n");
        Test("a\r\n");
    }

    static void Test(string input)
    {
        string normalized = NormalizeLineBreaks(input);
        string debug = normalized.Replace("\r", "\\r")
                                 .Replace("\n", "\\n");
        Console.WriteLine(debug);
    }

    static string NormalizeLineBreaks(string input)
    {
        // Allow 10% as a rough guess of how much the string may grow.
        // If we're wrong we'll either waste space or have extra copies -
        // it will still work
        StringBuilder builder = new StringBuilder((int) (input.Length * 1.1));

        bool lastWasCR = false;

        foreach (char c in input)
        {
            if (lastWasCR)
            {
                lastWasCR = false;
                if (c == '\n')
                {
                    continue; // Already written \r\n
                }
            }
            switch (c)
            {
                case '\r':
                    builder.Append("\r\n");
                    lastWasCR = true;
                    break;
                case '\n':
                    builder.Append("\r\n");
                    break;
                default:
                    builder.Append(c);
                    break;
            }
        }
        return builder.ToString();
    }
}
Jon Skeet
Very cool; this would definitely be useful on more arbitrary input! For my case I chose to go with an assumption (made an edit), but I voted this up regardless.
Neil C. Obremski
Right. If performance is really significant you may want to benchmark this solution against the accepted one - but only if you've actually ascertained that it's significant via a profiler! I would *hope* this is faster, as it only needs to make a single pass through the data.
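
A crude timing harness along these lines might be a starting point, assuming the solution being compared against is the chained Replace one. The class name, input size, and iteration count are arbitrary choices for illustration, and `Stopwatch` timings like this are only indicative; a profiler or a dedicated benchmarking library would give more trustworthy numbers.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class NormalizeBenchmark
{
    // Chained Replace approach (assumed to be the one being compared against).
    public static string ChainedReplace(string input)
    {
        return input.Replace("\r\n", "\n").Replace("\r", "\n").Replace("\n", "\r\n");
    }

    // Single-pass approach, same logic as NormalizeLineBreaks above.
    public static string SinglePass(string input)
    {
        StringBuilder builder = new StringBuilder((int) (input.Length * 1.1));
        bool lastWasCR = false;
        foreach (char c in input)
        {
            if (lastWasCR)
            {
                lastWasCR = false;
                if (c == '\n') continue; // \r\n already written for the preceding CR
            }
            if (c == '\r' || c == '\n')
            {
                builder.Append("\r\n");
                lastWasCR = (c == '\r');
            }
            else
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }

    static void Time(string label, Func<string, string> normalize, string input)
    {
        normalize(input); // warm-up call so JIT compilation is not measured
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < 20; i++) normalize(input);
        watch.Stop();
        Console.WriteLine("{0}: {1} ms", label, watch.ElapsedMilliseconds);
    }

    public static void Main()
    {
        // Build a test input that cycles through CRLF, LF, and CR line endings.
        StringBuilder sample = new StringBuilder();
        for (int i = 0; i < 100000; i++)
        {
            sample.Append("line ").Append(i);
            sample.Append(i % 3 == 0 ? "\r\n" : (i % 3 == 1 ? "\n" : "\r"));
        }
        string input = sample.ToString();

        // Check that both approaches agree on this input before timing them.
        Console.WriteLine("same output: " + (ChainedReplace(input) == SinglePass(input)));
        Time("chained Replace", ChainedReplace, input);
        Time("single pass", SinglePass, input);
    }
}
```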
Jon Skeet