I wrote a C# program to read an Excel .xls/.xlsx file and output it to CSV and Unicode text. I wrote a separate program to remove blank records. It reads each line with StreamReader.ReadLine(), then goes character by character through the string and skips the line if it consists entirely of commas (for the CSV) or entirely of tabs (for the Unicode text).
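Roughly, the blank-record filter does something like this (file names here are just placeholders):

// needs System.IO and System.Linq
using (var reader = new StreamReader("input.csv"))
using (var writer = new StreamWriter("cleaned.csv"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // A "blank" record is a line that is empty or made up entirely of commas.
        if (line.Length == 0 || line.All(c => c == ','))
        {
            continue;
        }
        writer.WriteLine(line);
    }
}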

The problem occurs when the Excel file contains embedded newlines (\x0A) inside cells. I changed my XLS-to-CSV converter to find these newlines (since it goes cell by cell) and write them out as a bare \x0A, while normal rows are written with StreamWriter.WriteLine().

The problem shows up in the separate program that removes blank records. When I read with StreamReader.ReadLine(), it by definition returns only the line's text, not its terminator. Since an embedded newline makes one record come back as two separate lines, I can't tell which lines are complete records and which are continuations when I write them to the final file.

I'm not even sure I can detect the \x0A on input, because every line break registers as '\n'. I could go character by character, but that would break my logic for removing blank lines.

Any ideas would be greatly appreciated.

+1  A: 

You can't change StreamReader to return the line terminators, and you can't change what it uses for line termination.

I'm not entirely clear about the problem in terms of what escaping you're doing, particularly in terms of "and write them as \x0A". A sample of the file would probably help.

It sounds like you may need to work character by character, or possibly load the whole file first and do a global replace, e.g.

x.Replace("\r\n", "\u0000") // Or some other unused character
 .Replace("\n", "\\x0A") // Or whatever escaping you need
 .Replace("\u0000", "\r\n") // Replace the real line breaks

I'm sure you could do that with a regex and it would probably be more efficient, but I find the long way easier to understand :) It's a bit of a hack having to do a global replace though - hopefully with more information we'll come up with a better solution.
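For example, a rough end-to-end sketch of that whole-file idea might look like this (the file names and the blank-line test are only placeholders, and it's untested):

// needs System.IO and System.Linq
string text = File.ReadAllText("input.csv");

string escaped = text
    .Replace("\r\n", "\u0000")   // stash the real record terminators
    .Replace("\n", "\\x0A")      // escape embedded newlines as literal "\x0A"
    .Replace("\u0000", "\r\n");  // put the real record terminators back

var records = escaped
    .Split(new[] { "\r\n" }, StringSplitOptions.None)
    .Where(line => line.Length > 0 && !line.All(c => c == ','));

// Un-escape the embedded newlines when writing the cleaned file back out.
File.WriteAllLines("cleaned.csv",
    records.Select(line => line.Replace("\\x0A", "\n")));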

Jon Skeet
+1 for "you may need to work character by character"
eglasius
I kind of like my solution (which you can see below), but I really like developing compilers, so it could be one of those "to someone with a hammer everything looks like a nail" type of things
Scott Wisniewski
+1  A: 

Essentially, a hard return in Excel (Alt+Enter) puts a newline in the cell that comes out as \x0A in the default encoding I use to write my CSV. When I write the CSV, I use StreamWriter.WriteLine(), which outputs the line followed by a newline (\r\n on Windows).

The CSV itself is fine and comes out exactly the way Excel would save it. The problem is in the blank record remover: I'm using ReadLine(), which treats an embedded newline the same as a CRLF, so a record with an embedded newline gets split across two reads.

Here's an example of the file after I convert to CSV...

Reference,Name of Individual or Entity,Type,Name Type,Date of Birth,Place of Birth,Citizenship,Address,Additional Information,Listing Information,Control Date,Committees
1050,"Aziz Salih al-Numan
",Individual,Primary Name,1941 or 1945,An Nasiriyah,Iraqi,,Ba’th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)
1050a,???? ???? ???????,Individual,Original script,1941 or 1945,An Nasiriyah,Iraqi,,Ba’th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)

As you can see, the first record has an embedded newline after al-Numan. When I use ReadLine(), I get '1050,"Aziz Salih al-Numan', and when I write that out, WriteLine() ends that line with a CRLF, so I lose the original line terminator. The next ReadLine() returns the remainder of that record, and then the line starting with '1050a'.

I could read the entire file in and replace them, but then I'd have to replace them back afterwards. Basically, what I want to do is look at the line terminator to determine whether it's a bare \x0A or a CRLF; if it's \x0A, I'll use Write() and re-insert that terminator myself.
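A rough sketch of that idea, reading character by character so the terminator is known (the helper name is made up, not a framework method, and this is untested):

// needs System.IO and System.Text
// Returns the next line's text and reports what terminated it:
// a bare '\n' (embedded newline) or "\r\n" (real end of record).
static string ReadLineWithTerminator(TextReader reader, out string terminator)
{
    var sb = new StringBuilder();
    terminator = null;
    int next;
    while ((next = reader.Read()) != -1)
    {
        char c = (char)next;
        if (c == '\n')                  // bare LF: embedded newline
        {
            terminator = "\n";
            return sb.ToString();
        }
        if (c == '\r')                  // CR or CRLF: real record terminator
        {
            if (reader.Peek() == '\n')
            {
                reader.Read();
            }
            terminator = "\r\n";
            return sb.ToString();
        }
        sb.Append(c);
    }
    return sb.Length > 0 ? sb.ToString() : null;   // end of input
}

The caller can then keep the all-commas blank check for lines that ended with "\r\n", and when the terminator was a bare "\n", write the text with Write() followed by "\n" so the embedded break survives.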

Tony Trozzo
+2  A: 

I would recommend that you change your architecture to work more like a parser in a compiler.

You want to create a lexer that returns a sequence of tokens, and then a parser that reads the sequence of tokens and does stuff with them.

In your case the tokens would be:

  1. Column data
  2. Comma
  3. End of Line

You would treat a '\n' ('\x0A') by itself as an embedded newline, and therefore include it as part of a column data token. A '\r\n' would constitute an End of Line token.

This has the advantages of:

  1. Doing only one pass over the data
  2. Storing at most one line's worth of data
  3. Reusing as much memory as possible (for the string builder and the list)
  4. Being easy to change should your requirements change

Here's a sample of what the Lexer would look like:

Disclaimer: I haven't even compiled, let alone tested, this code, so you'll need to clean it up and make sure it works.

using System.Collections.Generic;
using System.IO;
using System.Text;

enum TokenType
{
    ColumnData,
    Comma,
    LineTerminator
}

class Token
{
    public TokenType Type { get; private set;}
    public string Data { get; private set;}

    public Token(TokenType type)
    {
        Type = type;
    }

    public Token(TokenType type, string data)
    {
        Type = type;
        Data = data;
    }
}

private IEnumerable<Token> GetTokens(TextReader s)
{
   var builder = new StringBuilder();

   while (s.Peek() >= 0)
   {
       var c = (char)s.Read();
       switch (c)
       {
           case ',':
           {
               if (builder.Length > 0)
               {
                   yield return new Token(TokenType.ColumnData, ExtractText(builder));
               }
               yield return new Token(TokenType.Comma);
               break;
           }
            case '\r':
            {
                // A CR or CRLF ends the record; a bare '\n' has no case of its
                // own, so it falls to default and stays inside the column data.
                var next = s.Peek();
                if (next == '\n')
                {
                    s.Read();
                }

                if (builder.Length > 0)
                {
                    yield return new Token(TokenType.ColumnData, ExtractText(builder));
                }
                yield return new Token(TokenType.LineTerminator);
                break;
           }
           default:
               builder.Append(c);
               break;
       }

   }

    // Emit any trailing column data if the input doesn't end with a newline.
   if (builder.Length > 0)
   {
       yield return new Token(TokenType.ColumnData, ExtractText(builder));
   }
}

private string ExtractText(StringBuilder b)
{
    var ret = b.ToString();
    b.Remove(0, b.Length);
    return ret;
}

Your "parser" code would then look like this:

public void ConvertXLS(TextReader s)
{
    var columnData = new List<string>();
    bool lastWasColumnData = false;
    bool seenAnyData = false;

    foreach (var token in GetTokens(s))
    {
        switch (token.Type)
        {
            case TokenType.ColumnData:
            {
                 seenAnyData = true;
                 if (lastWasColumnData)
                 {
                     //TODO: do some error reporting
                 }
                 else
                 {
                     lastWasColumnData = true;
                     columnData.Add(token.Data);
                 }
                 break;
            }
            case TokenType.Comma:
            {
                if (!lastWasColumnData)
                {
                    columnData.Add(null);
                }
                lastWasColumnData = false;
                break;
            }
            case TokenType.LineTerminator:
            {
                if (seenAnyData)
                {
                     OutputLine(columnData);
                }
                seenAnyData = false;
                lastWasColumnData = false;
                columnData.Clear();
                break;
            }
        }
    }

    if (seenAnyData)
    {
        OutputLine(columnData);
    }
}
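
For example, one way to wire it up (OutputLine is left abstract above; this stub is only illustrative and untested):

// Illustrative only; assumes using System.Linq, System.IO, System.Collections.Generic.
// One possible OutputLine that drops blank records and writes the rest out.
private TextWriter _output;   // e.g. a StreamWriter over the cleaned CSV

private void OutputLine(List<string> columnData)
{
    // A blank record is one where every column is empty or missing.
    if (columnData.All(string.IsNullOrEmpty))
    {
        return;
    }
    // Note: re-quoting cells that contain commas or newlines is omitted here.
    _output.WriteLine(string.Join(",", columnData));
}

// Driving it:
using (var reader = new StreamReader("input.csv"))
{
    ConvertXLS(reader);
}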
Scott Wisniewski
Thanks a ton Scott. This looks like the appropriate solution. I'm a recent CompSci graduate so it's great to see help like this available.
Tony Trozzo
A: 

I know I'm a little late to the game here, but I was having the same problem and my solution was a lot simpler than most of those given.

If you can determine the column count, which should be easy since the first line usually contains the column titles, you can check each line's column count against the expected one. If they don't match, concatenate the current line onto the previous unmatched line(s) and keep going. For example:

// sr is an open StreamReader over the CSV (System.IO).
string sep = "\",\"";               // fields in this file are quoted, so split on ","
string currentLine;
string lastLine = null;
string[] lineData;
int columnCount = 0;
int lineCount = 0;

while ((currentLine = sr.ReadLine()) != null)
{
    if (lineCount == 0)
    {
        // The first line holds the column titles, so it defines the expected count.
        lineData = currentLine.Split(new string[] { sep }, StringSplitOptions.None);
        columnCount = lineData.Length;
        ++lineCount;
        continue;
    }

    string thisLine = lastLine + currentLine;

    lineData = thisLine.Split(new string[] { sep }, StringSplitOptions.None);
    if (lineData.Length < columnCount)
    {
        // Not enough columns yet: this physical line is only part of a record,
        // so keep accumulating until the column count matches.
        lastLine += currentLine;
        continue;
    }
    else
    {
        lastLine = null;
    }
    ......