I have a big string (let's call it a CSV file; it isn't actually one, but that's easier for now) that I have to parse in C# code. The first step of the parsing process splits the file into individual lines by using a StreamReader object and calling ReadLine until it has read the whole file. However, any given line might contain a literal in single quotes with embedded newlines. I need to find those newlines and temporarily convert them into some other token or escape sequence until I've split the file into an array of lines; then I can change them back.

Example input data:

1,2,10,99,'Some text without a newline', true, false, 90
2,1,11,98,'This text has an embedded newline 
                and continues here', true, true, 90

I could write all of the C# code needed to do this by using string.IndexOf to find the quoted sections and look within them for newlines, but I'm thinking a Regex might be a better choice (i.e. now I have two problems).

A: 

EDIT: Sorry, I've misinterpreted your post. If you're looking for a regex, then here is one:

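content = Regex.Replace(content, "'[^']*'", m => m.Value.Replace("\n", "TOKEN"));

The pattern matches each single-quoted literal in turn, and the lambda replaces every \n inside that literal with TOKEN while leaving the rest of the text untouched. Matching whole literals matters: a pattern that only looks for a \n somewhere between two quotes can also pair the closing quote of one literal with the opening quote of the next and swallow a real line break. This still assumes the quotes are balanced and that a literal never contains an escaped quote, so there might be edge cases (and that's the two problems), but I think it should be OK most of the time.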

That said, I'd still go with a state machine, as @bryansh explains below.
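
To make the round trip from the question concrete, here is a minimal sketch that masks newlines inside single-quoted literals, splits on the remaining ones, and then restores them. The class name QuotedNewlines, the method SplitLines, and the \u0001 placeholder are assumptions made for illustration; any token that cannot occur in the data would do. It also assumes balanced quotes with no escaped quotes inside a literal, and it normalizes restored newlines to \n.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedNewlines
{
    // Hypothetical placeholder; assumed never to occur in the real data.
    public const string Token = "\u0001";

    public static string[] SplitLines(string content)
    {
        // Step 1: hide every newline that sits inside a '...' literal.
        string masked = Regex.Replace(content, "'[^']*'",
            m => m.Value.Replace("\r\n", Token).Replace("\n", Token));

        // Step 2: every newline that is left is a real row separator.
        string[] lines = masked.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);

        // Step 3: put the hidden newlines back (as \n).
        for (int i = 0; i < lines.Length; i++)
            lines[i] = lines[i].Replace(Token, "\n");

        return lines;
    }
}
```

Calling QuotedNewlines.SplitLines on the example input should give one array element per record, with the embedded newline still inside the quoted text of the second record.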

chakrit
+1  A: 

What if you read the whole file into a variable and then split it on non-quoted newlines?
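
A rough sketch of that idea, assuming single quotes delimit the literals and are never escaped inside one. QuoteAwareSplitter and SplitRows are names invented for this example. It walks the string once, toggling a flag at each quote, and only ends a row at a newline seen while the flag is off.

```csharp
using System.Collections.Generic;
using System.Text;

public static class QuoteAwareSplitter
{
    public static List<string> SplitRows(string content)
    {
        var rows = new List<string>();
        var current = new StringBuilder();
        bool insideQuotes = false;

        foreach (char ch in content)
        {
            if (ch == '\'')
                insideQuotes = !insideQuotes;

            if (ch == '\n' && !insideQuotes)
            {
                // Unquoted newline: the row ends here. Drop the CR of a CRLF pair.
                if (current.Length > 0 && current[current.Length - 1] == '\r')
                    current.Length--;
                rows.Add(current.ToString());
                current.Length = 0;
            }
            else
            {
                // Everything else, including quoted newlines, belongs to the row.
                current.Append(ch);
            }
        }

        // Final row when the input does not end with a newline.
        if (current.Length > 0)
            rows.Add(current.ToString());

        return rows;
    }
}
```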

EBGreen
A: 

You could also consider using an established CSV parser (for example, via ODBC), but I'm not positive that it would accept newlines inside quotes as valid.

EBGreen
+3  A: 

Since this isn't a true CSV file, does it have any sort of schema?

From your example, it looks like you have: int, int, int, int, string, bool, bool, int

with that making up your record / object.

Assuming that your data is well formed (I don't know enough about your source to know how valid this assumption is), you could:

  1. Read your line.
  2. Use a state machine to parse your data.
  3. If your line ends while you're still parsing a string, read the next line and keep parsing.

I'd avoid using a regex if possible.
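
The three steps above could be sketched roughly as follows. This is only an illustration: LogicalLineReader and ReadRecords are made-up names, the "state machine" is reduced to a single in-string flag, and it assumes single-quoted strings with no escaped quotes. Splitting each record into typed fields is left out.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LogicalLineReader
{
    public static IEnumerable<string> ReadRecords(TextReader reader)
    {
        var record = new StringBuilder();
        bool insideString = false;
        string line;

        // Step 1: read a physical line.
        while ((line = reader.ReadLine()) != null)
        {
            // Step 2: each single quote on the line toggles the state.
            foreach (char ch in line)
                if (ch == '\'')
                    insideString = !insideString;

            record.Append(line);

            if (insideString)
            {
                // Step 3: the line ended mid-string, so put the newline
                // back and keep reading into the same record.
                record.Append('\n');
            }
            else
            {
                yield return record.ToString();
                record.Length = 0;
            }
        }

        // Unterminated string at end of input: return what was collected.
        if (record.Length > 0)
            yield return record.ToString();
    }
}
```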

bryansh
+2  A: 

State machines for this kind of job are easy to write with C# 2.0 iterators. Here's hopefully the last CSV parser I'll ever write. The whole file is treated as an enumerable sequence of enumerable strings, i.e. rows of columns. IEnumerable is great because the result can then be processed with LINQ operators.

using System.Collections.Generic;
using System.IO;
using System.Text;

public class CsvParser
{
    public char FieldDelimiter { get; set; }

    public CsvParser()
        : this(',')
    {
    }

    public CsvParser(char fieldDelimiter)
    {
        FieldDelimiter = fieldDelimiter;
    }

    public IEnumerable<IEnumerable<string>> Parse(string text)
    {
        return Parse(new StringReader(text));
    }

    public IEnumerable<IEnumerable<string>> Parse(TextReader reader)
    {
        // Materialize each row before yielding it: the row iterator is lazy,
        // so a row the caller never enumerates would leave the reader where
        // it was and this loop would never advance.
        while (reader.Peek() != -1)
            yield return new List<string>(ParseLine(reader));
    }

    IEnumerable<string> ParseLine(TextReader reader)
    {
        bool insideQuotes = false;
        StringBuilder item = new StringBuilder();

        while (reader.Peek() != -1)
        {
            char ch = (char)reader.Read();
            char? nextCh = reader.Peek() > -1 ? (char)reader.Peek() : (char?)null;

            if (!insideQuotes && ch == FieldDelimiter)
            {
                // End of field
                yield return item.ToString();
                item.Length = 0;
            }
            else if (!insideQuotes && ch == '\r' && nextCh == '\n') // CRLF
            {
                reader.Read(); // skip LF
                break;
            }
            else if (!insideQuotes && ch == '\n') // LF for *nix-style line endings
                break;
            else if (insideQuotes && ch == '"' && nextCh == '"') // escaped quote "" inside a quoted field
            {
                item.Append('"');
                reader.Read(); // skip next "
            }
            else if (ch == '"') // opening or closing quote
                insideQuotes = !insideQuotes;
            else
                item.Append(ch);
        }
        // last one
        yield return item.ToString();
    }
}

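Note that the file is read character by character, with the code deciding whether each newline is a row delimiter or part of a quoted string. The parser expects double-quoted fields, so for the single-quoted literals in the question the quote character would need to be changed to '.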

Duncan Smart