ansaurus

Question

What is the best way to parse this file in C# when there is a CRLF inside a field?

Answer 1

A:

You should consider a CSV parsing library.

However, if you are really against that path and can guarantee your column headers are free of stray CRLFs, you could do something like this (more proof of concept than best practice):

string Delimiter = "||";

// Header row: everything up to the first line break, split into column names.
string[] columns = fileString.Substring(0, fileString.IndexOf(Environment.NewLine))
   .Split(new string[] { Delimiter }, StringSplitOptions.RemoveEmptyEntries);

// Everything after the header, split into cells regardless of line breaks.
string[] cells = fileString.Substring(fileString.IndexOf(Environment.NewLine))
   .Split(new string[] { Delimiter }, StringSplitOptions.RemoveEmptyEntries);

List<string> rows = new List<string>();
StringBuilder row = new StringBuilder();
int cellIndex = 0;
int breakIndex = columns.Length;
char[] trimChars = new char[] { '\r','\n',' ' };

foreach(string c in cells)
{
   // Once a full set of columns has been collected, close the row and start a new one.
   if (cellIndex == breakIndex)
   {
       rows.Add(row.ToString().Trim(trimChars));
       cellIndex = 0;
       row = new StringBuilder();
   }
   row.Append(c).Append(" ");
   cellIndex++;
}
// Flush the final row.
rows.Add(row.ToString().Trim(trimChars));

Graphain 2010-07-12 01:06:16

@Graphain - I don't follow. The issue is that the record shows up as two lines when I parse line by line. Also, this isn't a CSV file that I am parsing.

ooo 2010-07-12 01:07:08

I misread, use content.Replace instead.

Graphain 2010-07-12 01:08:04

@Graphain - the issue is with the first line I am parsing above: that record is getting split into multiple lines and is NOT showing up as one line in the array of lines.

ooo 2010-07-12 01:14:34

@ooo you've amended your question, so I see now that you want to ignore CRLF within your text but not at end of line. How can it possibly know it's at the "end of line" with a naive string.Split? You'll have to use a CSV parser.

Graphain 2010-07-12 01:23:35

@Graphain - can you clarify? I am not following your solution. Where would you put this line in the above code? If I am getting rid of all CRLFs up front, what am I using as the delimiter to parse with?

ooo 2010-07-12 01:24:34

@ooo exactly, you cannot distinguish between a CRLF that doesn't belong and one that does. I'll write a quick "solution" but a csv parser would probably help with this.

Graphain 2010-07-12 01:34:14

@ooo updated my example

Graphain 2010-07-12 01:42:30

Answer 2

A:

Just an idea based on what you've shown in the question:

Remove every CRLF that doesn't appear right after | or ||, leaving the last one in place (to mark the line break). With that done, I think your current code will still work the way you want.

Something like this:

string wrongLine = "| Data A | Data B \r\n Continued B | Data C |\r\n";

string rightLine = wrongLine.Replace(" " + Environment.NewLine, string.Empty);

It'll give you this output (maintaining the last CRLF):

"| Data A | Data B Continued B | Data C |\r\n"

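The rule stated in words above (drop every CRLF that does not come right after a |) can also be sketched with a regular expression and a negative lookbehind. This is only a sketch under assumptions: every real record ends with a | immediately before its CRLF, and no field value ends in | just before an embedded line break. The class and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class CrlfCleaner
{
    // Optional spaces/tabs, then a CRLF that is NOT directly preceded by '|',
    // then optional spaces/tabs on the continuation line.
    private static readonly Regex EmbeddedBreak =
        new Regex(@"[ \t]*(?<!\|)\r\n[ \t]*", RegexOptions.Compiled);

    // Replaces each embedded line break (and the whitespace around it) with one space.
    // A CRLF that directly follows '|' is treated as a record terminator and kept.
    public static string RemoveEmbeddedLineBreaks(string text)
    {
        return EmbeddedBreak.Replace(text, " ");
    }
}
```

Passing the wrongLine string from above to CrlfCleaner.RemoveEmbeddedLineBreaks gives the same rightLine value, and it does not depend on a space happening to sit in front of the stray CRLF.
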
Leniel Macaferi 2010-07-12 01:08:05

@Leniel Macaferi - i dont understand your answer

ooo 2010-07-12 01:09:12

@Leniel Macaferi - I show an example at the end of the question. Essentially someone entered data into that field that was itself multi-line, so there is a line break inside that field. Does that clarify?

ooo 2010-07-12 01:12:04

Answer 3

+3 A:

What you have here is delimited text. String.Split() is a notoriously naive choice for parsing that kind of data. It's slow and prone to problems such as the one you're experiencing now. A better solution is something like the Microsoft.VisualBasic.FileIO.TextFieldParser class or the Fast CSV parser over on CodeProject.
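For reference, a minimal sketch of driving TextFieldParser from C#, assuming a single | delimiter and a project that references the Microsoft.VisualBasic assembly. PipeFileReader and ReadRecords are illustrative names only. Going by the documentation for HasFieldsEnclosedInQuotes, a line break inside a field is only tolerated when that field is quoted, so on its own this will not rescue data with unquoted embedded CRLFs.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class PipeFileReader
{
    // Reads '|'-delimited records from a string.
    // Unquoted CRLFs are still treated as record boundaries.
    public static List<string[]> ReadRecords(string text)
    {
        var records = new List<string[]>();
        using (var parser = new TextFieldParser(new StringReader(text)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters("|");
            parser.HasFieldsEnclosedInQuotes = true;

            while (!parser.EndOfData)
            {
                records.Add(parser.ReadFields());
            }
        }
        return records;
    }
}
```

No VB code is involved; the class is used like any other .NET type.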

Joel Coehoorn 2010-07-12 01:13:57

+1 for the reference to TextFieldParser. I wasn't aware of that being included in the BCL... shame it's tucked away in Microsoft.VisualBasic.dll where it won't get the love or attention it should.

Mark Brackett 2010-07-12 01:16:57

@Joel Coehoorn - can I access the TextFieldParser from C#?

ooo 2010-07-12 01:18:43

Yes, you can use VB code inside your C# project...

Leniel Macaferi 2010-07-12 01:20:00

@ooo yes, you can use that class in your C# project, and to clarify Leniel's comment, it doesn't mean writing any VB code anywhere.

Joel Coehoorn 2010-07-12 01:27:24

@Joel - Will the TextFieldParser and/or CSV parser work without the embedded CRLF being quoted? Docs don't seem to indicate it...

Mark Brackett 2010-07-12 02:30:56

@Mark. Not sure about TextFieldParser, but I think Fast CSV has an option for that if you use the first record as column headers.

Joel Coehoorn 2010-07-12 03:10:51

Answer 4

+2 A:

Not exactly elegant, but this brute-force solution is the first to come to mind. Split, and then combine if short:

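var lines = content.Split(...);
string[] header = lines[0].Split(...);
int numberOfColumns = header.Length;

var parsedLines = new List<string[]>();
for (int i = 1; i < lines.Length; i++) {
   var line = lines[i];
   string[] fields;

   // A short record means a field contained a CRLF: append the next line and re-split.
   // The bounds check stops the loop if the file ends with a short record.
   while ((fields = line.Split(...)).Length < numberOfColumns && i + 1 < lines.Length) {
     // combine with next, and increment i
     line += lines[++i];
   }

   parsedLines.Add(fields);
}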

Mark Brackett 2010-07-12 01:14:15

@Mark Brackett - the issue with this is that the first line will fail as this "exception" record will end up as two entries in the lines array

ooo 2010-07-12 01:16:00

@ooo - Yes, it does end up twice in `lines`. But, when you're grabbing the fields you notice that it's short and keep grabbing the next `line` until you have enough fields. The only caveat is that the header row can't have CRLF (unless you know ahead of time what the number of columns should be).

Mark Brackett 2010-07-12 01:22:43

@Mark Brackett - gotcha . . i responded too quickly :)

ooo 2010-07-12 01:23:29

@Mark Brackett - I understand where you are going, but I still think there is a bug in your code: your fields.Union is still not combining that data record into one record. It's still showing up in multiple fields because of the CRLF in between them. Hopefully this makes sense?

ooo 2010-07-12 01:45:54

@ooo - Fixed by combining the line and parsing afterwards. You end up eating the embedded CRLF, but you could easily add it back if needed.

Mark Brackett 2010-07-12 02:26:37

Answer 5

A:

This is a classic example of Bad Data, or rather a bad choice of delimiters. Before writing a parser, you must be 100% sure about the data your code should expect.

In this case you encountered a CRLF in your data. How would you (or your code) know that it's not actually a delimiter?

I'd say use a better delimiter if you have the choice.

EDIT: You need to have an understanding with the sender on the delimiter, and then it is the sender's responsibility to ensure the data quality.

Looking at your sample data, '|CRLF' seems to be a better delimiter than 'CRLF'. But how do you (the parser) make sure that this delimiter does not occur in the actual data? You cannot. What you can do is validate the quality of the data against the pattern agreed with the sender (e.g. the number of columns in a record). If the validation fails, report the error back to the sender and ask for a re-transmit.

A better approach would be for the sender to give you a header with the details of the data (i.e. number of records, number of columns, etc.).

As a parser, your control over the data is limited. This problem NEEDS support from the sender.

Srikanth Venugopalan 2010-07-12 01:15:48

@Whiskey-Tango-Foxtrot - you are simply repeating the question. I am NOT choosing the delimiter; this is what I am getting, and I am trying to get help coming up with a parser that will work on this dataset.

ooo 2010-07-12 01:17:19

Answer 6

+1 A:

There's a simple fix in this case:

Grab one line. Does it end with a |? If not, add a CRLF and the next line to it. Repeat until it does end in |, then parse it.
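A rough sketch of that loop, assuming records are separated by CRLF and every complete record ends with | (possibly followed by trailing spaces). RecordJoiner and JoinRecords are names invented for the example.

```csharp
using System;
using System.Collections.Generic;

public static class RecordJoiner
{
    // Splits the text on CRLF, then glues physical lines back together
    // (re-inserting the CRLF) until the accumulated record ends with '|'.
    public static List<string> JoinRecords(string text)
    {
        var records = new List<string>();
        string[] lines = text.Split(new[] { "\r\n" }, StringSplitOptions.None);
        string pending = null;

        foreach (string line in lines)
        {
            // Skip blank lines between records (e.g. the empty tail after the final CRLF).
            if (pending == null && line.Length == 0)
            {
                continue;
            }

            pending = pending == null ? line : pending + "\r\n" + line;

            if (pending.TrimEnd().EndsWith("|", StringComparison.Ordinal))
            {
                records.Add(pending);
                pending = null;
            }
        }

        // Keep an unterminated trailing fragment rather than silently dropping it.
        if (!string.IsNullOrWhiteSpace(pending))
        {
            records.Add(pending);
        }

        return records;
    }
}
```

Each string in the result is one logical record, with any embedded CRLF preserved inside its field, ready to be split on the field delimiter. The check fails if a field value itself ends in | right before an embedded line break.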

Loren Pechtel 2010-07-12 01:24:01
