I am trying to parse a file that looks like this:

|| Column Header A || Column Header B || Column Header C ||CRLF
| Data A | Data B | Data C |CRLF
| Data A | Data B | Data C |CRLF

"CRLF" represents a line break

I had code that parsed this fine.

I first parse the file into an array of lines:

 string[] lines = fileString.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

Then I parse each row into an array of column data values.

First, I parse to get the header using:

  string Delimiter = "||";
  string[] columns = line.Split(new string[] { Delimiter }, StringSplitOptions.RemoveEmptyEntries);

Then I parse the rest of the rows using:

  string Delimiter = "|";
  string[] columns = line.Split(new string[] { Delimiter }, StringSplitOptions.RemoveEmptyEntries);

This worked perfectly until I found a record that had a CRLF inside a field, which broke my parsing.

Can anyone think of a good way to parse this data that accounts for the fact that a field in a row may contain a CRLF? Here is an example:

|| Column Header A || Column Header B || Column Header C ||CRLF
| Data A | Data B | Data C |CRLF
| Data A | Data B CRLF Continued B | Data C |CRLF

The issue is that when I do the initial split to get the array of lines, I get 4 lines here instead of 3, because the last row shows up as two entries in that array.

A: 

You should consider a CSV parsing library.

However, if you are really against that path and can guarantee your column headers are free of stray CRLFs, you could do something like this (more proof of concept than best case):

// Header row: assumed to contain no embedded CRLF.
int headerEnd = fileString.IndexOf(Environment.NewLine);
string[] columns = fileString.Substring(0, headerEnd)
   .Split(new string[] { "||" }, StringSplitOptions.RemoveEmptyEntries);

// Data rows separate cells with a single "|".
string[] cells = fileString.Substring(headerEnd)
   .Split(new string[] { "|" }, StringSplitOptions.RemoveEmptyEntries);

List<string> rows = new List<string>();
StringBuilder row = new StringBuilder();
int cellIndex = 0;
int breakIndex = columns.Length;
char[] trimChars = new char[] { '\r','\n',' ' };

foreach(string c in cells)
{
   // Whitespace-only pieces are the CRLFs between rows, so skip them.
   // (Note: this also skips cells that are genuinely blank.)
   if (c.Trim(trimChars).Length == 0) continue;

   // Once a full row of cells has been collected, start a new row.
   if (cellIndex == breakIndex)
   {
       rows.Add(row.ToString().Trim(trimChars));
       cellIndex = 0;
       row = new StringBuilder();
   }
   row.Append(c).Append(" ");
   cellIndex++;
}
rows.Add(row.ToString().Trim(trimChars));
Graphain
@Graphain - I don't follow. The issue is that the record shows up as two lines when I split the file into lines. Also, this isn't a CSV file that I am parsing.
ooo
I misread, use content.Replace instead.
Graphain
@Graphain - the issue is with the first step above: that record gets split up into multiple lines and does NOT show up as one line in the array of lines.
ooo
@ooo you've amended your question, so I see now that you want to ignore a CRLF within your text but not at the end of a line. How can a naive string.Split possibly know whether it's at the "end of line"? You'll have to use a CSV parser.
Graphain
@Graphain - can you clarify, as I am not following your solution? Where would you put this line in the above code? If I am getting rid of all CRLFs up front, what am I using as the delimiter to parse with?
ooo
@ooo exactly, you cannot distinguish between a CRLF that doesn't belong and one that does. I'll write a quick "solution" but a csv parser would probably help with this.
Graphain
@ooo updated my example
Graphain
A: 

Just an idea based on what you've shown in the question:

Remove every CRLF that doesn't appear right after | or ||, leaving the one at the end of each row (to mark the line break). With that done, I think your current code will still work the way you want.

Something like this:

string wrongLine = "| Data A | Data B \r\n Continued B | Data C |\r\n";

string rightLine = wrongLine.Replace(" " + Environment.NewLine, string.Empty);

It'll give you this output (maintaining the last CRLF):

"| Data A | Data B Continued B | Data C |\r\n"
Leniel Macaferi
@Leniel Macaferi - I don't understand your answer.
ooo
@Leniel Macaferi - I show an example at the end of the question. Essentially, someone entered multi-line data into that field, so there is a line break inside the field. Does that clarify?
ooo
+3  A: 

What you have here is delimited text. String.Split() is a notoriously naive choice for parsing that kind of data. It's slow and prone to problems such as the one you're experiencing now. A better solution is something like the Microsoft.VisualBasic.FileIO.TextFieldParser class or the Fast CSV parser over on CodeProject.
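
For illustration, here is a minimal sketch of reading pipe-delimited rows with TextFieldParser from C#. The class and method names (TextFieldParserSketch, ReadRows) are made up for this example, and it assumes the data has no quoting. Because each row starts and ends with "|", the returned arrays include an empty first and last field. Note also that an unquoted CRLF inside a field still ends the record, so this on its own does not handle the case in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO; // on .NET Framework, reference Microsoft.VisualBasic.dll

public static class TextFieldParserSketch
{
    // Reads pipe-delimited text and returns one string[] per physical line.
    public static List<string[]> ReadRows(string text)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(new StringReader(text)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters("|");
            parser.HasFieldsEnclosedInQuotes = false; // assumption: no quoted fields
            parser.TrimWhiteSpace = true;             // strip the padding around each value

            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }
}
```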

Joel Coehoorn
+1 for the reference to TextFieldParser. I wasn't aware of that being included in the BCL... shame it's tucked away in Microsoft.VisualBasic.dll where it won't get the love or attention it should.
Mark Brackett
@Joel Coehoorn - can I access the TextFieldParser from C#?
ooo
Yes, you can use VB code inside your C# project...
Leniel Macaferi
@ooo yes, you can use that class in your C# project, and to clarify Leniel's comment, it doesn't mean writing any VB code anywhere.
Joel Coehoorn
@Joel - Will the TextFieldParser and/or CSV parser work without the embedded CRLF being quoted? Docs don't seem to indicate it...
Mark Brackett
@Mark. Not sure about TextFieldParser, but I think Fast CSV has an option for that if you use the first record as column headers.
Joel Coehoorn
+2  A: 

Not exactly elegant, but this brute-force solution is the first to come to mind. Split, and then combine if short:

var lines = content.Split(...);
string[] header = lines[0].Split(...);
int numberOfColumns = header.Length;

var parsedLines = new List<string[]>();
for (int i = 1; i < lines.Length; i++) {
   var line = lines[i];
   string[] fields;

   // A short row means a field contained a CRLF, so keep appending
   // the next line (and advancing i) until the row is complete.
   while ((fields = line.Split(...)).Length < numberOfColumns && i + 1 < lines.Length) {
     line += lines[++i];
   }

   parsedLines.Add(fields);
}
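
A fuller, self-contained sketch of the same split-then-combine idea follows. The class and method names (CombineShortLines, Parse, SplitRow) are hypothetical, and the delimiters are taken from the sample in the question: "\r\n" between lines, "||" in the header, "|" in data rows. Unlike the outline above, it puts the embedded CRLF back when joining. It assumes the header row has no embedded CRLF and that the break never falls inside the last field of a row, since such a row already has enough fields before the continuation line is seen.

```csharp
using System;
using System.Collections.Generic;

public static class CombineShortLines
{
    public static List<string[]> Parse(string content)
    {
        string[] lines = content.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);

        // The header tells us how many fields a complete row must have.
        int numberOfColumns = lines[0]
            .Split(new[] { "||" }, StringSplitOptions.RemoveEmptyEntries).Length;

        var parsedLines = new List<string[]>();
        for (int i = 1; i < lines.Length; i++)
        {
            string line = lines[i];
            string[] fields = SplitRow(line);

            // Too few fields: the row was cut by an embedded CRLF, so glue on the next line.
            while (fields.Length < numberOfColumns && i + 1 < lines.Length)
            {
                line += "\r\n" + lines[++i];
                fields = SplitRow(line);
            }
            parsedLines.Add(fields);
        }
        return parsedLines;
    }

    private static string[] SplitRow(string line)
    {
        string[] fields = line.Split(new[] { "|" }, StringSplitOptions.RemoveEmptyEntries);
        for (int j = 0; j < fields.Length; j++)
        {
            fields[j] = fields[j].Trim(); // drop the padding around each value
        }
        return fields;
    }
}
```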
Mark Brackett
@Mark Brackett - the issue with this is that the first step will fail, as this "exception" record will end up as two entries in the lines array.
ooo
@ooo - Yes, it does end up twice in `lines`. But, when you're grabbing the fields you notice that it's short and keep grabbing the next `line` until you have enough fields. The only caveat is that the header row can't have CRLF (unless you know ahead of time what the number of columns should be).
Mark Brackett
@Mark Brackett - gotcha, I responded too quickly :)
ooo
@Mark Brackett - I understand where you are going, but I still think there is a bug in your code: your fields.Union is still not combining that data record into one record. It's still showing up in multiple fields because of the CRLF in between them. Hopefully this makes sense?
ooo
@ooo - Fixed by combining the line and parsing afterwards. You end up eating the embedded CRLF, but you could easily add it back if needed.
Mark Brackett
A: 

This is a classic example of Bad Data, or rather a bad choice of delimiters. Before writing a parser, you must be 100% sure about the data your code should expect.

In this case you encountered a CRLF in your data. How would you (or your code) know that it's not actually a delimiter?

I'd say use a better delimiter if you have the choice.

EDIT: You need to have an understanding with the sender on the delimiter, and then it is the sender's responsibility to ensure the data quality.

Looking at your sample data, '|CRLF' seems to be a better delimiter than 'CRLF'. But how do you (the parser) make sure that this delimiter does not occur in the actual data? You cannot. What you can do is validate the quality of the data against the pattern agreed with the sender (e.g. the number of columns in a record). If the validation fails, report the error back to the sender and ask for a re-transmit.

A better approach would be for the sender to give you a header with the details of the data (i.e. number of records, number of columns, etc.)

As a parser, your control over the data is limited. This problem NEEDS support from the sender.
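
As a rough illustration of that kind of validation, the sketch below flags records whose field count does not match the agreed column count. The names (RecordValidator, FindBadRecords) are invented for this example, and the expected count is assumed to come from whatever was agreed with the sender.

```csharp
using System;
using System.Collections.Generic;

public static class RecordValidator
{
    // Returns the zero-based indexes of records whose field count
    // differs from the expected number of columns.
    public static List<int> FindBadRecords(IList<string[]> records, int expectedColumns)
    {
        var bad = new List<int>();
        for (int i = 0; i < records.Count; i++)
        {
            if (records[i].Length != expectedColumns)
            {
                bad.Add(i);
            }
        }
        return bad;
    }
}
```

Any index it returns points at a record to report back to the sender rather than to guess at.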

Srikanth Venugopalan
@Whiskey-Tango-Foxtrot - you are simply repeating the question. I am NOT choosing the delimiter; this is what I am getting, and I am trying to get help coming up with a parser that will work on this dataset.
ooo
+1  A: 

There's a simple fix in this case:

Grab one line. Does it end with a |? If not, add a CRLF and the next line to it. Repeat until it does end in |, then parse it.
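
A minimal sketch of that loop, assuming CRLF line endings and that every complete row ends with "|" (the names JoinUntilPipe and ReadRecords are made up here). It would be fooled by a field that begins with a line break, because the physical line before it would already end in "|".

```csharp
using System;
using System.Collections.Generic;

public static class JoinUntilPipe
{
    // Groups physical lines into logical records: a record is complete
    // only once the accumulated text ends with "|".
    public static List<string> ReadRecords(string content)
    {
        string[] lines = content.Split(new[] { "\r\n" }, StringSplitOptions.None);
        var records = new List<string>();
        string pending = null;

        foreach (string line in lines)
        {
            // Ignore blank lines between records.
            if (pending == null && line.Trim().Length == 0) continue;

            // Either start a new record or glue this line onto the unfinished one.
            pending = pending == null ? line : pending + "\r\n" + line;

            if (pending.TrimEnd().EndsWith("|", StringComparison.Ordinal))
            {
                records.Add(pending);
                pending = null;
            }
        }

        // Keep a trailing record that never reached its closing "|".
        if (pending != null) records.Add(pending);
        return records;
    }
}
```

Each returned record can then be split on "||" (header) or "|" (data rows) exactly as in the question's original code.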

Loren Pechtel