ansaurus

Question

Parsing CSV file with un-escaped quotation marks and commas in .NET

Answer 1

A:

This can be a bit of a challenge - however writing code to manually parse a line of csv data is fairly straightforward.

Basically you've got two possible cases:

Where you are guaranteed that every field is delimited by double quotes
Where any string that contains commas will be bounded by double quotes - things could, potentially, get interesting if you've got commas inside quoted strings so you may have to set limits to how clever you want to be (what the provided data can contain).

For the first case basically all you have to do is split the string on the "," combination (this may have to be done manually) - remembering to remove the leading " from the first column/item and the trailing " from the last column/item - and you're pretty much there.

For the second case you have to do more work - either a) split the line on each comma and then process through the results stitching adjacent items back together where an item starts with a double quote but doesn't end with one (you concatenate fields - not forgetting to include the comma - 'til you find the corresponding item that ends with a " but doesn't start with one) or b) loop through the string at each iteration you pull the first item from the list (if it starts with " then you're looking for ", to end it otherwise you're just looking for a comma on its own and allowing in both instances for the boundary case of the delimeter not being found because you're looking at the last column.

Hope this helps. A bit.

Murph 2009-04-22 10:22:57

Answer 2

A:

You could try to tell apart quotations that delimit a field from those that don't checking that the following regex can be matched on the preceding or following characters @"\p{Z},\p{Z}" Using the same concept, you could tell commas in text apart from delimiting commas by the presence of nearby quotation marks. Running these heuristsics and escaping commas and quotes not matching the rules above should yield a pretty clean result Of course there could be cases when this doesn't work fine, but as far as I know, if the CSV is unescaped there'll be no better way to handle this.

emaster70 2009-04-22 10:40:14

Answer 3

A:

If your code doesn't need to be part of an application, but you just want to generate SQL statements to update a database from CSV data, you may be interested in my own CSVFix project, whih is FOSS & allows you to manipulate CSV data in many ways without programming.

anon 2009-04-22 10:43:58

Answer 4

+1 A:

Don't really understand the answers so far.

As far as I can tell, your example input string can't be parsed without adding extra restrictions.

It is your third field which presents a problem. How can the parser tell that after the " between third and Field we haven't finished the third element? By detecting that the following character is not a comma? What if the third field ALSO contained a comma, is that allowed?

"first field", "second , field", "third ", Field " ", "fourth field"

At that point, you're getting into multi-pass trial and error parsing (though I'm no expert). That is, the parser would have to get to the F of Field, and realise that that is unparsable, back up and include the ", in the third field and carry on to the following "... etc

[Even Excel can't cope with that string as it stands.]

Benjol 2009-04-22 12:09:04

One assumes a rule that says that quotes come in pairs. After that its less complex.Observing that a given tool can't cope doesn't mean that a problem can't be solved.

Murph 2009-04-22 21:51:42

If quotes come in pairs, does that exclude pairs of quotes within the quotes or not? Excel was just an example.

Benjol 2009-04-23 05:21:29

Answer 5

A:

cool. Thanks for all your help guys. The files I am dealing with are very large so manual parsing would not be efficient.

I have solved the problem by getting proper csv files with escaped quotation marks and adding validation and error handling for when this hasnt happened.

2009-04-24 09:04:53

You might also want to try a library like CSV Helper. http://csvhelper.com

Josh Close 2010-02-22 23:58:40

ansaurus

tags:

views:

answers:

Parsing CSV file with un-escaped quotation marks and commas in .NET

related questions