views: 107
answers: 2
Hello there.

I've received a CSV file that supposedly has seven columns, and I was doing fine until recently, when they started putting commas, and even CR and LF characters, inside the third data column, all of it enclosed in double quotes.

I was reading it line by line so I could cross-check each row against another file and point to the matching row there, but now that the data includes newline characters and extra commas, my code breaks.

For example:

  1. This one parses just fine:

         1,4778,"El murciélago estaba navegando",10/08/2010,906610,13496-86219-1,1

  2. This one breaks:

         1,4778,"El murciélago estaba navegando,
         y además estaba de parranda",10/08/2010,906610,13496-86219-1,1

Any suggestions for solving this? The third data column is ALWAYS the one with these surprises; the other columns will never contain newlines or extra commas, so there is no need to handle those...

Thanks in advance!

+3  A: 

Probably the cleanest, most systematic way to handle it is to read character by character and use a small state machine, to deal with things like "we're inside a quoted string, so ignore any comma, CR or LF".

One way to do this is to build an array where each row is a current state, and each column is a possible input character. You read an input character, and based on the current state and the input character, you get a next state to enter. You'll typically also have a case statement to take actions based on the current state and the next state (e.g., append the current character to your CurrentField string when what you've read is allowed as part of a field), or save your current field string when you get to the end of the field.
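As a rough illustration of that table lookup (sketched in Python rather than Delphi, with invented state and character-class names, not taken from any library):

```python
# Hypothetical sketch of a table-driven state transition (names invented
# for illustration). Rows are states, columns are character classes;
# each cell holds the next state to enter.
START, IN_QUOTED, IN_PLAIN, END_FIELD, ERROR = range(5)

def char_class(ch):
    # Collapse the input alphabet into a small set of column indices.
    if ch == '"':
        return 0
    if ch == ',':
        return 1
    return 2  # anything else

# transition[state][char_class] -> next state
transition = {
    START:     [IN_QUOTED, END_FIELD, IN_PLAIN],
    IN_QUOTED: [END_FIELD, IN_QUOTED, IN_QUOTED],  # a quote may end the field
    IN_PLAIN:  [ERROR,     END_FIELD, IN_PLAIN],
}

state = START
for ch in '"abc"':
    state = transition[state][char_class(ch)]
print(state == END_FIELD)  # True: the closing quote ended the field
```

The per-character work is just one array index, which is why this approach tends to be fast in practice.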

So, you start in the Start state. In the Start state, if you see a quote, you go to the QuotedField state. If you see a letter or digit, you go to the UnQuotedField state. If you see a comma, you go to the EndField state (i.e., you just read an empty field). If you see much of anything else, you go to the Error state.

In the QuotedField state, anything but a quote is accepted and leaves you in the QuotedField state. When you do see a quote, you need to check whether the next character is a quote (a doubled quote translates to a quote mark embedded in the field) or something else (meaning the quote marked the end of the field). You can do that either by writing a bit of code by hand to peek at the next input character and check whether it's a quote, or you can encode another state that goes back to the QuotedField state if it gets a quote, to the EndField state if it finds a comma, or to the Error state for almost anything else (e.g., a letter or digit).
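Putting the pieces together, here is a minimal character-by-character reader along those lines (a sketch in Python for illustration, not production code and not the asker's actual program). It keeps commas and CR/LF inside quotes as field content and treats a doubled quote as an embedded quote mark:

```python
# Minimal CSV record reader: walks the text one character at a time,
# tracking whether we are inside a quoted field.
def read_records(text):
    records, fields, buf = [], [], []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(text) and text[i + 1] == '"':
                    buf.append('"')    # doubled quote -> embedded quote mark
                    i += 1
                else:
                    in_quotes = False  # closing quote ends the quoted body
            else:
                buf.append(ch)         # commas and CR/LF pass through here
        elif ch == '"':
            in_quotes = True
        elif ch == ',':
            fields.append(''.join(buf)); buf = []
        elif ch == '\n':
            fields.append(''.join(buf)); buf = []
            records.append(fields); fields = []
        elif ch != '\r':
            buf.append(ch)
        i += 1
    if buf or fields:                  # flush a final record with no newline
        fields.append(''.join(buf))
        records.append(fields)
    return records

sample = ('1,4778,"El murciélago estaba navegando,\r\n'
          'y además estaba de parranda",10/08/2010,906610,13496-86219-1,1\r\n')
print(len(read_records(sample)[0]))   # 7 fields despite the embedded comma and CR/LF
```

Note this reads the whole stream rather than splitting on line breaks first, which is exactly why the embedded CR/LF stops being a problem.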

Jerry Coffin
That sounds like a good idea, though it looks like it could be a big performance hit for files over 2 MB. Thank you!
KaOSoFt
@KaOSoFt: it does initially sound pretty slow, but in fact it's frequently pretty fast -- in particular, you manage a lot of the work with a simple table lookup, and (unless you have a lot of states) that entire table typically fits in the processor cache, so the lookup is *quite* fast.
Jerry Coffin
Look at the code in TStrings.SetDelimitedText to see how TStrings/TStringList does this. It handles everything, including the CR/LF, **if** the data is correctly quoted. The harder part will be finding the end of each "line".
Gerry
I took the Tokens example from this page, and it certainly uses some of the ideas you mention: http://www.efg2.com/Lab/Library/Delphi/Strings/index.html Thanks!
KaOSoFt
+1  A: 

It is almost unknown, but the BDE can read CSV files; it just needs a definition file telling it what format the file is in. How to use it is covered in the BDE documentation.
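For the BDE's ASCII text driver, that definition is a schema (.SCH) file alongside the data file. A rough sketch of what one might look like is below; the section name must match the data file's name, and the field names, types, sizes, and exact keyword spellings here are illustrative guesses that should be checked against the BDE documentation:

```
[MYDATA]
FILETYPE = VARYING
CHARSET = ascii
DELIMITER = "
SEPARATOR = ,
Field1 = Id,Char,10,00,00
Field2 = Code,Char,10,00,10
Field3 = Description,Char,100,00,20
```

With FILETYPE = VARYING and a DELIMITER of `"`, the driver treats the file as delimited text rather than fixed-width columns.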

ldsandon