Hi all,

I've got a simple application that opens a tab-delimited text file, and inserts that data into a database.

I'm using this CSV reader to read the data: http://www.codeproject.com/KB/database/CsvReader.aspx

And it is all working just fine!

Now my client has added a new field to the end of the file, "ClaimDescription", and in some of these claim descriptions the data has quotes in it, for example:

"SUMISEI MARU NO 2" - sea of Japan

This seems to be causing a major headache for my app. I get an exception which looks like this:

The CSV appears to be corrupt near record '1470' field '26 at position '181'. Current raw data : ...

And in that "raw data", sure enough the claim description field shows data with quotes in it.

I want to know if anyone has had this problem before and got round it. Obviously I can ask the client to change the data they send me, but the tab-delimited file is generated by an automated process on their side, so I'd rather keep that as a last resort.

I was thinking I could maybe open the file using a standard TextReader beforehand, escape any quotes, write the content back into a new file, then feed that file into the CSV Reader. It is probably worth mentioning that the average size of these tab-delimited files is around 40MB.

Any help is greatly appreciated!

Cheers, Sean

+2  A: 

Use the FileHelpers library instead. It is widely used and will cope with quoted fields, or fields that contain quotes.

Oded
see this --> http://www.secretgeek.net/csv_trouble.asp
IanL
@Oded: The question isn't asking how to cope with quoted fields. It's asking about *unquoted* fields that contain quote characters.
LukeH
@Luke: Hmmm. I started to disagree with you, on the basis that there is no real CSV "standard". I did find an RFC for it though, and it looks like you are right according to that.
T.E.D.
@T.E.D. - There is an RFC for CSV, but not one for Tab delimited.
Oded
@Oded: A good point. I suppose you *could* make the case that it should be the same as the CSV RFC, but with tabs instead of commas. It would be a better case if they'd said such a thing in the RFC somewhere though. It wouldn't have been hard to do.
T.E.D.
+2  A: 

Check the comment on the codeproject article about quotes:

http://www.codeproject.com/Messages/3382857/Re-Quotes-inside-of-the-Field.aspx

You need to specify in the constructor that you want another character besides " to be used as quotes.
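
For illustration only, here is the same idea sketched with Python's standard csv module rather than the CodeProject reader (its constructor arguments are not shown in this thread, so I won't guess at them). The point is to tell the parser that `"` is not a quote character, so a literal quote is treated as ordinary data. The sample rows and the function name are made up.

```python
import csv
import io

# Made-up sample: tab-delimited rows whose last field contains raw double quotes.
sample = 'ClaimId\tClaimDescription\n1470\t"SUMISEI MARU NO 2" - sea of Japan\n'

def read_tab_delimited(stream):
    """Parse tab-delimited text, treating double quotes as ordinary characters."""
    # QUOTE_NONE tells the reader to do no special processing of quote characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

rows = read_tab_delimited(io.StringIO(sample))
print(rows[1][1])
```

This only works if no field in the file relies on quoting to protect an embedded tab or newline.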

Mikael Svenson
+1 This is what you need to do. If `"` is used as a quote character elsewhere in the CSV, the file is just inconsistent and there is no clean solution
Gabe Moothart
+1  A: 

Maybe you can open the file with your application and replace each quote with another character and then process it.
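
A rough sketch of that replacement in Python, assuming it is acceptable to change the data (here every `"` becomes `'`). It works line by line, so a 40MB file never has to be held in memory at once. The function name and sample record are hypothetical.

```python
import io

def replace_quotes(src, dst, replacement="'"):
    """Copy src to dst line by line, swapping each double quote for `replacement`."""
    for line in src:
        dst.write(line.replace('"', replacement))

# Made-up sample data; with real files, pass the objects returned by open(...) instead.
src = io.StringIO('1470\t"SUMISEI MARU NO 2" - sea of Japan\n')
dst = io.StringIO()
replace_quotes(src, dst)
cleaned = dst.getvalue()
```

Note that the original quotes are lost unless you pick a replacement you can reverse after loading.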

masoud ramezani
A: 

I did some searching, and there is an RFC for CSV files (RFC 4180), and that does explicitly prohibit what they are doing:

Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.

Basically, if they want to do that, they need to enclose the whole field in quotes and, per the same RFC, double each quote that appears inside it, like so:

,"""SUMISEI MARU NO 2"" - sea of Japan",

So if you want you can throw this problem back at them and insist they send you a "proper" RFC 4180 CSV file.

Since you have access to the source files for that CSV reader, another option would be to modify it to handle the kind of quoted strings they are feeding you.

This kind of situation is exactly why it is vital to have source code access to your toolset.

If instead you'd like to preprocess (hack) their files before feeding them to your tool, the correct method would be to look for fields with a quote not immediately in front of or behind a separator, and enclose the whole field in another set of quotes (doubling any quotes already inside it, as RFC 4180 requires for quoted fields).
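
A minimal Python sketch of that kind of preprocessing, under a simplifying assumption: the incoming file never uses quoting on purpose, so every `"` is literal data and no field contains a tab or newline. Any field containing a quote gets wrapped in quotes with its embedded quotes doubled. The helper names are made up, and the csv module is used here only to check that the result parses back to the original text.

```python
import csv
import io

def quote_field(field):
    """Wrap a field in double quotes and double any embedded quotes (RFC 4180 style)."""
    if '"' not in field:
        return field
    return '"' + field.replace('"', '""') + '"'

def fix_line(line, sep="\t"):
    """Requote every field of one record; assumes no field contains sep or a newline."""
    body = line.rstrip("\r\n")
    ending = line[len(body):]  # preserve the original line terminator
    return sep.join(quote_field(f) for f in body.split(sep)) + ending

# Made-up sample record with stray quotes in the last field.
raw = '1470\t"SUMISEI MARU NO 2" - sea of Japan\n'
fixed = fix_line(raw)

# Round-trip check: a quote-aware parser recovers the original field text.
parsed = next(csv.reader(io.StringIO(fixed), delimiter="\t"))
```

If the file mixes properly quoted fields with stray quotes, this sketch would requote the good ones too, so it would need a smarter check before it could be used on such data.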

T.E.D.
A: 

Right - after a late night of Red Bull and scratching my head, I eventually found the problem: it was commas in the "Claim_Description" field. I didn't even think about that because I was using a tab-delimited file, but as soon as I did a find and replace on all commas in the file, it worked absolutely fine!

The next step is to find out how to replace those commas before processing.

Again, thanks for all the suggestions.

Cheers, Sean

seanxe