tags:

views:

267

answers:

4

Scenario: Attempting to import many (>100), large(>1M recs) flat-files (csv).

Problem: Many records are missing field delimiters.

Asking: Is there a parsing tool that will attempt to define and validate the file and allow you to make "in-line" corrections?

ETA: I'm attempting to import this files to MS SQL Server using the DTS Import Wizard. The error message gives me the line number of the file that it breaks on. Fix; repeat.

+2  A: 

I would probably just knock something up in Python (or Perl or Awk).
How do you know where the fields are if the delimiters are missing?

edit - I would probably read in all the lines, ignore the existing delimiter, split them on the known content and write them out again. You only have to do this once and it's going to be quicker and easier than getting an error and fixing a particular line.

Martin Beckett
The fields can be deduced based on content. That's why I would hope for a tool to let me insert the field delimiter on a record that doesn't parse correctly.
Chris Gallucci
+4  A: 

Been there, done that. Wrote my own tool.

It's remarkable how many programs that allegedly output CSV don't actually do so correctly.

A commercial tool would be nice, but given the variety of problems that I've encountered in CSV files (missing delimiters, bad separator values, embedded CR/LF in the middle of fields, etc.), it was worth writing my own. That way when I find a new problem I just extend my existing program to deal with it.

I should probably change my nickname to NIH given my propensities.

Kluge
The problem is probably that there isn't really a widely-used standard (I follow http://www.rfc-editor.org/rfc/rfc4180.txt when writing CSV, but it doesn't help a low twhen you're reading it).
Joachim Sauer
+2  A: 

Also been there - advice is to don't waste time - write your own

mm2010
+1  A: 

If the flat files came from the same source, I agree that writing your own tools is one of the best options -- the problems should be consistent from file to file.

OTOH, if you have an ongoing need to import data from different providers, getting an import tool might be more productive.

It's been almost a decade since I've done ETL work, so I can't make any specific recommendations, though.

BTW, is it possible to have the flat files regenerated? The better solution is to not have broken data in the first place, rather than clean it up afterwards.

Toybuilder