I have an exceptionally bad CSV file. I "solved" the problem in the end by writing scripts by hand to process and reprocess this specific file, but I wanted to know whether there are any other solutions out there.

You have a CSV file in which every field is terminated by a | (pipe) character. A quick check shows that there are 53 fields in the file. The person who gave you the file claims that there are only 28 fields. Not all of the fields have information in them. For example, there are five custom_field_{num} fields, which may or may not have data.

How would you get this into a database nicely?

The ideal solution (and one I searched high and low for) would be to throw it all into a table with no column names or specifications, remove any columns that are completely blank, and then give the remaining columns titles and specifications.
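For illustration, the "load everything, then drop the blank columns" idea could be sketched in Python with only the standard library. This is a minimal sketch, not the scripts actually used: the function name `load_and_prune` and the sample data are hypothetical, and it assumes the file can be read as plain pipe-delimited text.

```python
import csv
import io

def load_and_prune(lines, delimiter="|"):
    """Read pipe-delimited rows and drop columns that are blank in every row.

    Hypothetical helper: returns the original positions of the surviving
    columns together with the pruned rows.
    """
    rows = list(csv.reader(lines, delimiter=delimiter))
    width = max((len(r) for r in rows), default=0)
    # Pad short rows so every row has the same number of positions.
    padded = [r + [""] * (width - len(r)) for r in rows]
    # Keep only the column indexes where at least one row has non-blank data.
    keep = [i for i in range(width) if any(row[i].strip() for row in padded)]
    return keep, [[row[i] for i in keep] for row in padded]

# Made-up sample: every field is terminated by a pipe, so each line
# ends with an extra empty column, and column 1 is always empty.
sample = io.StringIO("a||1||\nb||2|x|\nc||3||\n")
kept, pruned = load_and_prune(sample)
print(kept)    # original column positions that survived
print(pruned)  # rows with the all-blank columns removed
```

The surviving column positions make it possible to map the pruned data back to the provider's field list before assigning names and types.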

+1  A: 

You can't anticipate where a badly formed file will be badly formed. The next time the user gives you the file, the problem could be in the middle of the file.

If you try to guess with a program, you may find that the data is shifted right for a few rows. That would throw things out of whack. For example, you may end up with "Last Name" appearing in the area code column.

Therefore, always plan to manually inspect the file first.

Edit: If the file is large (6k+ lines, as you say), I'd write a program to detect where the issue is, and I'd tell my file provider where they made the mistake. I would not try to automatically "correct" bad data.
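A detection program of that kind could look something like the sketch below. It only reports suspect lines and changes nothing. The function name `find_suspect_rows` and the sample data are made up for illustration, and it assumes that, when no expected count is supplied, the most common field count is the intended one.

```python
import csv
import io
from collections import Counter

def find_suspect_rows(lines, delimiter="|", expected=None):
    """Report rows whose field count differs from the expected count.

    Hypothetical helper: returns (expected_count, [(line_number, field_count), ...]).
    """
    reader = csv.reader(lines, delimiter=delimiter)
    # reader.line_num is the source line on which each record ended.
    counts = [(reader.line_num, len(row)) for row in reader]
    if not counts:
        return expected, []
    if expected is None:
        # Assumption: the most common field count is the intended one.
        expected = Counter(c for _, c in counts).most_common(1)[0][0]
    return expected, [(n, c) for n, c in counts if c != expected]

# Made-up sample: line 3 has an extra field, which shifts its data right.
sample = io.StringIO("id|name|phone\n1|Ann|555\n2|Bob|Smith|555\n3|Cy|555\n")
expected, suspects = find_suspect_rows(sample)
for line_number, field_count in suspects:
    print(f"line {line_number}: {field_count} fields, expected {expected}")
```

The resulting list of line numbers is what you would send back to the file provider.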

aaaa bbbb
I have 6k+ records. I can't manually inspect them all.
Josh K
Ended up washing it with a couple of short scripts; it worked out all right.
Josh K
A: 

I am not sure what you mean by "bad format". But if the separators are consistent and the number of columns is the same on every line, just use MySQL's LOAD DATA INFILE statement. You can specify '|' as the field separator there.

newtover
LOAD DATA is actually pretty picky about any sort of formatting anomalies in the CSV.
Eli
A: 

I find that MS Access does an OK job of letting you align data by column, or split it on separators, in a visual way. From there you can re-export or update directly into SQL.

James Westgate