
I have a data import project in which the clients send ANSI Latin-1 encoded files (ISO-8859-1). However... it seems that on a weekly basis we get a surprise file, one that is not in the correct format, and the import basically dies horribly and needs manual intervention to recover and move on... The most common bad file formats seem to be Excel, compressed files, or XML/HTML files...

So in order to reduce the human intervention, I would like to reasonably determine whether we have a strong ANSI candidate file before attempting to go through each line of the file looking for 1 of 64 bad characters and then making a guesstimate on whether the whole line or file is bad based on the number of bad characters found...

I was thinking of maybe making a Unicode/UTF check and/or a magic number check, or even trying to check for a few specific application types. The files have no file extensions, so any check would be by examining the content, and any fast way to rule out the file as non-ANSI would be perfect, since the import process needs to process 100-500 records a second.

NOTE: Over 100 different types of bad files have been sent to us, including images and PDFs. So there is a concern about whether you can easily and quickly rule out LOTS of different non-ANSI types rather than specifically targeting just a few...

+5  A: 

Given your example "bad" file types, I'd say put in a series of quick checks on the first few bytes of the file:

  • Does it start with a UTF-16 BOM?
  • Is it "<html" or "<!DOCTYPE"?
  • Is it "<?xml"?
  • Does it have a NUL character?
  • Is it `PK\003\004` (the zip file header)?
  • Is it (whatever Excel files start with; you'll have to look that one up 8-)
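These checks can be sketched roughly like this (the signature table and function name are illustrative, not from your system; the OLE2 magic covers legacy Excel .xls files, while modern .xlsx is a zip and is caught by the `PK` check):

```python
# Cheap byte-signature sniffing: reads only the first few bytes of the file,
# so it stays fast at 100-500 records/second.

SIGNATURES = [
    (b"\xff\xfe", "UTF-16 LE BOM"),
    (b"\xfe\xff", "UTF-16 BE BOM"),
    (b"\xef\xbb\xbf", "UTF-8 BOM"),
    (b"PK\x03\x04", "zip archive (also .xlsx/.docx)"),
    (b"\xd0\xcf\x11\xe0", "OLE2 compound file (legacy .xls/.doc)"),
    (b"%PDF", "PDF"),
    (b"\x1f\x8b", "gzip"),
]

TEXT_PREFIXES = [
    (b"<html", "HTML"),
    (b"<!doctype", "HTML"),
    (b"<?xml", "XML"),
]

def sniff(header: bytes):
    """Return a reason string if the header rules the file out, else None."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    # Text formats: ignore leading whitespace and case before comparing.
    lowered = header.lstrip().lower()
    for prefix, name in TEXT_PREFIXES:
        if lowered.startswith(prefix):
            return name
    return None
```

A file that passes all these checks isn't guaranteed to be Latin-1, but anything that fails one can be rejected before line-by-line scanning.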
RichieHindle
+1 nice answer.
Byron Whitlock
The actual list of bad file types received over the last year exceeds 100. NUL is an allowed character in our ANSI Latin-1 implementation, due to a legacy Ghostscript implementation. I've been reading up on the UTF BOMs, but if I understand them correctly, they're rarely used... Love the other check suggestions, though I wonder how many compression headers I would have to check to cover the most commonly used ones?
Shire
You might also want to check whether the first line would import / looks like your file type.
Byron Whitlock
I'd bet zip, bzip, rar, sit (mac), and gzip would give you > 90% coverage.
Byron Whitlock
Love the new edits... That followed my initial thoughts when I thought the list was fairly small. Then the EDI manager e-mailed me a spreadsheet of all the bad files, with the file types they were later able to determine... Well... I started searching for a way to quickly check for LOTS of file types and rule them out in some way.
Shire
Unfortunately, the first 100 characters are alphanumeric and can be any combination of the 20 fields used to determine the primary keys. I will decode those in the process, and if I can't resolve them, mark the record as an error. However, we allow them to have headers in the export file (please don't ask why), so the rule is that they are allowed to have 20% of the records in the file be bad. I'd rather not have to process 20% of 1 million records to determine that the file is bad...
Shire
Obviously I don't know much about your system, but there may well be a small number of filetypes that might account for half the bad inputs. Start with those, then add the more obscure ones as you learn which are more prevalent. Maybe you'll never get to 100% detection rate, but you ought to be able to make a big difference without a huge amount of effort.
RichieHindle
Can you get away with sequential errors? Like if the first 5000 records error out with no good rows, the chances are pretty good it is junk. That way you don't have to process 20% of 1 million rows.
Byron Whitlock
Well, the current method is that if we have to correct more than 10 characters for a record, it is bad. If we can't parse out the 3 primary keys for the record, it is a bad record, and on the 10th bad record we mark the file as bad and roll back everything. So we have 100% detection already... But it is rather arbitrary and takes some processing, though it is not bad. Since I have to fix the workflow part as well, I'm trying to see if I can get better detection and hopefully a report of what type of bad file we got...
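The bail-out rule described here can be sketched like this (thresholds from the comment; the `parse_keys` and `count_corrections` callables are illustrative stand-ins for the real record logic):

```python
def validate(records, parse_keys, count_corrections,
             max_bad_records=10, max_corrections=10):
    """Return False (file is bad, roll back) once the bad-record threshold is hit."""
    bad = 0
    for record in records:
        # A record is bad if it needed too many corrections
        # or its 3 primary keys could not be parsed out.
        if count_corrections(record) > max_corrections or parse_keys(record) is None:
            bad += 1
            if bad >= max_bad_records:
                return False  # mark the file as bad and roll back everything
    return True
```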
Shire
+2  A: 

I like RichieHindle's answer; it is very good. You should also look at the error handling in your import. If you come across a bad file, capture the error, log it, and move on. You should not be stopping the import of other files, or worse, the import of other customers' files, because of one error in one file... If there were a way to notify the customer by email that the file could not be imported, you might not have to do as much manual intervention.
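A minimal sketch of that isolate-log-notify loop (all names here are illustrative, not from the actual system):

```python
import logging

def import_all(paths, import_file, notify_customer):
    """Import each file independently; one bad file never stops the batch."""
    imported = []
    for path in paths:
        try:
            import_file(path)
            imported.append(path)
        except Exception as exc:  # a narrower exception type is better in practice
            logging.error("Import of %s failed: %s", path, exc)
            notify_customer(path, exc)  # e.g. e-mail the sender, then keep going
    return imported
```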

Byron Whitlock
Just started working for this company last month and have suddenly inherited this program. It is my job to greatly improve the user experience. ;) I've also been looking at magic numbers, and wondering if there is a library available that can detect these reliably without false positives...
Shire
+1  A: 

On a Unix-type system, you'd use the "file" command for this. I wonder if there's a port of "file" to Windows? I couldn't find one in Google, but I'd bet it's available on GNU.org somewhere...

If you happen to have a stockpile of typical "bad" files around, it'd be pretty easy to build a database of file signatures, similar to what "file" uses.

Mark Bessey
+1  A: 

Looking at the first few bytes is a good idea, but sometimes it can lead to false conclusions.

I recall making a CSV file for inserting values into a MySQL database, but I opened it with Excel first to check that everything looked OK.

Excel immediately said "This is a SYLK file, are you sure you know what you're doing?"

I'd never even heard of SYLK files before, but Wikipedia told me it's a CSV file with a header in which the first characters are 'ID,'.

This probably doesn't have any bearing on what you're doing, but I thought I'd point out that magic numbers aren't as magic as they might seem.

pavium
A: 

You can read the beginning of the file into a StreamReader (with byte-order-mark detection enabled) and then check its CurrentEncoding property. Note that CurrentEncoding is only meaningful after the first call to a Read method, since that's when detection happens.

http://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding.aspx

Note that 100% reliably detecting the encoding is theoretically impossible. However, the CurrentEncoding property goes through the best general set of heuristics to make a good guess.
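The BOM-based part of that detection can be illustrated in Python (a sketch of the same idea, not the StreamReader implementation; names are illustrative):

```python
import codecs

# UTF-32 BOMs must be checked before UTF-16, because
# BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(header: bytes):
    """Return the encoding name if a BOM is found, else None."""
    for bom, name in BOMS:
        if header.startswith(bom):
            return name
    return None
```

A file with any BOM is by definition not plain Latin-1, so a non-None result here is an immediate reject; a None result proves nothing either way, which is the theoretical limit the answer mentions.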

RichAmberale