
I have a data import project in which the clients send ANSI Latin-1 encoded files (ISO-8859-1). However... it seems that on a weekly basis we get a surprise file, one that is not in the correct format, and the import basically dies horribly and needs manual intervention to recover and move on... The most common bad file formats seem to be Excel, compressed files, or XML/HTML files...

So in order to reduce the human intervention, I would like to reasonably determine whether we have a strong ANSI candidate file before attempting to go through each line of the file looking for 1 of 64 bad characters and then making a guesstimate on whether the whole line or file is bad based on the number of bad characters found...

I was thinking of maybe making a Unicode/UTF check and/or a magic number check, or even trying to check for a few specific application types. The files have no file extensions, so any check would be by examining the content, and any fast way to rule out the file as non-ANSI would be perfect, since the import process needs to process 100-500 records a second.

NOTE: Over 100 different types of bad files have been sent to us, including images and PDFs. So there is a concern about whether you can easily and quickly rule out LOTS of different non-ANSI types rather than specifically targeting just a few...

+5  A: 

Given your example "bad" file types, I'd say put in a series of quick checks on the first few bytes of the file:

  • Does it start with a UTF-16 BOM?
  • Is it "<html" or "<!DOCTYPE"?
  • Is it "<?xml"?
  • Does it have a NUL character?
  • Is it `PK\003\004` (the zip file header)?
  • Is it (whatever Excel files start with; you'll have to look that one up 8-)
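These checks can be sketched roughly like this (the signature table and function name are illustrative, not from your system; the OLE2 magic covers legacy Excel .xls files, while modern .xlsx is a zip and is caught by the `PK` check):

```python
# Cheap byte-signature sniffing: reads only the first few bytes of the file,
# so it stays fast at 100-500 records/second.

SIGNATURES = [
    (b"\xff\xfe", "UTF-16 LE BOM"),
    (b"\xfe\xff", "UTF-16 BE BOM"),
    (b"\xef\xbb\xbf", "UTF-8 BOM"),
    (b"PK\x03\x04", "zip archive (also .xlsx/.docx)"),
    (b"\xd0\xcf\x11\xe0", "OLE2 compound file (legacy .xls/.doc)"),
    (b"%PDF", "PDF"),
    (b"\x1f\x8b", "gzip"),
]

TEXT_PREFIXES = [
    (b"<html", "HTML"),
    (b"<!doctype", "HTML"),
    (b"<?xml", "XML"),
]

def sniff(header: bytes):
    """Return a reason string if the header rules the file out, else None."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    # Text formats: ignore leading whitespace and case before comparing.
    lowered = header.lstrip().lower()
    for prefix, name in TEXT_PREFIXES:
        if lowered.startswith(prefix):
            return name
    return None
```

A file that passes all these checks isn't guaranteed to be Latin-1, but anything that fails one can be rejected before line-by-line scanning.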
RichieHindle
+1 nice answer.
Byron Whitlock
The actual list of bad file types received over the last year exceeds 100. NUL is an allowed character in our ANSI Latin-1 implementation, due to a legacy Ghostscript implementation. I've been reading up on the UTF BOMs, but if I understand them correctly, they're rarely used... Love the other check suggestions, though I wonder how many compression headers I would have to check to cover the most commonly used ones?
Shire
You might also want to check whether the first line would import / looks like your file type.
Byron Whitlock
I'd bet zip, bzip, rar, sit (mac), and gzip would give you > 90% coverage.
Byron Whitlock
Love the new edits... That followed my initial thoughts when I thought the list was fairly small. Then the EDI manager e-mailed me a spreadsheet of all the bad files, with the file types they were later able to determine... Well... I started searching for a way to quickly check for LOTS of file types and rule them out in some way.
Shire
Unfortunately, the first 100 characters are alphanumeric and can be any combination of the 20 fields used to determine the primary keys. I will decode those in the process, and if I can't resolve them, mark the record as an error. However, we allow them to have headers in the export file (please don't ask why), so the rule is that they are allowed to have 20% of the records in the file be bad. I'd rather not have to process 20% of 1 million records to determine that the file is bad...
Shire
Obviously I don't know much about your system, but there may well be a small number of filetypes that might account for half the bad inputs. Start with those, then add the more obscure ones as you learn which are more prevalent. Maybe you'll never get to 100% detection rate, but you ought to be able to make a big difference without a huge amount of effort.
RichieHindle
Can you get away with sequential errors? Like if the first 5000 records error out with no good rows, the chances are pretty good it is junk. That way you don't have to process 20% of 1 million rows.
Byron Whitlock
Well, the current method is that if we have to correct more than 10 characters for a record, it is bad. If we can't parse out the 3 primary keys for the record, it is a bad record, and on the 10th bad record we mark the file as bad and roll back everything. So we have 100% detection already... But it is rather arbitrary and takes some processing, though it is not bad. Since I have to fix the workflow part as well, I'm trying to see if I can get better detection and hopefully a report of what type of bad file we got...
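The bail-out rule described here can be sketched like this (thresholds from the comment; the `parse_keys` and `count_corrections` callables are illustrative stand-ins for the real record logic):

```python
def validate(records, parse_keys, count_corrections,
             max_bad_records=10, max_corrections=10):
    """Return False (file is bad, roll back) once the bad-record threshold is hit."""
    bad = 0
    for record in records:
        # A record is bad if it needed too many corrections
        # or its 3 primary keys could not be parsed out.
        if count_corrections(record) > max_corrections or parse_keys(record) is None:
            bad += 1
            if bad >= max_bad_records:
                return False  # mark the file as bad and roll back everything
    return True
```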
Shire
+2  A: 

I like RichieHindle's answer; it is very good. You should also look at the error handling in your import. If you come across a bad file, capture the error, log it, and move on. You should not be stopping the import of other files, or worse, the import of other customers' files, because of one error in one file... If there were a way to notify the customer by email that the file could not be imported, you might not have to do as much manual intervention.
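A minimal sketch of that isolate-log-notify loop (all names here are illustrative, not from the actual system):

```python
import logging

def import_all(paths, import_file, notify_customer):
    """Import each file independently; one bad file never stops the batch."""
    imported = []
    for path in paths:
        try:
            import_file(path)
            imported.append(path)
        except Exception as exc:  # a narrower exception type is better in practice
            logging.error("Import of %s failed: %s", path, exc)
            notify_customer(path, exc)  # e.g. e-mail the sender, then keep going
    return imported
```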

Byron Whitlock
Just started working for this company last month and have suddenly inherited this program. It is my job to greatly improve the user experience. ;) I've also been looking at magic numbers, and wondering if there is a library available that can detect these reliably without false positives...
Shire
+1  A: 

On a Unix-type system, you'd use the "file" command for this. I wonder if there's a port of "file" to Windows? I couldn't find one in Google, but I'd bet it's available on GNU.org somewhere...

If you happen to have a stockpile of typical "bad" files around, it'd be pretty easy to build a database of file signatures, similar to what "file" uses.

Mark Bessey
+1  A: 

Looking at the first few bytes is a good idea, but sometimes it can lead to false conclusions.

I recall making a CSV file for inserting values into a MySQL database, but I opened it with Excel first to check that everything looked OK.

Excel immediately said "This is a SYLK file, are you sure you know what you're doing?"

I'd never even heard of SYLK files before, but Wikipedia told me it's a CSV file with a header in which the first characters are 'ID,'.

This probably doesn't have any bearing on what you're doing, but I thought I'd point out that magic numbers aren't as magic as they might seem.

pavium
A: 

You can read the beginning of the file into a StreamReader (with byte-order-mark detection enabled) and then check its CurrentEncoding property. Note that CurrentEncoding is only meaningful after the first call to a Read method, since that's when detection happens.

http://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding.aspx

Note that 100% reliably detecting the encoding is theoretically impossible. However, the CurrentEncoding property goes through the best general set of heuristics to make a good guess.
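The BOM-based part of that detection can be illustrated in Python (a sketch of the same idea, not the StreamReader implementation; names are illustrative):

```python
import codecs

# UTF-32 BOMs must be checked before UTF-16, because
# BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(header: bytes):
    """Return the encoding name if a BOM is found, else None."""
    for bom, name in BOMS:
        if header.startswith(bom):
            return name
    return None
```

A file with any BOM is by definition not plain Latin-1, so a non-None result here is an immediate reject; a None result proves nothing either way, which is the theoretical limit the answer mentions.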

RichAmberale