I have a data import project in which clients send ANSI/Latin-1 encoded files (ISO-8859-1). However, on a roughly weekly basis we get a surprise file that is not in the correct format, and the import dies horribly and needs manual intervention to recover and move on. The most common bad file formats seem to be Excel, compressed files, or XML/HTML.
So, to reduce that human intervention, I would like a way to reasonably determine whether we have a strong ANSI candidate file before going through each line looking for any of 64 bad characters and then guesstimating, from the number of bad characters found, whether the whole line or file is bad.
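To make the per-line pass concrete, here is a minimal Python sketch of the counting step. The actual 64-character bad set isn't specified above, so as an assumption it uses the bytes that ISO-8859-1 text normally never contains (C0 controls other than tab/LF/CR, plus DEL and the C1 range), which comes to roughly that many codes; the 5% threshold is likewise just a placeholder to tune.

```python
# Assumed "bad byte" set: C0 controls minus tab/LF/CR, plus 0x7F-0x9F.
# Swap in the real 64-character list for production use.
BAD_BYTES = frozenset(range(0x00, 0x20)) - {0x09, 0x0A, 0x0D}
BAD_BYTES = BAD_BYTES | frozenset(range(0x7F, 0xA0))

def count_bad_bytes(line: bytes) -> int:
    """Count bytes in one line that fall in the disallowed set."""
    return sum(b in BAD_BYTES for b in line)

def line_looks_bad(line: bytes, threshold: float = 0.05) -> bool:
    """Flag a line when more than `threshold` of its bytes are bad."""
    return bool(line) and count_bad_bytes(line) / len(line) > threshold
```

This is the slow path the question wants to avoid running on obviously bad files; the fast pre-checks below would run first.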
I was thinking of maybe doing a Unicode/UTF check and/or a magic-number check, or even trying to check for a few specific application types. The files have no extensions, so any check would have to examine the content, and any fast way to rule a file out as non-ANSI would be perfect, since the import process needs to handle 100-500 records a second.
NOTE: Over 100 different types of bad files have been sent to us, including images and PDFs. So the concern is whether you can easily and quickly rule out LOTS of different non-ANSI types, rather than specifically targeting just a few.