For a large number of huge CSV files (100M+ lines each) from different sources, I need a fast snippet or library that auto-guesses the date format and converts each date to broken-down time or a Unix timestamp. Once a format has been guessed, the snippet must still check every subsequent occurrence of the date field for validity, because the date format is likely to change partway through a file.
The set of candidate date formats must be configurable, but it would be fine to compile it ahead of time into an optimal decision tree or something similar.
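To make the requirement concrete, here is a minimal sketch of the guess-then-validate loop, written in Python for brevity rather than speed. The candidate list and the function names (`try_parse`, `guess_format`, `convert_column`) are made up for illustration, and it tries the formats linearly instead of using a compiled decision tree.

```python
from datetime import datetime, timezone

# Hypothetical candidate list; the real one would hold the observed formats.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y", "%Y%m%d"]


def try_parse(text, fmt):
    """Return a datetime if text matches fmt, else None."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None


def guess_format(text, formats=CANDIDATE_FORMATS):
    """Return the first candidate format that parses text, or None."""
    for fmt in formats:
        if try_parse(text, fmt) is not None:
            return fmt
    return None


def convert_column(values, formats=CANDIDATE_FORMATS):
    """Yield (unix_timestamp, format) for each value.

    The last successful format is tried first; the format is only
    re-guessed when that fails. Values that match no candidate
    yield (None, None). Dates are assumed to be UTC.
    """
    current = None
    for text in values:
        parsed = try_parse(text, current) if current else None
        if parsed is None:
            # Format changed (or nothing guessed yet): guess again.
            current = guess_format(text, formats)
            parsed = try_parse(text, current) if current else None
        if parsed is None:
            yield None, None
        else:
            stamp = int(parsed.replace(tzinfo=timezone.utc).timestamp())
            yield stamp, current
```

Note that the first matching candidate wins, so an ambiguous value such as `01/02/2011` is resolved purely by the order of the list; a real solution would probably need to look at several rows before committing to a format.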
I've come to the conclusion that nothing of the kind exists, but I have yet to do proper "market research", hence my question.
My first attempt was to mimic getdate() for the 23 different date formats I've observed so far, and to replace the number parsers with optimised versions that take date-specific characteristics into account (no '4' to '9' in the tens digit of the day, no '2' to '9' in the tens digit of the month, etc.).
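As an illustration of what I mean by a date-aware number parser, here is a rough Python sketch (the actual code would be C; the names `parse_two_digits`, `parse_day` and `parse_month` are made up for this example). It rejects a field as soon as the tens digit is out of range, before doing any further work.

```python
def parse_two_digits(text, pos, max_tens, low, high):
    """Parse text[pos:pos+2] as a two-digit number.

    Returns None if the field is too short, the tens digit exceeds
    max_tens, the units digit is not 0-9, or the value is outside
    the range low..high.
    """
    if pos + 2 > len(text):
        return None
    tens = ord(text[pos]) - ord("0")
    units = ord(text[pos + 1]) - ord("0")
    if not (0 <= tens <= max_tens and 0 <= units <= 9):
        return None
    value = tens * 10 + units
    return value if low <= value <= high else None


def parse_day(text, pos=0):
    # Tens digit of a day can only be 0-3.
    return parse_two_digits(text, pos, 3, 1, 31)


def parse_month(text, pos=0):
    # Tens digit of a month can only be 0 or 1.
    return parse_two_digits(text, pos, 1, 1, 12)
```

For example, `parse_day("41")` fails on the first character alone, while `parse_month("2011-03-04", 5)` reads the `03` at offset 5. This only checks each field in isolation; it does not catch combinations such as 31 February.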
Has anyone faced a similar problem, or even written code of this kind?