For a huge number of huge CSV files (100M+ lines each) from different sources, I need a fast snippet or library to auto-guess the date format and convert it to broken-down time (struct tm) or a Unix timestamp. Once a format has been guessed successfully, the snippet must keep checking subsequent occurrences of the date field for validity, because the date format is likely to change throughout a file.

The set of candidate date formats must be configurable, but compiling an optimal decision tree (or similar) from a given list of date formats up front is fine.

I've come to the conclusion that nothing of the kind exists, but I have yet to do proper "market research", hence my question.

My first attempt was to mimic getdate() for the 23 different date formats I've observed so far, and to replace the general number parsers with optimised versions that take date-specific characteristics into account (no '4' to '9' in the tens digit of the day part, no '2' to '9' in the tens digit of the month part, etc.).
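As an illustration, here is a minimal sketch in C of one such specialised field parser. It assumes a fixed two-digit field layout, and the helper names are invented for this example, not the actual 23-format implementation:

    #include <stdbool.h>

    /* Parse a two-digit day (01..31) without a general number parser.
     * The tens digit can only be '0'..'3', so bad input is rejected
     * after a single comparison. */
    static bool parse_day2(const char *s, int *day)
    {
        if (s[0] < '0' || s[0] > '3' || s[1] < '0' || s[1] > '9')
            return false;
        *day = (s[0] - '0') * 10 + (s[1] - '0');
        return *day >= 1 && *day <= 31;
    }

    /* Same trick for the month: the tens digit is '0' or '1'. */
    static bool parse_month2(const char *s, int *mon)
    {
        if ((s[0] != '0' && s[0] != '1') || s[1] < '0' || s[1] > '9')
            return false;
        *mon = (s[0] - '0') * 10 + (s[1] - '0');
        return *mon >= 1 && *mon <= 12;
    }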

Has anyone faced a similar problem, or even produced code of this kind?

+1  A: 

I dealt with timestamped sensor data (structurally CSV) in over fifty formats from numerous sources with a Perl script. I was never constrained for functionality, and although it is script-based it was reasonably quick (>10K lines/sec, with lines of ~60-100 chars). I implemented:

a) analyse the first couple of hundred lines, rewind, then do the real run, to build up context for the decision logic;

b) emit erroneous lines with line number and context, so that at the end of the run I could edit the offending lines and mark them for re-insertion on a subsequent run; the file would then pass "patched" and error-free, i.e. every line matched a format;

c) check the time difference between lines, allowing only increasing timestamps;

d) reformat other things along the way, such as converting units from imperial to SI.

Although I come from the C camp, simple Perl is not too alien, and it made this so much easier. Note: this method could resolve ambiguities like 10/04/05 (DD/MM/YY or MM/DD/YY?) if there was enough information in the file. A rough sketch of points (a)-(c) follows.
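The Perl script itself isn't shown here, so the following is only a minimal C sketch of points (a)-(c) under assumed conditions: a small strptime() candidate table stands in for the real decision logic, and the two formats listed are illustrative, not the actual fifty-plus.

    #define _XOPEN_SOURCE 700        /* for strptime() */
    #include <stdio.h>
    #include <time.h>

    /* Candidate formats -- illustrative only, not the actual list. */
    static const char *fmts[] = {
        "%Y-%m-%d %H:%M:%S",
        "%d/%m/%Y %H:%M:%S",
    };
    #define NFMTS (sizeof(fmts) / sizeof(fmts[0]))

    /* (a) Sample the first nsample lines and return the index of the
     * first format that matched every one of them, or -1. */
    static int guess_format(FILE *fp, int nsample)
    {
        char line[256];
        int dead[NFMTS] = { 0 };

        for (int i = 0; i < nsample && fgets(line, sizeof(line), fp); i++)
            for (size_t f = 0; f < NFMTS; f++) {
                struct tm tm = { 0 };
                if (!dead[f] && strptime(line, fmts[f], &tm) == NULL)
                    dead[f] = 1;
            }
        for (size_t f = 0; f < NFMTS; f++)
            if (!dead[f])
                return (int)f;
        return -1;
    }

    int process(FILE *fp)
    {
        char line[256];
        long lineno = 0;
        time_t prev = 0;

        int fmt = guess_format(fp, 200);
        if (fmt < 0)
            return -1;
        rewind(fp);    /* (a) rewind, then do the real run */

        while (fgets(line, sizeof(line), fp)) {
            struct tm tm = { 0 };
            lineno++;
            if (strptime(line, fmts[fmt], &tm) == NULL) {
                /* (b) emit the offending line with its number */
                fprintf(stderr, "line %ld: no match: %s", lineno, line);
                continue;
            }
            tm.tm_isdst = -1;        /* let mktime() determine DST */
            time_t cur = mktime(&tm);
            if (cur < prev) {
                /* (c) only allow increasing timestamps */
                fprintf(stderr, "line %ld: goes backwards: %s", lineno, line);
                continue;
            }
            prev = cur;
            /* ... hand the valid line on for processing ... */
        }
        return 0;
    }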

Roaker
Nice one, mind sharing your code? 10K lines/sec is roughly what I had in mind. I've started on a simple compiler that takes up to 64 format specs and emits code to do incremental refinement, so in the end I get a bit mask where a set bit denotes that the corresponding format spec held throughout the whole file (sketched below). I could do with your code to validate mine.
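The compiler itself isn't shown; the following is a minimal sketch of the refinement step it would emit, with strptime()-based matchers standing in as placeholders for the generated per-spec code (the names and format list are made up for this example):

    #define _XOPEN_SOURCE 700        /* for strptime() */
    #include <stdint.h>
    #include <stdbool.h>
    #include <time.h>

    /* Up to 64 format specs, one bit of the mask each.  These
     * strptime() matchers are placeholders for the emitted code. */
    static const char *specs[] = { "%Y-%m-%d", "%d.%m.%Y", "%m/%d/%y" };
    #define NSPECS (sizeof(specs) / sizeof(specs[0]))

    static bool spec_matches(size_t s, const char *field)
    {
        struct tm tm = { 0 };
        return strptime(field, specs[s], &tm) != NULL;
    }

    /* Called once per line: clear the bit of every spec that fails
     * on this line's date field.  A bit still set after the whole
     * file denotes a spec that held throughout. */
    static uint64_t refine(uint64_t live, const char *field)
    {
        for (size_t s = 0; s < NSPECS; s++)
            if ((live & (UINT64_C(1) << s)) && !spec_matches(s, field))
                live &= ~(UINT64_C(1) << s);
        return live;
    }

    /* Start with the low NSPECS bits set, e.g. for NSPECS < 64:
     * uint64_t live = (UINT64_C(1) << NSPECS) - 1; */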
hroptatyr
If you get a page of test data to me, I could knock up a demonstrator, just to ease you along the learning curve. You might want to brush up on your regular expressions, as these are used to select the desired data.
Roaker
I picked 6 sources, 10,000 lines each: http://qaos.math.tu-berlin.de/~freundt/for_roaker.tar.bz2
hroptatyr
+1  A: 

After two weeks of excessive googl^Wweb browsing, I came to the conclusion that I have to write this one myself. FWIW, my first go at it: http://github.com/hroptatyr/glod

hroptatyr