I know this isn't strictly programming related, but I'm hoping for some feedback that helps me out of my misery.
We actually have lots of different data from our web applications, dating back years.
For example, we have
- Apache logfiles
- Daily statistics files from our tracking software (CSV)
- Daily statistics from nation-wide advertisement rankings (CSV)
- ... and I could probably produce data from other sources, too.
Some of the data records start in 2005, some in 2006, etc. At some point, however, we have data from all of them.
What I'm drea^H^H^H^Hsearching for is an application that understands all this data, lets me load it, compare individual data sets and timelines (graphically), compare different data sets within the same time span, and filter (especially the Apache logfiles); and of course all of this should be interactive.
Just the BZ2-compressed Apache logfiles are already 21 GB in total, and growing weekly.
I've had no real success with things like AWStats, Nihuo Web Log Analyzer, or similar tools. They can only produce static reports, but I would need to interactively query the information, apply filters, overlay other data sets, etc.
I've also tried data mining tools in the hope they could help, but didn't really succeed in using them (i.e. they're over my head), e.g. RapidMiner.
Just to be clear: it can be a commercial application. But I have yet to find something that is really useful.
Somehow I get the impression that I'm searching for something that doesn't exist, or that I have the wrong approach. Any hints are very welcome.
Update:
In the end it was a mixture of the following things:
- I wrote bash and PHP scripts to parse and manage parsing of the log files, including lots of filtering capabilities
- I generated plain old CSV files to read into Excel. I'm lucky to have Excel 2007, and its graphing capabilities, albeit still limited to a fixed set of data, helped a lot
- I used Amazon EC2 to run the scripts and email me the resulting CSV. I had to crawl through around 200 GB of data, so I used one of the large instances to parallelize the parsing. I needed numerous parsing attempts to get the data right; the overall processing took about 45 minutes. I don't know what I would have done without Amazon EC2. It was worth every buck I paid for it.
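To give an idea of the kind of parsing the scripts above did, here is a minimal, self-contained sketch, not my actual scripts: it aggregates HTTP 200 hits per day from a BZ2-compressed Apache log into a CSV that Excel can chart. The sample log, filename, and field positions (Common Log Format) are illustrative assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail
# Sketch only: count HTTP 200 requests per day in a BZ2-compressed
# Apache log (Common Log Format assumed) and write a CSV for Excel.

# Tiny sample log so the sketch runs stand-alone (illustrative data).
cat > access_log.sample <<'EOF'
1.2.3.4 - - [10/Oct/2008:13:55:36 +0200] "GET / HTTP/1.1" 200 2326
1.2.3.4 - - [10/Oct/2008:14:01:12 +0200] "GET /x HTTP/1.1" 404 209
5.6.7.8 - - [11/Oct/2008:09:10:11 +0200] "GET / HTTP/1.1" 200 512
EOF
bzip2 -f access_log.sample   # produces access_log.sample.bz2

echo "date,requests" > daily_hits.csv
bzcat access_log.sample.bz2 \
  | awk '$9 == 200 {                 # field 9 = status code in CLF
           split($4, t, ":")         # field 4 = [day/Mon/year:HH:MM:SS
           day = substr(t[1], 2)     # strip the leading "["
           count[day]++
         }
         END { for (d in count) print d "," count[d] }' \
  | sort >> daily_hits.csv

cat daily_hits.csv
```

The real filtering (by URL, referrer, status class, etc.) is just more conditions in the awk block; the same pipeline pattern scales to many files via `bzcat access_log*.bz2`, which is what made it easy to parallelize per-file on EC2.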