tags:
views: 309
answers: 5

I know this isn't strictly programming related, but I'm hoping for some feedback that helps me out of my misery.

We actually have lots of different data from our web applications, dating back years.

For example, we have:

  • Apache logfiles
  • Daily statistics files from our tracking software (CSV)
  • Another set of daily statistics from nation-wide advertisement rankings (CSV)
  • … and I can probably produce new data from other sources, too.

Some of the data records start in 2005, some in 2006, etc. At some point, however, we have data from all of them.

What I'm drea^H^H^H^Hsearching for is an application that understands all this data: one that lets me load it, compare individual data sets and timelines (graphically), compare different data sets within the same time span, and apply filters (especially to the Apache logfiles). And of course, all of this should be interactive.

The BZ2-compressed Apache logfiles alone are already 21GB in total, growing weekly.

I've had no real success with tools like awstats, Nihu Web Log Analyzer, or similar. They can only produce static reports, but I need to query the information interactively, apply filters, overlay other data sets, etc.

I've also tried data mining tools, e.g. RapidMiner, in the hope that they could help me, but I didn't really succeed in using them (i.e. they're over my head).

Just to make it clear: it can be a commercial application. But I have yet to find something that is really useful.

Somehow I get the impression that I'm searching for something that does not exist, or that I have the wrong approach. Any hints are very welcome.

Update:

In the end it was a mixture of the following things:

  • I wrote bash and PHP scripts to parse and manage the parsing of the log files, including lots of filtering capabilities (see the sketch after this list)
  • I generated plain old CSV files to read into Excel. I'm lucky to have Excel 2007: its graphical capabilities, albeit still working on a fixed set of data, helped a lot
  • I used Amazon EC2 to run the scripts and send me the CSVs via email. I had to crawl through around 200GB of data, and thus used one of the large instances to parallelize the parsing. I had to execute numerous parsing attempts to get the data right; the overall processing took 45 minutes. I don't know what I could have done without Amazon EC2. It was worth every buck I paid for it.
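
For illustration, here is a simplified sketch of the kind of pipeline I ended up with. The file names and the per-day filter are made up, not my actual scripts: it counts requests per day across the BZ2-compressed Apache logs, parsing four files in parallel, and merges the results into one CSV for Excel.

    #!/usr/bin/env bash
    # Sketch only: paths and the per-day counting are illustrative.
    mkdir -p csv

    count_per_day() {
        # In Apache common/combined log format, field 4 looks like
        # "[10/Oct/2008:13:55:36", so the day is everything before the
        # first ":" with the leading "[" stripped off.
        bzcat "$1" | awk '{
            split($4, d, ":")
            day = substr(d[1], 2)
            count[day]++
        } END {
            for (day in count) print day "," count[day]
        }' > "csv/$(basename "$1" .bz2).csv"
    }
    export -f count_per_day

    # Parse the compressed logs four at a time (raise -P to match the
    # number of cores), then merge the per-file counts into one CSV.
    find logs -name 'access_log-*.bz2' -print0 |
        xargs -0 -n 1 -P 4 bash -c 'count_per_day "$0"'
    awk -F, '{ t[$1] += $2 } END { for (d in t) print d "," t[d] }' csv/*.csv > daily_hits.csv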
+1  A: 

Splunk is a product for this sort of thing. I have not used it myself, though. http://www.splunk.com/

Arthur Ulfeldt
A: 

In the interest of full disclosure, I've not used any commercial tools for what you're describing.

Have you looked at LogParser? It might be more manual than what you're looking for, but it will allow you to query many different structured formats.

As for the graphical aspect of it, there are some basic charting capabilities built in, but you're likely to get much more mileage piping the LogParser output into a tabular/delimited format and loading it into Excel. From there you can chart/graph just about anything.
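
For example, something along these lines counts requests per status code from an Apache-style log and writes the result to a CSV you can open in Excel. I'm writing this from memory, so treat it as a sketch; run LogParser -h -i:NCSA to check the exact field names your version exposes.

    REM Sketch from memory: the StatusCode field name is an assumption;
    REM see "LogParser -h -i:NCSA" for the fields of the NCSA input format.
    LogParser.exe -i:NCSA -o:CSV "SELECT StatusCode, COUNT(*) AS Hits INTO status_codes.csv FROM access.log GROUP BY StatusCode ORDER BY Hits DESC"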

As for cross joining different data sources, you can always pump all the data into a database, where you'll have a richer language for querying it.
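
A minimal sketch of that approach with the sqlite3 command-line client (table, column, and file names are made up for illustration, and the CSVs are assumed to have no header row):

    #!/usr/bin/env bash
    # Sketch only: load two header-less daily CSVs into SQLite and
    # join them on the date column.
    sqlite3 stats.db <<'SQL'
    CREATE TABLE hits(day TEXT, requests INTEGER);
    CREATE TABLE rankings(day TEXT, rank INTEGER);
    .mode csv
    .import daily_hits.csv hits
    .import daily_rankings.csv rankings
    .headers on
    .mode column
    SELECT h.day, h.requests, r.rank
    FROM hits h
    JOIN rankings r ON r.day = h.day
    ORDER BY h.day;
    SQL

Once everything is in one database, you can overlay the Apache-derived numbers against the tracking and ranking CSVs over any time span with plain SQL.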

Zach Bonham
A: 

What you're looking for is a "data mining framework", i.e. something that will happily eat gigabytes of somewhat random data and then let you slice'n'dice it in as-yet-unknown ways to find the gold nuggets buried deep inside the static.

Some links:

  • CloudBase: "CloudBase is a high-performance data warehouse system built on top of Map-Reduce architecture. It enables business analysts using ANSI SQL to directly query large-scale log files arising in web site, telecommunications or IT operations."

  • RapidMiner: "RapidMiner already is a full data mining and business intelligence engine which also covers many related aspects ranging from ETL (Extract, Transform & Load) over Analysis to Reporting."

Aaron Digulla
As I said, RapidMiner doesn't really cut it for me. CloudBase reads as very interesting, but judging from the documentation it still looks very raw, like it's a "Lucene" without something built on top to make it usable out of the box, like "Solr". In other words: out-of-the-box support for loading diverse data files (other than via SQL) and a GUI to work with the data don't seem to be part of it. Thx
mark
A: 

The open source data mining and web mining software RapidMiner can import both Apache web server log files and CSV files, and it can also import and export Excel sheets. Rapid-I offers a lot of training courses for RapidMiner, some also on web mining and web usage mining.

A: 

Mark,

What types of analyses did you find useful while parsing your web server/ad data? What interactive features would you most want? I'm considering making a web-based program that could help with all the work you had to do.

Any insight based on your experience would be helpful!

DevX