views: 423

answers: 10

I have a file containing 250 million website URLs, each with an IP address, page title, country name, server banner (e.g. "Apache"), response time (in ms), number of images and so on. At the moment, these records are in a 25 GB flat file.

I'm interested in generating various statistics from this file, such as:

  • number of IP addresses represented per country
  • average response time per country
  • number of images vs. response time

etc etc.

My question is, how would you achieve this type and scale of processing, and what platform and tools would you use (in a reasonable time)?

I am open to all suggestions, from MS SQL on Windows to Ruby on Solaris :-) Bonus points for DRY (don't repeat yourself); I'd prefer not to write a new program each time a different cut is required.

Any comments on what works and what's to be avoided would be greatly appreciated.

+4  A: 

Load the data into a table in a SQL Server (or any other mainstream db) database, and then write queries to generate the statistics you need. You would not need any tools other than the database itself and whatever UI is used to interact with the data (e.g. SQL Server Management Studio for SQL Server, TOAD or SqlDeveloper for Oracle, etc.).
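
For illustration only, here is a minimal sketch of what the table and a couple of the requested statistics might look like in SQL (the SiteRecords table and its column names are made up for this example, not taken from the question):

    -- Hypothetical table holding one row per crawled URL
    CREATE TABLE SiteRecords (
        Url            VARCHAR(2048),
        IpAddress      VARCHAR(45),
        PageTitle      NVARCHAR(512),
        Country        VARCHAR(64),
        ServerBanner   VARCHAR(128),
        ResponseTimeMs INT,
        ImageCount     INT
    );

    -- Number of distinct IP addresses represented per country
    SELECT Country, COUNT(DISTINCT IpAddress) AS IpCount
    FROM SiteRecords
    GROUP BY Country;

    -- Average response time per country
    SELECT Country, AVG(ResponseTimeMs) AS AvgResponseMs
    FROM SiteRecords
    GROUP BY Country;

Each new "cut" of the data is then just another query against the same table, which keeps things reasonably DRY.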

MusiGenesis
A: 

25 GB of flat file. I don't think writing your own component to read this file is a good idea.

I would suggest going with a SQL import and getting all the data into SQL Server. I agree that it will take ages to load, but once it is there you can do anything you want with the data.

I hope that once you have this data in the DB, future updates will be deltas of information rather than another 25 GB flat file.
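
As a rough sketch of the import step, assuming SQL Server and a tab-delimited file (the file path, table name and options here are assumptions, not details from the question):

    -- Bulk-load the flat file into a staging table
    BULK INSERT SiteRecords
    FROM 'C:\data\urls.txt'
    WITH (
        FIELDTERMINATOR = '\t',
        ROWTERMINATOR   = '\n',
        TABLOCK,
        BATCHSIZE = 100000
    );

Loading in batches with a table lock tends to keep the load manageable, but the right settings depend on the file format and the recovery model in use.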

Pradeep
+7  A: 

Step 1: get the data into a DBMS that can handle the volume of data. Index appropriately.

Step 2: use SQL queries to determine the values of interest.

You'll still need to write a new query for each separate question you want answered. However, I think that is unavoidable. It should save you replicating the rest of the work.

Edited:

Note that although you probably can do a simple upload into a single table, you might well get better performance out of the queries if you normalize the data after loading it into the single table. This isn't completely trivial, but will likely reduce the volume of data. Making sure you have a good procedure (which will probably not be a stored procedure) for normalizing the data will help.
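
As one possible example of that kind of normalization (SQL Server syntax; the SiteRecords table and column names are hypothetical), the repeated country names could be factored out into a lookup table:

    -- Build a country lookup table from the loaded data
    CREATE TABLE Countries (
        CountryId INT IDENTITY(1,1) PRIMARY KEY,
        Name      VARCHAR(64) UNIQUE
    );

    INSERT INTO Countries (Name)
    SELECT DISTINCT Country FROM SiteRecords;

    -- Replace the wide text column with a small integer key
    ALTER TABLE SiteRecords ADD CountryId INT;
    GO  -- separate batch so the new column is visible to the UPDATE below

    UPDATE s
    SET s.CountryId = c.CountryId
    FROM SiteRecords s
    JOIN Countries c ON c.Name = s.Country;

    ALTER TABLE SiteRecords DROP COLUMN Country;

The same pattern applies to other repetitive text columns, such as the server banner.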

Jonathan Leffler
Great note about the normalization!
Mitchel Sellers
Normalizing doesn't always mean better performance (less data to store--yes). See http://highscalability.com/mother-all-database-normalization-debates-coding-horror and http://kottke.org/04/10/normalized-data . To your credit, you qualified with "might well get better performance".
micahwittman
in this case, i'd guess that a good normalisation would really help with performance, since it can turn most classification tasks into an integer comparison.
Javier
+1  A: 

One thing to keep in mind when you're importing the data is to try to create indexes that will allow you to do the kinds of queries you want to do. Think about what sort of fields you will be querying on and what those queries might look like. That should help you decide what indexing you will need.
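
To make that concrete, a couple of hedged examples geared to the country-based questions in the post (the index and table names are made up):

    -- Supports "IP addresses per country" style aggregations
    CREATE INDEX IX_SiteRecords_Country ON SiteRecords (Country, IpAddress);

    -- Supports "average response time per country"
    CREATE INDEX IX_SiteRecords_Country_Response ON SiteRecords (Country, ResponseTimeMs);

Whether these actually pay for their storage and load-time cost depends on how often each cut is run.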

+3  A: 

If you happen to use Windows, take a look at Log Parser. It can be found as a standalone download and is also included as part of the IIS Resource Kit.

Log Parser can read your logs and upload them to the Database.


Database Considerations:

For your database server you will want something fast (Microsoft SQL Server, IBM's DB2, PostgreSQL or Oracle). MySQL might be useful too, but I have no experience using it with large databases.

You will want all the memory you can afford. If you will be using the database with regularity, I'd say 4 GB at least. It can be done with less, but you WILL notice a big difference in performance.

Also, go for multicore/multi-CPU servers if you can afford it and, again, if you will be using this database with regularity.

Another recommendation is to analyze the kind of queries you will be doing and plan the indexes accordingly. Remember: every index you create will require additional storage space.

Of course, turn off indexing or even drop the indexes before massive data load operations. That will make the load a lot faster. Re-index or re-create the indexes after the data load operation.
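
In SQL Server terms that load pattern might look roughly like this (the index and table names follow the hypothetical examples above):

    -- Disable a nonclustered index before the big load...
    ALTER INDEX IX_SiteRecords_Country ON SiteRecords DISABLE;

    -- ...run the bulk load here...

    -- ...then rebuild it afterwards
    ALTER INDEX IX_SiteRecords_Country ON SiteRecords REBUILD;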

Now, if this database will be an ongoing operation (i.e. it is not just for investigating/analyzing something and then being discarded), you may want to design a database schema with catalog and detail tables. This is called database normalization, and the exact amount of normalization you will want depends on the usage pattern (data load operations versus query operations). An experienced DBA is a must if this database will be used on an ongoing basis and has performance requirements.


P.S.

I will take the risk of including something obvious here, but...

I think you may be interested in a log analyzer. These are programs that generate statistics from web server log files (some can also analyze FTP, SFTP and mail server log files).

Web log analyzers generate reports with the statistics. Usually the reports are generated as HTML files and include graphics. There is a fair variety in depth of analysis and options. Some are very customizable and some are not. You will find both commercial products and open source.

For the amount of data you will be managing, double check each candidate product and take a closer look at its speed and ability to handle that volume.

vmarquez
on read-often/write-seldom patterns, MySQL really flies.
Javier
A: 

You haven't said how the data in your flat file is organised. The RDBMS suggestions are sensible, but they presume that your flat file is formatted in some delimited way and that a DB import is a relatively simple task. If that is not the case, then you first have the daunting task of decomposing the data cleanly into a set of fields on which you can do your analysis.

I'm going to presume that your data is not a nice CSV or TXT file, since you haven't said either way and nobody else has answered this part of the possible problem.

If the data have a regular structure, even without nice clean field delimiters, you may be able to turn an ETL tool such as Informatica loose on the job. Since you are a techy and this is a one-off job, you should definitely consider writing some code of your own which does some regex comparisons to extract the parts that you want and spits out a file which you can then load into a database. Either way you are going to have to invest some significant effort in parsing and cleansing your data, so don't think of this as an easy task.

If you do write your own code then I would suggest you choose a compiled language and make sure you process the data a single row at a time (or in a way that buffers the reads into manageable chunks).

Either way you are going to have a pretty big job making sure that the results of any process you apply to the data have been consistently executed; you don't want IP addresses turning up as decimal numbers in your calculations. On data of that scale it can be hard to detect a fault like that.

Once you have parsed it then I think that an RDBMS is the right choice to store and analyse your data.
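
On the consistency point, once the data is loaded a few cheap sanity-check queries can catch gross parsing faults early. A rough example (column names assumed; the pattern is only a crude check, not real IP validation):

    -- Rows whose IP address field does not even look like a dotted quad
    SELECT COUNT(*) AS SuspectRows
    FROM SiteRecords
    WHERE IpAddress NOT LIKE '%.%.%.%';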

Simon
A: 

Is this a one-time thing or will you be processing things on a daily or weekly basis? Either way, check out vmarquez's answer; I've heard great things about Log Parser. Also check out http://awstats.sourceforge.net/ - it's a full-fledged web stats application.

jms
A: 

SQL Server Analysis Services is designed for doing exactly that type of data analysis. The learning curve is a bit steep, but once you set up your schema you will be able to do any kind of cross-cutting queries that you want very quickly.

Jason DeFontes
A: 

If you have more than one computer at your disposal, this is a perfect job for MapReduce.

Adam Rosenfield
A: 

Sounds like a job for Perl to me. Just keep count of the stats you want. Use a regex to parse each line. It would probably take less than 10 minutes to parse a file of that size. My computer reads through a 2 GB file (13 million lines) in about 45 seconds with Perl.

bruceatk