We have run some scripts that use LogParser to dump our IIS logs into a SQL Server database.

We can then query this to get simple stats on hits, usage, etc. It's also useful when linked to error log and performance counter databases to compare usage with errors, and so on.
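
For reference, the import is essentially a one-liner along these lines (server, database, and table names are examples):

    LogParser "SELECT * INTO IISLogs FROM ex*.log" ^
        -i:IISW3C -o:SQL -server:MYSERVER ^
        -database:WebLogs -driver:"SQL Server" -createTable:ON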

Having implemented this for just one system, after only 2-3 weeks we already have a 5GB database with around 10 million records.

This is making any queries to this database quite slow and will no doubt cause storage issues if we continue to log as we are.

Can anyone suggest an alternative data store that would be more efficient for logs like these? I'd be particularly interested in any experience of Google's BigTable or Amazon's SimpleDB.

Is either of these suitable for reporting queries - COUNTs, GROUP BYs, PIVOTs?

A: 

How often have you updated your indexes? What sort of queries on the data are you performing?

Perhaps you could collate the data routinely at the end of every day into new summary tables, to speed up other queries?

For example, a page-hits table might have one record per page per day recording how many times that page was hit - that way you don't have to do a full table scan on every query; you just hit the small page-hits table.
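
As a rough sketch, assuming the raw table is called IISLogs and keeps Log Parser's column names:

    CREATE TABLE PageHitsDaily (
        HitDate   datetime     NOT NULL,
        UriStem   varchar(255) NOT NULL,
        Hits      int          NOT NULL,
        BytesSent bigint       NOT NULL
    );

    -- Run nightly: collapse yesterday's raw rows into one row per page.
    INSERT INTO PageHitsDaily (HitDate, UriStem, Hits, BytesSent)
    SELECT   [date], [cs-uri-stem], COUNT(*), SUM(CAST([sc-bytes] AS bigint))
    FROM     IISLogs
    WHERE    [date] = DATEADD(day, DATEDIFF(day, 0, GETDATE()) - 1, 0) -- yesterday
    GROUP BY [date], [cs-uri-stem];

Reports then hit this small table instead of millions of raw rows.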

A unique-hosts table might have records for linger time, how many pages each host hit, number of files downloaded, total bandwidth, session abandonment, and unique cookies (to distinguish different users who may be behind proxies or firewalls).

What sort of purge schedule are you planning, if any?

While it's nice to keep all that data around forever, especially for the things you haven't thought of yet, the vast majority of what you want is in the collated data. Build your reports around that, and keep the raw data only for those cases where you really need something unique.
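
A purge along those lines is straightforward once the rollups exist - a sketch, again assuming a raw IISLogs table:

    -- Delete in batches so each transaction (and the log) stays small.
    WHILE 1 = 1
    BEGIN
        DELETE TOP (10000) FROM IISLogs
        WHERE [date] < DATEADD(month, -3, GETDATE());

        IF @@ROWCOUNT = 0 BREAK;
    END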

This is all stuff you'd have to build with a key-value store (like SimpleDB or BigTable) anyway.

Chris Kaminski
Currently all the data is stored, and a few "ad-hoc" queries have been run for some example reports. Once we are clearer on what data we would like, we would absolutely run these reports, store the results, and clear out the raw data. For now, however, we need to keep the raw data until we know what we want. Also, I would still be unhappy about clearing out data younger than, say, 3 months, and at this rate we would have to. Hence considering a "cloud" option.
Robin Day
Honestly, you're going to pay more for cloud storage than you would for a cheap server from Dell with some SATA drives in it. Do the math; anything approaching 300-500GB/month is just not worth it.
Chris Kaminski
A: 

I think storage costs are going to be your biggest concern. Even if you go the cloud route, I doubt you will be able to manage costs for that amount of data. My suggestion would be to move the data to ultra-cheap storage and deploy a solution that can operate on that data in an efficient manner.

For example, you could move log files from your server(s) to a local machine with a huge hard drive (and an appropriate backup solution) and then run tools locally to analyze the data. Log Parser is effective as long as you operate against a small subset of that data. You could run a database locally, but even optimized queries might run slowly.
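
For instance, Log Parser can aggregate straight off the raw files with no database involved at all (the path is illustrative):

    LogParser -i:IISW3C "SELECT cs-uri-stem, COUNT(*) AS Hits FROM D:\LogArchive\ex*.log GROUP BY cs-uri-stem ORDER BY Hits DESC"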

You might consider purchasing a log analysis tool like WebLog Expert to operate against these files.

Mayo
+1  A: 

I also faced a similar problem. Since the log files were growing so fast, I started to wonder whether a database is really suitable for IIS logs. There are two points you may need to consider:

  1. In most cases the IIS log cannot provide useful information directly; we need to parse it to obtain the statistical information.
  2. Also, in most cases, IIS logs do not need to sit in a database, ready for querying.

It is recommended to keep all logs in files, as they always have been, but to store weekly or monthly statistics (processed periodically) in the database, so that the essential data is ready to serve.
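
For example, a scheduled Log Parser job can compute the monthly statistics from the flat files and push only the summary rows into SQL Server (server, database, and table names here are illustrative):

    LogParser -i:IISW3C -o:SQL -server:MYSERVER -database:WebStats -createTable:ON ^
        "SELECT TO_STRING(date, 'yyyy-MM') AS Month, cs-uri-stem, COUNT(*) AS Hits INTO MonthlyPageStats FROM \\webserver\logs\ex*.log GROUP BY Month, cs-uri-stem"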

ZEAXIF
+1 - By extracting the information you need from the log files on a regular basis, the likelihood of your needing to access the raw log files decreases with time. You might need to hit something from the past month or two for security reasons or to tweak your queries, but will you ever need to hit a log file from two years back?
Mayo