We're thinking about setting up a data warehouse system and loading it with the web access logs that our web servers generate. The idea is to load the data in real time.

To the user we want to present a line graph of the data and enable the user to drill down using the dimensions.

The question is how to balance and design the system so that:

(1) the data can be fetched and presented to the user in real time (<2 seconds),

(2) the data can be aggregated on a per-hour and per-day basis, and

(3) as much data as possible can still be stored in the warehouse.

Our current data rate is roughly ~10 accesses per second, which gives us ~800k rows per day. My simple tests with MySQL and a simple star schema show that my queries start to take longer than 2 seconds once we have more than 8 million rows.

Is it possible to get real-time query performance from a "simple" data warehouse like this and still have it store a lot of data (it would be nice to never have to throw away any data)?

Are there ways to aggregate the data into lower-resolution tables?

I have a feeling that this isn't really a new question (I've googled quite a lot, though). Could someone maybe give pointers to data warehouse solutions like this? One that comes to mind is Splunk.

Maybe I'm asking for too much.

UPDATE

My schema looks like this (a rough DDL sketch follows after the list):

  • dimensions:

    • client (IP address)
    • server
    • url
  • facts:

    • timestamp (in seconds)
    • bytes transmitted
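For reference, a minimal MySQL sketch of that star schema could look something like the following; all table and column names here are illustrative guesses, not the actual schema.

    -- Hypothetical star schema for the access-log warehouse (names are illustrative)
    CREATE TABLE dim_client (
      client_id  INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
      ip_address VARCHAR(45) NOT NULL,            -- long enough for IPv6 text form
      UNIQUE KEY uk_ip (ip_address)
    );

    CREATE TABLE dim_server (
      server_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
      hostname  VARCHAR(255) NOT NULL
    );

    CREATE TABLE dim_url (
      url_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
      url    VARCHAR(2048) NOT NULL
    );

    CREATE TABLE fact_access (
      ts         DATETIME NOT NULL,               -- per-second timestamp
      client_id  INT UNSIGNED NOT NULL,
      server_id  INT UNSIGNED NOT NULL,
      url_id     INT UNSIGNED NOT NULL,
      bytes_sent BIGINT UNSIGNED NOT NULL,
      KEY k_ts (ts)
    );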
+1  A: 

Doesn't sound like it would be a problem. MySQL is very fast.

For storing logging data, use MyISAM tables -- they're much faster and well suited for web server logs. (I think InnoDB is the default for new installations these days - foreign keys and all the other features of InnoDB aren't necessary for the log tables). You might also consider using merge tables - you can keep individual tables to a manageable size while still being able to access them all as one big table.
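A minimal sketch of the merge-table idea, assuming one MyISAM table per month (the layout and names are only an example):

    -- Hypothetical monthly MyISAM tables; MERGE requires identical definitions
    CREATE TABLE access_2010_01 (
      ts         DATETIME NOT NULL,
      client_id  INT UNSIGNED NOT NULL,
      server_id  INT UNSIGNED NOT NULL,
      url_id     INT UNSIGNED NOT NULL,
      bytes_sent BIGINT UNSIGNED NOT NULL,
      KEY k_ts (ts)
    ) ENGINE=MyISAM;

    CREATE TABLE access_2010_02 LIKE access_2010_01;

    -- One logical table over all the monthly tables; new rows go to the last one
    CREATE TABLE access_all (
      ts         DATETIME NOT NULL,
      client_id  INT UNSIGNED NOT NULL,
      server_id  INT UNSIGNED NOT NULL,
      url_id     INT UNSIGNED NOT NULL,
      bytes_sent BIGINT UNSIGNED NOT NULL,
      KEY k_ts (ts)
    ) ENGINE=MERGE UNION=(access_2010_01, access_2010_02) INSERT_METHOD=LAST;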

If you're still not able to keep up, then get yourself more memory, faster disks, a RAID, or a faster system, in that order.

Also: Never throwing away data is probably a bad idea. If each line is about 200 bytes long, you're talking about a minimum of 50 GB per year, just for the raw logging data. Multiply by at least two if you have indexes. Multiply again by (at least) two for backups.

You can keep it all if you want, but in my opinion you should consider storing the raw data for a few weeks and the aggregated data for a few years. For anything older, just store the reports. (That is, unless you are required by law to keep it around. Even then, it probably won't be for more than 3-4 years.)

Seth
Thanks for the answer. Will research MySQL a bit more. The idea was to use a star schema where the timestamp of the log line is put in the fact table. This keeps per-log-entry data low, but how does one go about aggregating that kind of data? A client will most likely never ever request the same entity again, so the same (client, asset, server) row will never exist twice in the table.
jrydberg
For collecting data, I'd make a single table with a bunch of columns - if you do anything fancier than that, you have to spend time opening other tables and doing computations while your server is under load. Since you indicated that you were already having trouble keeping up, you probably want to simplify as much as possible. If you set up a slave server to do normalization and aggregation (decoupling reporting from logging), then you can reduce the master load even further.
Seth
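As a rough sketch of the single wide collection table described above (the column set is only a guess at what a raw log row might carry), with normalization and aggregation happening later, e.g. on a replica:

    -- Hypothetical raw collection table: no lookups or joins at insert time
    CREATE TABLE raw_access_log (
      ts          DATETIME NOT NULL,
      client_ip   VARCHAR(45)   NOT NULL,
      server_name VARCHAR(255)  NOT NULL,
      url         VARCHAR(2048) NOT NULL,
      bytes_sent  BIGINT UNSIGNED NOT NULL
    ) ENGINE=MyISAM;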
+1  A: 

Seth's answer above is very reasonable, and I feel confident that if you invest in the appropriate knowledge and hardware, it has a high chance of success.

Mozilla does a lot of web service analytics. We keep track of details on an hourly basis and we use a commercial DB product, Vertica. It would work very well for this approach but since it is a proprietary commercial product, it has a different set of associated costs.

Another technology you might want to investigate is MongoDB. It is a document-store database with a few features that make it a potentially great fit for this use case: namely, capped collections (do a search for "mongodb capped collections" for more info),

and its fast increment operation for things like keeping track of page views, hits, etc.: http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics

Thanks, I've looked at MongoDB to hold my non-relational data. Maybe it's a good match for other things too. Is it bad practice to have a unique dimension value for every fact, such as the client IP address? As I see it, this makes it impossible to aggregate the data into lower-resolution tables. Or did I miss something?
jrydberg
If all you want to store is the client IP address, then you could store it as a degenerate dimension. It will still be ugly due to the high cardinality, but it can be done. You would want to avoid making it a separate dimension if you can because joining two extremely high cardinality tables is very tough on performance.
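As an illustration of the degenerate-dimension idea (all names here are hypothetical): the client IP is kept as a plain column on the fact table instead of being joined in from its own high-cardinality dimension table.

    -- Client IP kept directly on the fact table as a degenerate dimension
    CREATE TABLE fact_access_dd (
      date_id    INT UNSIGNED NOT NULL,      -- surrogate key into a date dimension
      time_id    INT UNSIGNED NOT NULL,      -- surrogate key into a time-of-day dimension
      server_id  INT UNSIGNED NOT NULL,
      url_id     INT UNSIGNED NOT NULL,
      client_ip  VARCHAR(45) NOT NULL,       -- no dim_client lookup table, no join
      bytes_sent BIGINT UNSIGNED NOT NULL
    );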
+1  A: 

Also, look into partitioning, especially if your queries mostly access the latest data; you could -- for example -- set up weekly partitions of ~5.5M rows.
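A sketch of what weekly range partitions could look like in MySQL 5.1+ (partition names, dates, and the table layout are placeholders):

    -- Hypothetical weekly partitions on the fact table
    CREATE TABLE fact_access_part (
      ts         DATETIME NOT NULL,
      client_id  INT UNSIGNED NOT NULL,
      server_id  INT UNSIGNED NOT NULL,
      url_id     INT UNSIGNED NOT NULL,
      bytes_sent BIGINT UNSIGNED NOT NULL,
      KEY k_ts (ts)
    )
    PARTITION BY RANGE (TO_DAYS(ts)) (
      PARTITION p2010w01 VALUES LESS THAN (TO_DAYS('2010-01-11')),
      PARTITION p2010w02 VALUES LESS THAN (TO_DAYS('2010-01-18')),
      PARTITION p2010w03 VALUES LESS THAN (TO_DAYS('2010-01-25')),
      PARTITION pmax     VALUES LESS THAN MAXVALUE
    );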

If aggregating per day and per hour, consider having date and time dimensions -- you did not list them, so I assume you do not use them. The idea is not to have any functions in a query, like HOUR(myTimestamp) or DATE(myTimestamp). The date dimension should be partitioned the same way as the fact tables.

With this in place, the query optimizer can use partition pruning, so the total size of the tables does not influence the query response the way it did before.
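A hedged sketch of what such a query might look like once date and time dimensions are in place (dim_date, dim_time, date_id, time_id, full_date and hour_of_day are assumed names, not anything from the original post): the per-row functions are replaced by pre-calculated dimension columns.

    -- Per-hour bytes for one week, with no functions applied per row:
    -- HOUR()/DATE() are replaced by pre-calculated dimension columns
    SELECT   d.full_date,
             t.hour_of_day,
             SUM(f.bytes_sent) AS bytes_sent
    FROM     fact_access f
    JOIN     dim_date d ON d.date_id = f.date_id
    JOIN     dim_time t ON t.time_id = f.time_id
    WHERE    d.full_date >= '2010-03-01'
      AND    d.full_date <  '2010-03-08'
    GROUP BY d.full_date, t.hour_of_day
    ORDER BY d.full_date, t.hour_of_day;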

Damir Sudarevic
Do I understand you correctly that no functions should be used in the query? Do they affect performance that much? Is it quicker to do a join against a time dimension?
jrydberg
Yes, that is correct -- keep in mind that a function has to be evaluated for every row of data. When set up properly, the WHERE clause contains only dimension-table fields, constants, and ` = < > <= >= AND `; if you have a function, then pre-calculate it in a dimension table.
Damir Sudarevic
Also, for the query optimizer to use partition pruning, only ` = < > <= >= BETWEEN ` is allowed. When the optimizer uses partition pruning, only the partitions containing the WHERE data are scanned and the others are ignored -- way faster.
Damir Sudarevic
A: 

This has gotten to be a fairly common data warehousing application. I've run one for years that supported 20-100 million rows a day with a 0.1-second response time from the database, just over a second from the web server. And this isn't even on a huge server.

Your data volumes aren't too large, so I wouldn't think you'd need very expensive hardware. But I'd still go multi-core, 64-bit with a lot of memory.

But you will mostly want to hit aggregate data rather than detail data - especially for time-series graphing over days, months, etc. Aggregate data can either be created periodically on your database through an asynchronous process, or, in cases like this, it typically works best if the ETL process that transforms your data also creates the aggregate data. Note that the aggregate is typically just a group-by of your fact table.
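For example, an hourly rollup can be little more than a GROUP BY over the fact table. This is a sketch with made-up table and column names; a real ETL job would normally only roll up the most recent hour or day rather than rebuilding everything.

    -- Hypothetical hourly aggregate built from the detail fact table
    CREATE TABLE agg_access_hourly (
      date_id     INT UNSIGNED NOT NULL,
      hour_of_day TINYINT UNSIGNED NOT NULL,
      server_id   INT UNSIGNED NOT NULL,
      url_id      INT UNSIGNED NOT NULL,
      hits        BIGINT UNSIGNED NOT NULL,
      bytes_sent  BIGINT UNSIGNED NOT NULL,
      PRIMARY KEY (date_id, hour_of_day, server_id, url_id)
    );

    INSERT INTO agg_access_hourly
    SELECT   f.date_id,
             t.hour_of_day,
             f.server_id,
             f.url_id,
             COUNT(*)          AS hits,
             SUM(f.bytes_sent) AS bytes_sent
    FROM     fact_access f
    JOIN     dim_time t ON t.time_id = f.time_id
    GROUP BY f.date_id, t.hour_of_day, f.server_id, f.url_id;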

As others have said - partitioning is a good idea when accessing detail data. But this is less critical for the aggregate data. Also, reliance on pre-created dimensional values is much better than on functions or stored procs. Both of these are typical data warehousing strategies.

Regarding the database - if it were me, I'd try PostgreSQL rather than MySQL. The reason is primarily optimizer maturity: PostgreSQL handles the kinds of queries you're likely to run better. MySQL is more likely to get confused on five-way joins, go bottom-up when you run a subselect, etc. And if this application is worth a lot, then I'd consider a commercial database like DB2, Oracle, or SQL Server. Then you'd get additional features like query parallelism, automatic query rewrite against aggregate tables, additional optimizer sophistication, etc.

KenFar