views:

530

answers:

3

I have a large amount of data I need to store, and be able to generate reports on - each one representing an event on a website (we're talking over 50 per second, so clearly older data will need to be aggregated).

I'm evaluating approaches to implementing this, obviously it needs to be reliable, and should be as easy to scale as possible. It should also be possible to generate reports from the data in a flexible and efficient way.

I'm hoping that some SOers have experience of such software and can make a recommendation, and/or point out the pitfalls.

Ideally I'd like to deploy this on EC2.

+2  A: 

Wow. You are opening up a huge topic.

A few things right off the top of my head...

  1. think carefully about your schema for inserts in the transactional part and reads in the reporting part, you may be best off keeping them separate if you have really large data volumes
  2. look carefully at the latency that you can tolerate between real-time reporting on your transactions and aggregated reporting on your historical data. Maybe you should have a process which runs periodically and aggregates your transactions.
  3. look carefully at any requirement which sees you reporting across your transactional and aggregated data, either in the same report or as a drill-down from one to the other
  4. prototype with some meaningful queries and some realistic data volumes
  5. get yourself a real production quality, enterprise ready database, i.e. Oracle / MSSQL
  6. think about using someone else's code/product for the reporting e.g. Crystal/BO / Cognos

as I say, huge topic. As I think of more I'll continue adding to my list.

HTH and good luck

Simon
Thanks! I've heard others recommend both Oracle and Microsoft's solutions. Am I to infer that open source databases aren't up to the task?
sanity
I think you're to infer that there are a lot more out-of-the-box solutions for what your looking at that are optimized for Oracle/MSSQL.
bouvard
If I say that the open source databases are not up to it then I'll get swamped under a tide of vote-downs. The reality (right or wrong) is that MSSQL and Oracle in particular are very stable proven database platforms and are generally preferred when your life is on the line.
Simon
+1  A: 

@Simon made a lot of excellent points, I'll just add a few and re-iterate/emphasize some others:

  1. Use the right datatype for the Timestamps - make sure the DBMS has the appropriate precision.
  2. Consider queueing for the capture of events, allowing for multiple threads/processes to handle the actual storage of the events.
  3. Separate the schemas for your transactional and data warehouse
  4. Seriously consider a periodic ETL from transactional db to the data warehouse.
  5. Remember that you probably won't have 50 transactions/second 24x7x365 - peak transactions vs. average transactions
  6. Investigate partitioning tables in the DBMS. Oracle and MSSQL will both partition on a value (like date/time).
  7. Have an archiving/data retention policy from the outset. Too many projects just start recording data with no plans in place to remove/archive it.
Ken Gentle
A: 

Wow.. This is a huge topic.

Let me begin with databases. First get something good if you are going to have crazy amounts to data. I like Oracle and Teradata.

Second, there is a definitive difference between recording transactional data and reporting/analytics. Put your transactional data in one area and then roll it up on a regular schedule into a reporting area (schema).

I believe you can approach this two ways

  • Throw money at the problem: Buy best in class software (databases, reporting software) and hire a few slick tech people to help

  • Take the homegrown approach: Build only what you need right now and grow the whole thing organically. Start with a simple database and build a web reporting framework. There are a lot of descent open-source tools and inexpensive agencies that do this work.

As far as the EC2 approach.. I'm not sure how this would fit into a data storage strategy. The processing is limited which is where EC2 is strong. Your primary goal is effecient storage and retreival.

AFHood