views:

525

answers:

3

I am currently researching on a backend to use for a project with demanding data aggregation requirements. The main project requirements are the following.

  1. Store millions of records for each user. Users might have more than 1 million entries per year so even with 100 users we are talking about 100 million entries per year.

  2. Data aggregation on those entries must be performed on the fly. The users need to be able to filter on the entries by a ton of available filters and then present summaries (totals , averages e.t.c) and graphs on the results. Obviously I cannot precalculate any of the aggregation results because the filter combinations (and thus the result sets) are huge.

  3. Users are going to have access on their own data only but it would be nice if anonymous stats could be calculated for all the data.

  4. The data is going to be most of the time in batch. e.g the user will upload the data every day and it could like 3000 records. In some later version there could be automated programs that upload every few minutes in smaller batches of 100 items for example.

I made a simple test of creating a table with 1 million rows and performing a simple sum of 1 column both in mongodb and in mysql and the performance difference was huge. I do not remember the exact numbers but it was something like mysql = 200ms , mongodb = 20 sec.

I have also made the test with couchdb and had much worse results.

What seems promising speed wise is cassandra which I was very enthusiastic about when I first discovered it. However the documentation is scarce and I haven't found any solid examples on how to perform sums and other aggregate functions on the data. Is that possible ?

As it seems from my test (Maybe I have done something wrong) with the current performance its impossible to use mongodb for such a project although the automated sharding functionality seems like a perfect fit for it.

Does anybody have experience with data aggregation in mongodb or have any insights that might be of help for the implementation of the project ?

Thanks, Dimitris

+2  A: 

If you're looking for a very high performance DBMS and don't need it to be relational, you might consider Cassandra - although its advantages only come into play if you have a database cluster instead of a single node.

You didn't say what limits there are on the physical architecture. You did mention sharding which implies a cluster. IIRC MySQL clusters support sharding too.

It'd also be very useful to know what level of concurrency the system is intended to support, and how data would be added (drip-feed or batch).

You say "Obviously I cannot precalculate any of the aggregation results because the filter combinations (and thus the result sets) are huge."

This is your biggest problem, and will be the most important factor in determining the performance of your system. Sure, you can't maintain materialized views of every possible combination, but your biggest performance win is going to be maintaining limited pre-aggregated views and building an optimizer that can find the nearest match. It's not all that hard.

C.

symcbean
Hi thanks for the response.I have updated my original post with some more info. Your last comment about pre-aggregated views and a nearest match optimizer is what I have been thinking about doing when researching the project. My problem is that I have no idea on how to go about it and where to start. Every record will have many flags that will be used as filters plus the user will be able to filter by arbitrary date intervals. Then on the result set I would have to calculate o ton of averages sums and counts. How many times this flag was true e.t.cAny more pointers appreciated
Dimitris Stefanidis
+2  A: 

I've never been impressed by the performance of MongoDB in use cases where javascript is required, for instance map-reduce-jobs. Maybe it is better in 1.51. I didn't try.

You could also try the free single node edition of Greenplum: http://www.greenplum.com/products/single-node/ and http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/

TTT
Thanks. The greenplum link was very helpful since I discovered there are many dbs out there like greenplum that are optimized for what I need to do. Massive data analytics. Another one is infobright that plugs in to mysql and seems promising.
Dimitris Stefanidis
+1  A: 

Or maybe hadoop ( http://hadoop.apache.org/ ) or hadoopdb ( http://db.cs.yale.edu/hadoopdb/hadoopdb.html ) ?

TTT