ansaurus

Question

How to deal with large data sets for analytics, and varying numbers of columns?

Answer 1

A:

This is a case where you want to store the data once and read it over and over. Further, I think you would want the queries to be precomputed rather than recalculated on every request.

My suggestion for you is to store your data in CouchDB for the following reasons:

Its documents are schemaless, so each record can carry a different set of fields
Its queries (views) are precomputed and stored on disk
Its support for map-reduce allows your queries to handle group by
It has a REST access model, which lets you connect from pretty much anything that can make HTTP requests

You may find this suggestion a little out there considering how new CouchDB is. However, I'd suggest reading about it, because personally I think running a CouchDB database is sweet and lightweight, more lightweight than MySQL.
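To illustrate the map-reduce point above, here is a minimal in-memory sketch in Python of how a map step and a reduce step combine into a group by. It does not use CouchDB or its API (CouchDB views are normally written in JavaScript); the documents, field names, and function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical documents with varying fields, as a schemaless store allows.
docs = [
    {"page": "/home", "visits": 3, "referrer": "google"},
    {"page": "/home", "visits": 2},
    {"page": "/about", "visits": 1, "campaign": "spring"},
]


def map_doc(doc):
    """Emit (key, value) pairs for one document, like a view's map step."""
    if "page" in doc and "visits" in doc:
        yield doc["page"], doc["visits"]


def reduce_values(values):
    """Combine all values emitted for one key, like a view's reduce step."""
    return sum(values)


def run_view(documents):
    """Group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            groups[key].append(value)
    return {key: reduce_values(values) for key, values in groups.items()}


totals = run_view(docs)
```

Documents that lack the fields the map step looks for are simply skipped, which is how a varying set of columns stops being a problem for the query.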

Am 2010-09-01 15:46:25

CouchDB looks very interesting for this purpose, particularly the way that views are stored on disk!

David Caunt 2010-09-01 16:45:32

Answer 2

A:

Keeping it in MySQL: if the number of writes is limited and reads are more common, and the data is relatively simple (i.e. you can predict the possible characters), you could try a text/blob column in the main table, which is updated with comma-separated values or key/value pairs by an AFTER INSERT / UPDATE trigger on the join table. You keep the actual data in a separate table, so searching for MAX values or specific 'extra' attributes can still be done relatively fast, but retrieving the complete dataset for one of your 'views' means reading a single row from the main table, which you can split into the separate values in the script / application you're using, relieving much of the stress on the database itself.
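As a rough sketch of the application-side split, assuming the cache column holds 'key=value' pairs joined by ';' and that ';' never occurs inside the data, something like this Python would do. The function name and the sample attributes are hypothetical.

```python
def parse_cache(cache):
    """Split a 'key=value;key=value' string into a dict.

    Assumes ';' never appears in keys or values and '=' never appears in keys.
    """
    result = {}
    for pair in cache.split(";"):
        if not pair:
            continue  # skip empty segments, e.g. from an empty cache column
        key, _, value = pair.partition("=")
        result[key] = value
    return result


attrs = parse_cache("browser=firefox;country=NL;duration=42")
```

All values come back as strings, so any numeric attribute still needs converting in the application.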

The downside of this is a tremendous increase in the cost of updates / inserts in the join table: every alteration of data would require a query on all related data for a record, and a second write to the 'normal' table, something like:

-- Rebuild the cached key=value string for one main_table row.
UPDATE main_table
SET cache = (
    SELECT GROUP_CONCAT(CONCAT(join_table.`key`, '=', join_table.`value`) SEPARATOR ';')
    FROM join_table
    WHERE join_table.main_id = main_table.id
)
WHERE main_table.id = 'foo';

However, analytics data usually trails somewhat, so possibly not every update has to trigger a cache update; a daily cron script filling the cache with yesterday's data could do.
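If the cache is rebuilt in a batch job rather than by a trigger, the grouping could also be done in the script itself. The following Python is only a sketch of that idea: it assumes the job has already fetched (main_id, key, value) rows from the join table, and the function name and sample rows are invented for the example.

```python
from collections import defaultdict


def build_cache(rows):
    """Return {main_id: 'key=value;key=value'} from (main_id, key, value) rows."""
    pairs = defaultdict(list)
    for main_id, key, value in rows:
        pairs[main_id].append(f"{key}={value}")
    return {main_id: ";".join(items) for main_id, items in pairs.items()}


# Hypothetical rows, as a daily job might read them from the join table.
rows = [
    ("foo", "browser", "firefox"),
    ("foo", "country", "NL"),
    ("bar", "browser", "chrome"),
]
cache = build_cache(rows)
```

Each resulting string would then be written to the cache column of the matching main table row in one pass.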

Wrikken 2010-09-01 15:52:37

Sorry, I should have made my question clearer. The system will be write heavy, with potentially millions of rows each day.

David Caunt 2010-09-01 16:17:17

The question is: is a record updated, or static for the day? Also, analytics systems most commonly work by parsing logs once every X amount of time (in almost any case, just logging to a file is far faster than any database, SQL or NoSQL), not 'live'.

Wrikken 2010-09-01 20:50:22
