Given a data stream of continuously arriving items containing a timestamp and text (e.g. a search engine's query log), how would you store the data so that you could efficiently retrieve totals over time to plot trend lines per term?

A row-oriented database with tuples like (term, date, count) would work, but would not scale to a large number of distinct terms. What alternative data structures should be considered in this context (e.g. a column-oriented store)? Fast inserts are an important requirement.
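For concreteness, here is a minimal sketch of the row-oriented baseline described above, using Python's sqlite3 purely for illustration (the schema and all names are assumptions, and the UPSERT syntax needs SQLite 3.24+):

    import sqlite3

    # Hypothetical row-oriented baseline: one (term, date, count) row
    # per term per day.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE term_counts (
            term  TEXT    NOT NULL,
            date  TEXT    NOT NULL,   -- ISO date, e.g. '2009-11-28'
            count INTEGER NOT NULL,
            PRIMARY KEY (term, date)
        )
    """)

    def record(term, date):
        # One statement per arriving item: insert the day's first
        # count, or bump the existing counter.
        conn.execute(
            "INSERT INTO term_counts VALUES (?, ?, 1) "
            "ON CONFLICT (term, date) DO UPDATE SET count = count + 1",
            (term, date),
        )

    def trend(term):
        # Totals over time for one term, ready to plot as a trend line.
        return conn.execute(
            "SELECT date, count FROM term_counts WHERE term = ? ORDER BY date",
            (term,),
        ).fetchall()

    record("foo", "2009-11-27")
    record("foo", "2009-11-27")
    record("foo", "2009-11-28")
    print(trend("foo"))   # [('2009-11-27', 2), ('2009-11-28', 1)]

The concern above is that this executes one indexed write per arriving item, which is exactly where insert throughput suffers.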

+2  A: 

You're wrong in your assertion that column-oriented DBMSs are more efficient than row-oriented ones; for this workload it's quite the opposite. The performance of single-row inserts into a column-oriented DBMS would be abysmal in your scenario: they are optimized for read-only querying, not for insert performance, and definitely not for single-row inserts.

How fast is 'fast'? Hundreds of writes per second is definitely not a big deal, provided that there's sufficient I/O (fast hard drives) available. Is the overall data small enough to fit into RAM? Normal RDBMSs are still the safest bet, but nowadays there are also in-memory engines available which greatly outperform the traditional disk-based ones.

For aggregation and subsequent reporting, you could use either summary tables or a common built-in feature called materialized views.
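Where the engine lacks materialized views, a summary table can stand in for one. A minimal sketch in Python's sqlite3 (SQLite has no materialized views, so the summary is refreshed by hand; all table and column names are hypothetical):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE queries (ts TEXT, term TEXT);   -- raw stream rows
        CREATE TABLE daily_totals (                  -- hand-rolled "materialized view"
            term  TEXT,
            day   TEXT,
            total INTEGER,
            PRIMARY KEY (term, day)
        );
    """)

    def refresh_daily_totals():
        # Recompute the summary from the raw rows. A real materialized
        # view would let the engine maintain this, often incrementally.
        conn.executescript("""
            DELETE FROM daily_totals;
            INSERT INTO daily_totals
            SELECT term, date(ts) AS day, COUNT(*)
            FROM queries
            GROUP BY term, date(ts);
        """)

Trend queries then read the small daily_totals table instead of scanning the raw stream.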

Andrew from NZSG
Thanks for the comment on the column-oriented store. Regarding your question, the data volume is very high, probably hundreds of entries per second. It is higher than the disk write speed, thus items are being discarded. I am looking for a scalable solution that can handle a high-volume data stream for a long period of time.
ssn
Hundreds of writes per second is absolutely no big deal for a proper database engine such as Oracle, SQL Server or MySQL. You just need to get an entry-level RAID hard disk array (~10K USD).
Andrew from NZSG
+1  A: 

This may not be immediately helpful (because these technologies are not readily available yet), but here is an interesting podcast about stream-oriented databases. The speaker (Michael Stonebraker) is trying to sell his product, of course, but it is still well worth the listen, especially since Stonebraker is one of the founding fathers of the RDBMS. His main point seems to be that traditional disk-based architectures are an order of magnitude (or more) too slow for what he needs to do, with (redundant) in-memory solutions being the way to go here.

Also, Hadoop is supposed to be great for batch processing of huge log files. I do not think this would give you real-time data, though.

Thilo
+1  A: 

Since the OP says (in a comment) that "the data volume is very high, probably hundreds of entries per second. It is higher than the disk write speed," it sounds like the data is being aggregated from a number of servers. My suggestion would be to keep the storage task distributed to the individual servers.

What front-end web servers are you using? Apache has a module for logging to a database. Or use logrotate and pick up the files on a regular basis.

Aggregate using Hadoop, or probably better, Pig, when you want to look at and analyze the data (a sketch of this file-based aggregation follows the links below). Don't try to turn it into one giant firehose of data unless you really, really need to.

pig: http://hadoop.apache.org/pig/

pig training video: http://www.cloudera.com/hadoop-training-pig-introduction
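As a rough illustration of the pick-up-the-rotated-files idea, here is a minimal Python sketch of the batch aggregation a Pig script would express as a GROUP/COUNT; the log path and the timestamp-TAB-term line format are assumptions:

    import glob
    from collections import Counter

    # Assumed log format: one query per line, "timestamp<TAB>term".
    # Aggregate rotated log files into per-term daily totals, the same
    # GROUP BY (term, day) a Pig script would express.
    totals = Counter()
    for path in glob.glob("/var/log/queries/query.log.*"):  # rotated files
        with open(path) as f:
            for line in f:
                ts, term = line.rstrip("\n").split("\t", 1)
                day = ts[:10]          # 'YYYY-MM-DD' prefix of the timestamp
                totals[(term, day)] += 1

    for (term, day), n in sorted(totals.items()):
        print(term, day, n)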

Larry K
Aggregation with Hadoop would not allow online selection of totals per term. Also, I do have a single giant firehose of data to process. Data arrives from a central broker that aggregates from hundreds of servers.
ssn
+1  A: 

A few thoughts:

If it's true that the amount of data exceeds your disk write speed, then you will have to either increase your disk write speed (e.g. RAID, faster disks, RAM disks) or distribute the load across many servers. And if scalability is your main concern, then distribution is the key. Unfortunately, I cannot provide more wisdom on the issue than that (Larry K has some links that may help).

I can get 30 MB/sec sustained writes to a 2.5" 7200 rpm drive without trying very hard, so I'd suspect you'd need a lot more search engine queries than "hundreds per second" to exceed that. In any case, most relational databases don't do very well with lots of single-row writes. Here are some alternatives:

  1. Investigate whether your DBMS supports some sort of batching or bulk insert option (SQL Server's bulk copy classes dramatically improve insert performance). Buffer some items into one batch and write them in the background (a minimal sketch follows this list).

  2. Remove indexes and foreign keys from your table; these slow down inserts.

  3. Minimise the amount of data you need to write. Maybe have one table per half-hour of the day; then you won't need to save the timestamp (if your aggregation only needs half-hour resolution). Compress the search string (gzip, or even just encoding it as UTF-8, might help). See if some tricky bit-mashing lets you store more data in a smaller space.

  4. Ditch the DBMS altogether. Open a file exclusively and append fixed-length records. Rotate the file every half hour. Then get some other process (or even another server) to read these files and aggregate them as required. All DBMSs lose some performance over plain files because of type checking, parsing, transactions, etc. And if performance is your top priority, then you'll have to do without all the bells and whistles DBMSs provide (see the second sketch after this list).
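A minimal sketch of point 1, again using Python's sqlite3 only for illustration: rows are buffered in memory and flushed with executemany as one transaction. Batch size, schema, and all names are assumptions.

    import sqlite3

    BATCH_SIZE = 1000   # illustrative; tune against your write throughput

    conn = sqlite3.connect("queries.db")
    conn.execute("CREATE TABLE IF NOT EXISTS queries (ts TEXT, term TEXT)")

    pending = []

    def record(ts, term):
        # Buffer single rows and flush them as one multi-row
        # transaction: one commit per batch instead of one per row.
        pending.append((ts, term))
        if len(pending) >= BATCH_SIZE:
            flush()

    def flush():
        conn.executemany("INSERT INTO queries VALUES (?, ?)", pending)
        conn.commit()
        pending.clear()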
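And a sketch of point 4: fixed-length records appended to a plain file, rotated every half hour. The 64-byte record layout is an arbitrary assumption.

    import struct
    import time

    # Fixed-length record: 4-byte little-endian epoch seconds plus a
    # 60-byte null-padded term. An arbitrary layout for illustration.
    RECORD = struct.Struct("<I60s")

    _current = None   # (half-hour bucket, open file)

    def append(term):
        global _current
        now = int(time.time())
        bucket = now // 1800          # new file every 30 minutes
        if _current is None or _current[0] != bucket:
            if _current is not None:
                _current[1].close()
            _current = (bucket, open("queries.%d.dat" % bucket, "ab"))
        # No parsing, type checking, or transactions on the hot path;
        # a separate process reads finished files and aggregates them.
        _current[1].write(RECORD.pack(now, term.encode("utf-8")[:60]))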

Regarding your point 2: a table without indexes would slow down retrieval. I need to SELECT all totals per term to plot the trend line. Using a file as a buffer to group inserts seems like a good idea.
ssn
That's the price you pay for a DBMS: indexes make selects faster at the cost of slower inserts. Sounds like flat files are a good place to start.