For our application, we keep large amounts of data indexed by three integer columns (source, type and time). Loading significant chunks of that data can take some time, and we have implemented various measures to reduce the amount of data that has to be searched and loaded for larger queries, such as storing coarser granularities for queries that don't need full time resolution.
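As a rough illustration of that kind of granularity rollup (a sketch only: the table and column names beyond `data` are hypothetical, and a SQLite-style schema is assumed):

    import sqlite3

    # Hypothetical sketch: maintain a coarser-granularity rollup of the raw
    # `data` table so low-resolution queries never have to scan raw rows.
    conn = sqlite3.connect("metrics.db")  # database name is an assumption

    conn.executescript("""
    CREATE TABLE IF NOT EXISTS data (
        source INT, type INT, timestamp INT, value DOUBLE
    );
    CREATE TABLE IF NOT EXISTS data_hourly (
        source INT, type INT, hour INT, avg_value DOUBLE
    );
    """)

    # Fold raw rows into hourly averages (3600-second buckets).
    conn.execute("""
    INSERT INTO data_hourly (source, type, hour, avg_value)
    SELECT source, type, timestamp / 3600 AS hour, AVG(value)
    FROM data
    GROUP BY source, type, hour
    """)
    conn.commit()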

When searching for data in our backup archives, where the data is stored in bzipped text files but has basically the same structure, I noticed that it is significantly faster to untar to stdout and pipe the output through grep than to untar to disk and then grep the files. In fact, untar-to-pipe was even noticeably faster than just grepping the already uncompressed files (i.e. discounting the time spent untarring to disk).

This made me wonder if the performance impact of disk I/O is actually much heavier than I thought. So here's my question:

Do you think putting the data of multiple rows into a (compressed) blob field of a single row and searching for the individual rows on the fly during extraction could be faster than looking up the same rows via the table index?

For example, instead of having this table

CREATE TABLE data ( `source` INT, `type` INT, `timestamp` INT, `value` DOUBLE);

I would have

CREATE TABLE quickdata ( `source` INT, `type` INT, `day` INT, `dayvalues` BLOB );

with approximately 100-300 rows in data for each row in quickdata, and the desired timestamps found on the fly during decompression and decoding of the blob field.
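To make the idea concrete, here is a minimal sketch of what packing a day's rows into a compressed blob and filtering by timestamp during decoding could look like. The encoding (fixed-size timestamp/value pairs, zlib compression) and the helper names are assumptions for illustration, not a description of any particular DBMS feature:

    import struct
    import zlib

    # Hypothetical encoding: each day's rows are packed as fixed-size
    # (timestamp, value) pairs and compressed into the dayvalues BLOB.
    RECORD = struct.Struct("<id")  # 4-byte int timestamp + 8-byte double value

    def pack_day(rows):
        """rows: iterable of (timestamp, value) pairs for one source/type/day."""
        raw = b"".join(RECORD.pack(ts, val) for ts, val in rows)
        return zlib.compress(raw)

    def scan_day(blob, ts_from, ts_to):
        """Decompress a dayvalues blob and yield rows with ts_from <= ts < ts_to."""
        raw = zlib.decompress(blob)
        for offset in range(0, len(raw), RECORD.size):
            ts, val = RECORD.unpack_from(raw, offset)
            if ts_from <= ts < ts_to:
                yield ts, val

Whether this wins presumably depends on how many of the 100-300 rows per blob a typical query actually needs: if most queries only touch a handful of timestamps, the per-row index may still come out ahead.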

Does this make sense to you? What parameters should I investigate? What strings might be attached? What DB features (any DBMS) exist to achieve similar effects?

+4  A: 

This made me wonder if the performance impact of disk I/O is actually much heavier than I thought.

Definitely. If you have to go to disk, the performance hit is many orders of magnitude greater than memory. This reminds me of the classic Jim Gray paper, Distributed Computing Economics:

Computing economics are changing. Today there is rough price parity between (1) one database access, (2) ten bytes of network traffic, (3) 100,000 instructions, (4) 10 bytes of disk storage, and (5) a megabyte of disk bandwidth. This has implications for how one structures Internet-scale distributed computing: one puts computing as close to the data as possible in order to avoid expensive network traffic.

The question, then, is how much data do you have and how much memory can you afford?

And if the database gets really big -- as in nobody could ever afford that much memory, even in 20 years -- you need clever distributed database systems like Google's BigTable or Hadoop.

Jeff Atwood
A: 

I made a similar discovery when working with a database from Python: the cost of going to disk is very, very high. It turned out to be much faster (nearly two orders of magnitude) to request one big chunk of data and iterate through it in Python than it was to issue seven narrower queries (one per day of the data in question).
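A rough sketch of that pattern, assuming a sqlite3-style connection and a hypothetical readings table with epoch timestamps (this is not the code from the original project):

    import sqlite3
    from collections import defaultdict

    conn = sqlite3.connect("readings.db")  # database and table names are assumptions

    def per_day_queries(days):
        """Seven narrow queries: one round trip per day (days = day-start epoch seconds)."""
        results = {}
        for day in days:
            results[day] = conn.execute(
                "SELECT timestamp, value FROM readings"
                " WHERE timestamp >= ? AND timestamp < ?",
                (day, day + 86400),
            ).fetchall()
        return results

    def one_big_query(days):
        """One broad query over the whole range, bucketed into days in Python afterwards."""
        lo, hi = min(days), max(days) + 86400
        results = defaultdict(list)
        for ts, value in conn.execute(
            "SELECT timestamp, value FROM readings"
            " WHERE timestamp >= ? AND timestamp < ?",
            (lo, hi),
        ):
            results[ts - ts % 86400].append((ts, value))
        return results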

The difference blew out even further when I was getting hourly data: 24 queries per day, times 7 days, is a lot of queries!

Matthew Schinckel