views: 31
answers: 2

Ok everyone, I have an excellent challenge for you. Here is the format of my data:

ID-1 COL-11 COL-12 ... COL-1P
...
ID-N COL-N1 COL-N2 ... COL-NP

ID is my primary key and index; I only ever query the database by ID. The data model is very simple.

My problem is as follows: I have 64 GB+ of data in the format above, and in a real-time application I need to query the database and retrieve the data instantly. I was thinking about two solutions, but I couldn't make either work.

First, use SQLite or MySQL: a single table with one index on the ID column. The problem is that the database will be too large to perform well, especially with SQLite.

Second, store everything in memory in a huge hash table. RAM is the limiting factor.

Do you have another suggestion? What about serializing everything to the filesystem and then, on each query, storing the retrieved data in a cache?

When I say real-time, I mean about 100-200 queries/second.

+2  A: 

A thorough answer would take the data access patterns into account. Since we don't know them, we have to assume that every row is equally likely to be accessed next.

I would first try a real RDBMS, either embedded or as a local server, and measure the performance. If that gives you 100-200 queries/sec, you're done.

Otherwise, if the format is simple, you could create a memory-mapped file and handle the reads yourself with a binary search on the sorted ID column. The OS manages pulling pages from disk into memory, so you get caching of frequently accessed pages for free.
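
Here is a minimal sketch of that approach in C, assuming fixed-size records sorted by a 64-bit ID. The file name data.bin, the Record layout, and the column count P are assumptions for illustration, not part of the question.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define P 16                      /* assumed number of columns per row */

    typedef struct {
        uint64_t id;                  /* primary key; the file is sorted on this */
        double   col[P];              /* fixed-size payload */
    } Record;

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s ID\n", argv[0]); return 1; }

        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);
        size_t nrec = st.st_size / sizeof(Record);

        /* Map the whole file; the OS pages it in on demand and keeps
         * frequently accessed pages cached. */
        Record *rec = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (rec == MAP_FAILED) { perror("mmap"); return 1; }

        uint64_t target = strtoull(argv[1], NULL, 10);

        /* Plain binary search on the sorted ID column. */
        size_t lo = 0, hi = nrec;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (rec[mid].id < target) lo = mid + 1; else hi = mid;
        }
        if (lo < nrec && rec[lo].id == target)
            printf("found id %llu, col[0]=%f\n",
                   (unsigned long long)rec[lo].id, rec[lo].col[0]);
        else
            printf("id %llu not found\n", (unsigned long long)target);

        munmap(rec, st.st_size);
        close(fd);
        return 0;
    }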

Cache use can be optimized further by creating a separate index and grouping the rows by access pattern, such that frequently read rows are grouped together (e.g. placed first) and rows that are often read in succession are placed next to each other. This ensures that each cache miss pulls in as much useful data as possible.
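
A rough sketch of what that separate index could look like, under the same fixed-record assumptions: index.bin holds (id, offset) pairs sorted by id, while data.bin stores the rows grouped by access pattern, so a lookup is still a binary search even though the data itself is no longer in ID order. Both file names and layouts are hypothetical.

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    typedef struct { uint64_t id; uint64_t offset; } IndexEntry;

    /* Map a whole file read-only; returns NULL on failure. */
    static void *map_file(const char *path, size_t *len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;
        struct stat st;
        fstat(fd, &st);
        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                    /* the mapping stays valid after close */
        if (p == MAP_FAILED) return NULL;
        *len = (size_t)st.st_size;
        return p;
    }

    /* Binary-search the index, then jump to the row's byte offset in the
     * access-pattern-ordered data file. */
    static const char *lookup(const IndexEntry *idx, size_t n,
                              const char *data, uint64_t id)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (idx[mid].id < id) lo = mid + 1; else hi = mid;
        }
        if (lo < n && idx[lo].id == id)
            return data + idx[lo].offset;   /* start of the record's bytes */
        return NULL;
    }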

mdma
A: 

Given the way the data is used, you should do the following:

  1. Create a fixed-size record structure large enough to hold one full row of data.
  2. Export the original data to a flat file that follows the format defined in step 1, ordering the data by ID (ascending).
  3. Access the file directly and leave caching to the OS. To get record number N (0-based), multiply N by the record size (in bytes) and read the record at that offset in the file (see the sketch after this list).
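
Here is a minimal sketch of step 3 in C. The Record layout, column count P, and the file name data.bin are assumptions for illustration; the essential parts are the offset arithmetic and the single positioned read.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define P 16                      /* assumed number of columns per row */

    typedef struct {
        uint64_t id;
        double   col[P];
    } Record;                         /* fixed size, as required by step 1 */

    /* Fetch record number n with a single positioned read; the OS page
     * cache takes care of repeated requests for the same region. */
    static int fetch(int fd, uint64_t n, Record *out)
    {
        off_t offset = (off_t)(n * sizeof(Record));
        return pread(fd, out, sizeof(Record), offset) == (ssize_t)sizeof(Record);
    }

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);   /* read-only, as advised below */
        if (fd < 0) { perror("open"); return 1; }

        Record r;
        if (fetch(fd, 42, &r))                 /* record number 42, 0-based */
            printf("id=%llu col[0]=%f\n", (unsigned long long)r.id, r.col[0]);

        close(fd);
        return 0;
    }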

Since you're in read-only mode and assuming the file is stored on random-access media, this scales very well and does not depend on the size of the data: each fetch is a single read from the file. You could try a fancy caching system on top, but I doubt it would gain you much unless you get many requests for the same row (and the OS you're using is caching poorly). Make sure you open the file in read-only mode, though, as this should help the OS figure out the optimal caching strategy.

Stephane