views: 153
answers: 3

Hello,

I'd like to design a JVM data structure (Java/Scala) that can represent and store the contents of arbitrary relational database tables. The data structure should be fast (not too GC-intensive, cache-friendly) and memory-efficient, so that larger tables can fit in RAM.

One memory-efficient solution is to store each column separately in a primitive array, but I'm worried about cache friendliness, because items in the same row are not stored together: accessing a row with N columns can incur up to N cache misses, no matter how narrow the columns are.
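To make the trade-off concrete, here is a minimal sketch of that column-oriented layout; the two-column schema (an int and a double) is a made-up example, not anything from the question:

```java
// Sketch of a column-oriented table: one primitive array per column.
// The (id, price) schema is a hypothetical example.
final class ColumnTable {
    final int[] id;       // column 0, 4 bytes per row
    final double[] price; // column 1, 8 bytes per row

    ColumnTable(int rows) {
        id = new int[rows];
        price = new double[rows];
    }

    // Writing or reading a full row touches every column array, so it
    // can cost up to one cache miss per column; scanning a single
    // column, by contrast, is sequential and very cache-friendly.
    void setRow(int row, int idVal, double priceVal) {
        id[row] = idVal;
        price[row] = priceVal;
    }
}
```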

Another solution is to store each row in an Object[] where each element holds one field and is cast to the correct type on retrieval. But this requires storing numeric types in their boxed form, so it's not very memory-efficient, and it's probably not very cache-efficient either.
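A sketch of that Object[]-per-row layout, again with a hypothetical two-column row, showing where the boxing and casting happen:

```java
// Sketch of the Object[]-per-row layout: every numeric field is boxed
// on store and cast back on read (hypothetical int + double row).
class BoxedRowDemo {
    static Object[] makeRow(int id, double price) {
        Object[] row = new Object[2];
        row[0] = id;     // autoboxed to Integer (object header + value)
        row[1] = price;  // autoboxed to Double
        return row;
    }

    static int getId(Object[] row) {
        return (Integer) row[0]; // cast + unbox on every access
    }

    static double getPrice(Object[] row) {
        return (Double) row[1];
    }
}
```

Each boxed field is a separate heap object, so the row's data is scattered across the heap rather than stored contiguously.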

Another solution is to lay out each row's data in a byte array, the same way real databases serialize their rows, using only as many bytes as necessary. This is cache-friendly and memory-efficient, but I'm concerned about the cost of serializing/deserializing on every access.
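A minimal sketch of that packed-byte-array approach, using java.nio.ByteBuffer for the encode/decode step; the int + double schema is a hypothetical example:

```java
import java.nio.ByteBuffer;

// Sketch of the packed-row layout: fields are encoded into one byte[]
// per row and decoded on every access (hypothetical int + double schema).
class PackedRowDemo {
    static final int ROW_SIZE = Integer.BYTES + Double.BYTES; // 12 bytes

    static byte[] encode(int id, double price) {
        byte[] row = new byte[ROW_SIZE];
        ByteBuffer.wrap(row).putInt(id).putDouble(price); // serialize
        return row;
    }

    static int decodeId(byte[] row) {
        return ByteBuffer.wrap(row).getInt(0);            // deserialize field 0
    }

    static double decodePrice(byte[] row) {
        return ByteBuffer.wrap(row).getDouble(Integer.BYTES); // field 1
    }
}
```

For fixed-width primitive fields the "serialization" is just an offset computation plus a load/store, so the per-access cost is small compared to boxing.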

What's the best way?

+1  A: 

What is the purpose of doing this? You are likely better off simply storing the data you retrieve from your database (as the objects you map it to) in some sort of caching layer like EhCache, OSCache, memcache, etc., rather than reinventing the wheel.

matt b
It's for a main memory database side-project.
Seun Osewa
+1  A: 

Why not use hsqldb or h2?

They both support an in-memory mode and are pure Java. They force you to go through SQL for access, but on the other hand you don't have to implement your own joins.

Both are open source, so you can also use them as a performance baseline and see whether rolling your own by-column or by-row data structure would be faster and worth the effort.

huynhjl
HSQLDB allocates about 80 bytes per row for a table with just one integer column (i.e. 4 bytes of actual data), according to http://hsqldb.org/doc/2.0/guide/deployment-chapt.html#deployment_mem_disk-sect
Seun Osewa
+1  A: 

A fourth solution would be to store each row's data as strings instead of byte arrays. This may avoid serialization costs in most cases, provided that most of the data consists of strings.

This will also be easier to debug and will be platform-independent. Of course it has some limitations: e.g. a float cannot be represented as-is, but it could be stored in something similar to a SQL DECIMAL format.

Any solution will be a trade-off.

EDIT However, I would prefer the byte array solution for your case: one byte array per row. This should be the most cache-friendly option for fixed-size rows, but you would also need a scheme for variable-sized rows. A low-level language seems to fit that task better; in C one could define two formats: a fixed-size row format where the table metadata contains the column offsets (e.g. column 1: bytes 0..31, column 2: bytes 32..127, etc.), and a variable-size row format where each row itself contains the column sizes (e.g. bytes 0..3 contain the size, the following that many bytes contain the data, then another 4 bytes contain the next size, followed by its data, and so on).
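The fixed-size format works just as well on the JVM; here is a rough sketch where the offsets and the two-column int/double schema are made-up examples:

```java
import java.nio.ByteBuffer;

// Sketch of the fixed-size row format: table metadata stores each
// column's byte offset, and every row is a byte[] of the same length
// (the int/double schema and offsets are hypothetical examples).
class FixedRowTable {
    static final int[] COLUMN_OFFSET = {0, 4}; // column 0 at byte 0, column 1 at byte 4
    static final int ROW_SIZE = 12;            // 4-byte int + 8-byte double

    static byte[] writeRow(int c0, double c1) {
        byte[] row = new byte[ROW_SIZE];
        ByteBuffer buf = ByteBuffer.wrap(row);
        buf.putInt(COLUMN_OFFSET[0], c0);
        buf.putDouble(COLUMN_OFFSET[1], c1);
        return row;
    }

    static int readIntColumn(byte[] row, int column) {
        // Fixed offsets mean a column read is a single indexed load.
        return ByteBuffer.wrap(row).getInt(COLUMN_OFFSET[column]);
    }
}
```

Since the offsets live in the metadata, reading a column is a single indexed access with no per-row parsing, which is what makes the fixed-size variant cheap.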

frunsi