views: 514
answers: 8

I need to store some data that follows the simple pattern of mapping an "id" to a full table (multiple rows of several columns, i.e. some integer values [u, v, w]). One of these tables would be a couple of KB in size. Basically what I need is to store a persistent cache of some intermediate results.

This could quite easily be implemented as simple SQL, but there are a couple of problems. Namely, I need to compress this structure on disk as much as possible (because of the sheer number of values I'm storing). It's also not transactional: I just write once and then read the contents of the entire table, so a relational DB isn't actually a very good fit.

I was wondering if anyone had any good suggestions? For some reason I can't seem to come up with anything decent at the moment. Something with a Java API would be especially nice.

+1  A: 

Have you looked at Berkeley DB? That sounds like it may fit the bill.


Edit:

I forgot to add that you can gzip the values themselves before you store them, and just unzip them when you retrieve them.
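For illustration, here is a rough sketch (plain JDK, nothing Berkeley DB specific) of packing a table of [u, v, w] rows into a gzip-compressed byte array that you could store as the value, and unpacking it again; the class and method names are just examples:

import java.io.*;
import java.util.*;
import java.util.zip.*;

public class TableCodec {

    // Compress a table of [u, v, w] rows into a byte array suitable for storing as a value.
    static byte[] pack(List<int[]> rows) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes));
        out.writeInt(rows.size());
        for (int[] row : rows) {
            out.writeInt(row[0]);
            out.writeInt(row[1]);
            out.writeInt(row[2]);
        }
        out.close(); // finishes the gzip stream
        return bytes.toByteArray();
    }

    // The inverse: read the row count, then each [u, v, w] triple.
    static List<int[]> unpack(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)));
        int n = in.readInt();
        List<int[]> rows = new ArrayList<int[]>(n);
        for (int i = 0; i < n; i++) {
            rows.add(new int[] { in.readInt(), in.readInt(), in.readInt() });
        }
        in.close();
        return rows;
    }
}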

grieve
+2  A: 

This sounds like a job for.... new ObjectOutputStream(new FileOutputStream(STORAGE_DIR + "/" + key + ".dat")) !!

Seriously though - the simplest method is to just create a file for each data table that you want to store, serialize the data into it, and look it up using the key as the filename when you want to read it.

On a decent file system, writes can be made atomic (by writing to a temp file and then renaming it over the target); read/write speed is measured in tens of MBit/second; lookups can be made very efficient by creating a simple directory tree like STORAGE_DIR + "/" + key.substring(0,2) + "/" + key.substring(0,4) + "/" + key, which should still be efficient with millions of entries and even more so if your file system uses indexed directories; and lastly it's trivial to implement a memory-backed LRU cache on top of this for even faster retrieval.
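Something along those lines, as a rough sketch only (the class name and layout are made up for the example, it assumes keys of at least four characters, and it uses Java 7's java.nio.file for the temp-file-then-rename trick):

import java.io.IOException;
import java.nio.file.*;

public class FileStore {

    private final Path root;

    public FileStore(Path root) {
        this.root = root;
    }

    // Shard keys into a shallow directory tree so no single directory grows too large.
    private Path pathFor(String key) {
        return root.resolve(key.substring(0, 2))
                   .resolve(key.substring(0, 4))
                   .resolve(key + ".dat");
    }

    // Write to a temp file in the same directory, then rename it over the target;
    // on most local file systems the rename is atomic, so readers never see a
    // half-written file.
    public void store(String key, byte[] data) throws IOException {
        Path target = pathFor(key);
        Files.createDirectories(target.getParent());
        Path tmp = Files.createTempFile(target.getParent(), key, ".tmp");
        Files.write(tmp, data);
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public byte[] load(String key) throws IOException {
        return Files.readAllBytes(pathFor(key));
    }
}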

Regarding compression - you can use Jakarta's commons-compress to apply gzip or even bzip2 compression to the data before you store it. But this is an optimization problem, and depending on your application and available disk space you may be better off investing the CPU cycles elsewhere.

Here is a sample implementation that I made: http://geek.co.il/articles/geek-storage.zip. It uses a simple interface (which is far from clean - it's just a demonstration of the concept) that offers methods for storing and retrieving objects from a cache with a set maximum size. A cache miss is handed off to a user-supplied implementation for handling, and the cache periodically checks that it doesn't exceed its storage limit and removes old data.

I also included a MySQL-backed implementation for completeness, and a benchmark to compare the disk-based and MySQL-based implementations. On my home machine (an old Athlon 64) the disk version scores better than twice as fast as the MySQL implementation in the enclosed benchmark (9.01 seconds vs. 18.17 seconds). Even though the DB implementation could probably be tweaked for slightly better performance, I believe it demonstrates the problem well enough.

Feel free to use this as you see fit.

Guss
On the one hand I totally agree; on the other hand this is loads of small objects (so more overhead), and I'll end up doing file management manually. (I want to use it as sort of an L2 cache - there's not enough disk space to store it all.)
wds
Yes, if you can't store all your data then this might not be the best solution. You'd need to implement an LRU in persistent storage which is difficult. I'll think about it.
Guss
I'm gonna go ahead and accept this. Haven't quite implemented it yet, but it'll be a hybrid between this and some other stuff. BTW, writing to a temp file and renaming it is not such a good idea for atomic writes (in unix).
wds
Regarding the LRU stuff, I wrote something up which may be of interest. Regarding the atomic write - rename is guaranteed to be atomic (only one of multiple concurrent writers will prevail); the only problem is "renaming" across file systems, which is really just writing a new file and thus not atomic.
Guss
I added a demo implementation. Please take a look and tell me what you think.
Guss
That's pretty cool, and it seems to work okay. I sort of refactored away from this approach and put in more computing power instead, but it remains a nice proof of concept. Regarding temp-file-rename: a write-rename-crash sequence might still result in a 0-length file.
wds
Thanks. I'm not sure how a crash could result in a 0-length file. If you are renaming on the same drive, then either the "update directory" operation succeeded or it didn't - no bytes are moved, so there is no 0-sized file created at any point.
Guss
+2  A: 

I'd use EHCache - it's used by Hibernate and other JEE libraries, and it's really simple and efficient:

To add a table:

List<List<Integer>> myTable = new ArrayList<List<Integer>>();
cache.put(new Element("myId", myTable));

To read:

List<List<Integer>> myTable = (List<List<Integer>>) cache.get("myId").getObjectValue();
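For context, a minimal sketch of how the cache itself might be created, assuming the Ehcache 2.x API (the cache name and sizes are just example values):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;

// Constructor arguments: name, maxElementsInMemory, overflowToDisk, eternal,
// timeToLiveSeconds, timeToIdleSeconds.
CacheManager manager = CacheManager.create();
Cache cache = new Cache("tables", 1000, true, true, 0, 0);
manager.addCache(cache);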
Abdullah Jibaly
A: 

Apache Derby might be a good fit if you want something embedded (not a separate server).

There is a list of other options at Lightweight Data Bases in Java
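If it helps, here is a minimal sketch of opening an embedded Derby database over JDBC (the database name and table are just examples):

import java.sql.*;

// ";create=true" creates the database directory on first use.
Connection conn = DriverManager.getConnection("jdbc:derby:cacheDB;create=true");
Statement st = conn.createStatement();
st.executeUpdate("CREATE TABLE results (id INT, u INT, v INT, w INT)");
st.close();
conn.close();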

CoverosGene
A: 

It seems that key=>value databases are what you are looking for.

Maybe SuperCSV is the best framework for you!

If you don't want to use a relational database, you can use JAXB to store your objects as XML files!

There are also other libraries for this, such as XStream.

If you prefer XML, then use JAXB or XStream. Otherwise have a look at CSV libraries such as SuperCSV. People who can live with serialized Java files can use the default persistence mechanism like Guss said. Direct Java persistence may be the fastest way.

Martin K.
A: 

You can use JOAFIP http://joafip.sourceforge.net/ It lets you put your whole data model in a file and access and update it without reloading everything into memory.

luc peuvrier
A: 

If each table is only a couple of KB, I don't understand why you need to "compress the size of this structure on disk as much as possible". Given that 181 MB of disk space costs 1 cent, I would suggest that anything less than this isn't worth spending too much time worrying about.

However, to answer your question, you can compress the file as you write it. As well as ObjectOutputStream, you can use XMLEncoder to serialize your map. Compressed, this will be more compact than just using ObjectOutputStream, and if you decompress the file you will be able to read or edit the data.

XMLEncoder xe = new XMLEncoder(
    new GZIPOutputStream(
        new FileOutputStream(filename+".xml.gz")));
xe.writeObject(map);
xe.close();
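Reading it back is symmetrical (again using only plain JDK classes):

XMLDecoder xd = new XMLDecoder(
    new GZIPInputStream(
        new FileInputStream(filename+".xml.gz")));
Map map = (Map) xd.readObject();
xd.close();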
Peter Lawrey
Ehm, the grammar of your first paragraph makes little sense? It's a couple of TB in total; I just want to keep a couple of gigs of that around (access patterns are sort of predictable).
wds
Thanks for pointing out that the grammar was rubbish. If you need to store a few GB, you must have a lot of tables (if each table is about a couple of KB). Perhaps you could simplify the schema so you don't have 100,000s of tables.
Peter Lawrey
That's just a problem with the way I phrased it I guess. The data is fully normalized so there's no redundancy involved. I just noticed I forgot to mention it does add up to a lot in total.
wds
A: 

Well, this is not Java, but it's a quick solution (< 10 minutes): the y_serial.py module :: warehouse Python objects with SQLite

"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."

http://yserial.sourceforge.net

Compression is seamlessly integrated into y_serial, which is nice because it dramatically reduces the size of the BLOBs.

(Kindly tell us if the module works under Jython -- thanks ;-)

code43