I'm researching possible options for organizing data storage for an Erlang application. The data it is supposed to use is basically a huge collection of binary blobs indexed by short string ids. Each blob is under 10 KB, but there are many of them; in total I'd expect them to reach up to 200 GB, so obviously the data set cannot fit into memory. The typical operation on this data is reading a blob by its id, updating a blob by its id, or adding a new one. At any given time of day only a subset of ids is in use, so data access performance might benefit from an in-memory cache. Speaking of performance: it is quite critical. The target is around 500 reads and 500 updates per second on commodity hardware (say, on an EC2 VM).

Any suggestions what to use here? As I understand it, dets is out of the question, as it is limited to 2 GB (or was it 4 GB?). Mnesia is probably out of the question too; my impression is that it was mainly designed for cases where the data fits in memory. I'm considering trying EDTK's Berkeley DB driver for the task. Would it work in the above scenario? Does anybody have experience using it in production under similar conditions?
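To make the caching idea concrete, here is roughly the kind of read-through ETS cache I have in mind on top of whichever disk store gets picked (a sketch only; blob_store is a placeholder module, and eviction is left out):

    %% Sketch: ETS read-through cache in front of a disk-backed store.
    %% blob_store is a placeholder module; eviction is intentionally omitted.
    -module(blob_cache).
    -export([init/0, read/1, write/2]).

    init() ->
        ets:new(blob_cache, [named_table, set, public]).

    read(Id) ->
        case ets:lookup(blob_cache, Id) of
            [{_Id, Blob}] ->
                {ok, Blob};                           %% cache hit
            [] ->
                case blob_store:fetch(Id) of          %% cache miss: go to disk
                    {ok, Blob} ->
                        true = ets:insert(blob_cache, {Id, Blob}),
                        {ok, Blob};
                    Error ->
                        Error
                end
        end.

    write(Id, Blob) ->
        ok = blob_store:store(Id, Blob),              %% write-through
        true = ets:insert(blob_cache, {Id, Blob}),
        ok.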

A: 

Mnesia can store data on disk just fine. There's also dets (disk-based term storage), which is roughly analogous to Berkeley DB. It's in the standard lib: http://www.erlang.org/doc/apps/stdlib/index.html
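For a feel of the API, a minimal dets sketch might look like this (table and file names are made up, and the size limit discussed in the comments below still applies):

    %% Minimal dets sketch: a set table keyed by blob id (file name made up).
    {ok, blobs} = dets:open_file(blobs, [{file, "/var/data/blobs.dets"},
                                         {type, set}]),
    ok = dets:insert(blobs, {<<"some-id">>, <<"blob bytes">>}),
    [{<<"some-id">>, <<"blob bytes">>}] = dets:lookup(blobs, <<"some-id">>),
    ok = dets:close(blobs).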

Dets is unusable in my project; quoting the documentation: "The size of Dets files cannot exceed 2 GB". Mnesia is based on dets too, so it inherits the same restriction. As a workaround one can do partitioning (see the fragmented-table sketch below), but I suspect the performance will suffer. From my limited testing, dets is rather slow.
Ilya Martynov
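To illustrate the partitioning workaround mentioned above, a minimal sketch of a fragmented, disc-only Mnesia table, which spreads the data over many dets files (the module name, record, and fragment count are made up for illustration):

    %% Sketch: fragmented disc_only_copies table, so each fragment is its
    %% own dets file and stays under the 2 GB per-file limit.
    %% Module/record names and fragment count are illustrative only.
    -module(blob_mnesia).
    -export([install/0, store/2, fetch/1]).

    -record(blob, {id, data}).

    install() ->
        ok = mnesia:create_schema([node()]),
        ok = mnesia:start(),
        {atomic, ok} =
            mnesia:create_table(blob,
                [{attributes, record_info(fields, blob)},
                 {frag_properties, [{n_fragments, 64},
                                    {n_disc_only_copies, 1},
                                    {node_pool, [node()]}]}]),
        ok.

    store(Id, Data) ->
        Write = fun() -> mnesia:write(#blob{id = Id, data = Data}) end,
        mnesia:activity(transaction, Write, [], mnesia_frag).

    fetch(Id) ->
        Read = fun() -> mnesia:read({blob, Id}) end,
        mnesia:activity(transaction, Read, [], mnesia_frag).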
I'd guess the 2 GB dets limit only exists on 32-bit architectures... Ask the Erlang mailing list; it's probably a better place than here for Erlang questions, anyhow.
+1  A: 

Have you looked at what CouchDB is doing? It might not be quite what you are after as a drop-in product, but there is lots of Erlang code in there for storing data. There is also some talk of providing a native Erlang interface instead of the REST API.

kerrr
+4  A: 

tcerl grew out of running into the same size limit. I'm not using Erlang these days, but it sounds like what you're looking for.

Darius Bacon
Thanks for the reply, though it is a bit too late - I'm already playing with tcerl in my application :)
Ilya Martynov
+1  A: 

Is there any reason why you can't just use a file system, treating the filename as your string id and the file contents as the binary blob? You can choose a filesystem that fits your performance requirements, and you should get caching basically for free, provided by your OS.
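A minimal sketch of that approach, with a hash-based directory fan-out so no single directory grows too large (module name and root path are made up):

    %% Sketch: one file per blob, sharded into subdirectories by a hash of
    %% the id so directories stay small. Root path is illustrative only.
    -module(fs_blob_store).
    -export([store/2, fetch/1]).

    -define(ROOT, "/var/data/blobs").

    path(Id) ->
        <<A, B, _/binary>> = erlang:md5(Id),
        filename:join([?ROOT, integer_to_list(A), integer_to_list(B), Id]).

    store(Id, Blob) when is_binary(Blob) ->
        Path = path(Id),
        ok = filelib:ensure_dir(Path),
        file:write_file(Path, Blob).            %% ok | {error, Reason}

    fetch(Id) ->
        file:read_file(path(Id)).               %% {ok, Binary} | {error, enoent}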

Michał Kwiatkowski
I actually tried this and found it to be somewhat slower than the tcerl-based implementation. I didn't bother to tune the filesystem though, and while it was slower than tcerl, it was still fast enough for my requirements, at least in basic benchmarks.
Ilya Martynov
A: 

I would recommend Apache CouchDB.

It's a great fit for Erlang, and from the sound of it (you mention ID-based blobs and don't mention any relational requirements), you're looking for a document-oriented database.

Since the interface is REST, you can very simply add a commodity HTTP cache in front of it if you need caching.
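As an illustration, a sketch of talking to CouchDB's HTTP API from Erlang with the stock inets httpc client (the URL, database name, and document body are assumptions; in practice large blobs would more likely be stored as attachments than as base64 inside the JSON document):

    %% Sketch: writing and reading a blob document over CouchDB's REST API
    %% with the stock httpc client. URL and database name are made up.
    inets:start(),   %% may already be started
    Url = "http://localhost:5984/blobs/some-id",
    Doc = <<"{\"data\":\"...base64-encoded blob...\"}">>,
    {ok, {{_, 201, _}, _, _}} =
        httpc:request(put, {Url, [], "application/json", Doc}, [], []),
    %% Updating an existing document would additionally require its _rev.
    {ok, {{_, 200, _}, _, Body}} =
        httpc:request(get, {Url, []}, [], []).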

The documentation for CouchDB is of a very high quality.

It also has built-in Map-Reduce :)

bjnortier