I need to write a C++ application that reads and writes large amounts of data (more than the available RAM), but always sequentially.

To keep the data future-proof and easy to document, I use Protocol Buffers. A single Protocol Buffers message, however, is not designed to hold large amounts of data.

My previous solution consisted of creating one file per data unit (and storing them all in one directory), but this does not seem particularly scalable.

This time I would like to try an embedded database. To get similar functionality I only need to store key -> value associations (so SQLite seems like overkill). The values will be the binary serialization output of Protocol Buffers.

I expect the database to handle the "what to keep in memory, what to move to disk ASAP" issue, the "how to efficiently store a large amount of data on disk" issue and, ideally, to optimize my sequential read pattern (by prefetching the next entries).

While searching, I was surprised by the lack of alternatives. I do not want to run the database in a separate process, because I do not need this separation (this rules out Redis).

The only option I found was Berkeley DB, but it has an unpleasant low-level C API. The best option I found after that was stldb4, built on top of Berkeley DB. Its API seems quite nice and fits my needs.

However, I am worried. stldb4 seems an odd (it depends on libferris components) and unmaintained (last release one year ago) solution to a problem I would have thought to be quite common.

Does anyone have a better suggestion for how to handle this?

Thanks for your answers.

A: 

Berkeley DB seems to fit your needs. Sure, its API is a bit awkward, but if you would rather have a nicer API, SQLite might be a better solution, even though I think its performance might not be as good.

Gianni
+1  A: 

I think I have found the answer to my problem.

I had not noticed that Berkeley DB provides two APIs for C++: a C++ wrapper around the C API, and an STL API.

The STL API provides STL-compatible vector and map abstractions that give direct access to the database. Thus writing value = data_container[key] becomes possible.

This seems to be the best solution for me: using the Berkeley DB STL API directly, together with Protocol Buffers.
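
To make the intended access pattern concrete, here is a minimal sketch of storing serialized records under integer keys and then reading them back in one sequential pass. It does not use Berkeley DB or Protocol Buffers: std::map is only an in-memory stand-in for the database-backed map abstraction, and Sample, serialize, deserialize, put and sum_values are hypothetical names where a generated Protocol Buffers message and its own serialization would normally go.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical record type standing in for a generated Protocol Buffers message.
struct Sample {
    std::uint32_t id;
    std::uint32_t value;
};

// Stand-in for the message's own serialization: packs both fields
// into an 8-byte little-endian string.
std::string serialize(const Sample& s) {
    std::string out(8, '\0');
    for (int i = 0; i < 4; ++i) {
        out[i]     = static_cast<char>((s.id    >> (8 * i)) & 0xFF);
        out[4 + i] = static_cast<char>((s.value >> (8 * i)) & 0xFF);
    }
    return out;
}

// Inverse of serialize(); expects exactly the 8 bytes produced above.
Sample deserialize(const std::string& bytes) {
    assert(bytes.size() == 8);
    Sample s{0, 0};
    for (int i = 0; i < 4; ++i) {
        s.id    |= static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i]))     << (8 * i);
        s.value |= static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[4 + i])) << (8 * i);
    }
    return s;
}

// Assumption: an in-memory std::map plays the role of the persistent,
// database-backed map; keys are record numbers, values are serialized bytes.
using Store = std::map<std::uint64_t, std::string>;

// Store one record under its key, using map-style element access.
void put(Store& store, std::uint64_t key, const Sample& s) {
    store[key] = serialize(s);
}

// Sequential pass over all records in key order, decoding one at a time
// so that only a single record needs to be held in decoded form.
std::uint64_t sum_values(const Store& store) {
    std::uint64_t total = 0;
    for (const auto& entry : store) {
        total += deserialize(entry.second).value;
    }
    return total;
}
```

For example, after put(store, 1, Sample{1, 10}) the record can be read back with deserialize(store[1]). If the database-backed container really is STL-compatible, the same put and iteration code should in principle carry over with only the Store alias changed, but that is untested here.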

rodrigob