I've got a C++ program that's likely to generate a HUGE amount of data -- billions of binary records of varying sizes, most probably less than 256 bytes but a few stretching to several K. Most of the records will seldom be looked at by the program after they're created, but some will be accessed and modified regularly. There's no way to tell which are which when they're created.
Considering the volume of data, there's no way I can store it all in memory. But as the data only needs to be indexed and accessed by its number (a 64-bit integer), I don't want the overhead of a full-fledged database program. Ideally I'd like to treat it as an std::map with its data stored on disk until requested.
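Roughly the kind of thing I mean, as a home-brew sketch: keep only an id-to-location index in memory and leave the record bytes on disk until asked for. (DiskRecordStore is just a made-up name, not an existing library; this toy version never reclaims space, has no crash recovery, and even the in-memory index alone would get large at billions of records, which is part of why I'd rather use something already written.)

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Toy disk-backed record store: record bytes live in one big file, only a
// small id -> (offset, size) index stays in RAM. Writes always append, so
// updated records leave dead space behind, and there is no crash recovery --
// exactly the parts I'd rather get from a library.
class DiskRecordStore {
public:
    explicit DiskRecordStore(const std::string& path)
        : file_(path, std::ios::in | std::ios::out |
                      std::ios::binary | std::ios::app) {}

    void put(std::uint64_t id, const std::vector<char>& data) {
        file_.seekp(0, std::ios::end);
        Entry e{static_cast<std::uint64_t>(file_.tellp()), data.size()};
        file_.write(data.data(), static_cast<std::streamsize>(data.size()));
        file_.flush();
        index_[id] = e;  // any previous copy of the record is simply orphaned
    }

    std::vector<char> get(std::uint64_t id) {
        const Entry& e = index_.at(id);  // throws if the id is unknown
        std::vector<char> data(e.size);
        file_.seekg(static_cast<std::streamoff>(e.offset));
        file_.read(data.data(), static_cast<std::streamsize>(e.size));
        return data;
    }

private:
    struct Entry { std::uint64_t offset; std::size_t size; };
    std::fstream file_;
    std::unordered_map<std::uint64_t, Entry> index_;
};
```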
Is there an already-written library that will do what I'm looking for, or do I need to write it myself?
EDIT: After some thought, I realized that Rob Walker's answer had a valid point: I'd be hard-pressed to get anywhere near the same kind of data integrity out of a home-brew class that I'd get from a real database.
Although BerkeleyDB (as suggested by RHM) looks like it would do exactly what we're looking for, the dual-licensing is a headache that we don't want to deal with. When we're done with the code and can prove that it would benefit noticeably from BerkeleyDB (which it probably would), we'll reexamine the issue.
I did look at Ferruccio's suggestion of stxxl, but I wasn't able to tell how it would handle the program being interrupted and restarted (possibly with changes to the code). With that much data, I'd hate to scrap everything it had already completed and start over on every run if some of that data could have been saved.
So we've decided to use an SQLite database, at least for the initial development. Thanks to everyone who answered or voted.
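For the record, the SQLite side of it looks simple enough. This is only a minimal sketch of the kind of schema and access we have in mind, using the SQLite C API; error handling is mostly omitted, and the table and column names are just illustrative.

```cpp
#include <cstdint>
#include <vector>
#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    sqlite3_open("records.db", &db);

    // One table mapping the 64-bit record id to a BLOB.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS records ("
        "  id   INTEGER PRIMARY KEY,"   // alias for the 64-bit rowid
        "  data BLOB NOT NULL)",
        nullptr, nullptr, nullptr);

    // Insert or update a record by id.
    std::vector<char> payload(200, 'x');
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO records(id, data) VALUES(?, ?)", -1,
        &stmt, nullptr);
    sqlite3_bind_int64(stmt, 1, 12345);
    sqlite3_bind_blob(stmt, 2, payload.data(),
                      static_cast<int>(payload.size()), SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);

    // Fetch it back by id.
    sqlite3_prepare_v2(db, "SELECT data FROM records WHERE id = ?", -1,
                       &stmt, nullptr);
    sqlite3_bind_int64(stmt, 1, 12345);
    if (sqlite3_step(stmt) == SQLITE_ROW) {
        const void* blob = sqlite3_column_blob(stmt, 0);
        int size = sqlite3_column_bytes(stmt, 0);
        std::vector<char> record(static_cast<const char*>(blob),
                                 static_cast<const char*>(blob) + size);
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
}
```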