
I've got a C++ program that's likely to generate a HUGE amount of data -- billions of binary records of varying sizes, most probably less than 256 bytes but a few stretching to several K. Most of the records will seldom be looked at by the program after they're created, but some will be accessed and modified regularly. There's no way to tell which are which when they're created.

Considering the volume of data, there's no way I can store it all in memory. But as the data only needs to be indexed and accessed by its number (a 64-bit integer), I don't want the overhead of a full-fledged database program. Ideally I'd like to treat it as an std::map with its data stored on disk until requested.
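To make that concrete, the kind of interface I have in mind looks roughly like this (DiskMap and its methods are purely illustrative, not an existing library):

    #include <stdint.h>
    #include <string>
    #include <vector>

    // Purely illustrative: records are blobs of bytes keyed by a 64-bit
    // integer, kept on disk and only pulled into memory when requested.
    class DiskMap
    {
    public:
        explicit DiskMap(const std::string& path);                 // open or create backing storage
        void put(uint64_t key, const std::vector<char>& record);   // insert or overwrite
        bool get(uint64_t key, std::vector<char>& record) const;   // false if key absent
        void erase(uint64_t key);
    };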

Is there an already-written library that will do what I'm looking for, or do I need to write it myself?

EDIT: After some thought, I realized that Rob Walker's answer had a valid point: I'd be hard-pressed to get anywhere near the same kind of data integrity out of a home-brew class that I'd get from a real database.

Although BerkeleyDB (as suggested by RHM) looks like it would do exactly what we're looking for, the dual-licensing is a headache that we don't want to deal with. When we're done with the code and can prove that it would benefit noticeably from BerkeleyDB (which it probably would), we'll reexamine the issue.

I did look at Ferruccio's suggestion of stxxl, but I wasn't able to tell how it would handle the program being interrupted and restarted (maybe with changes). With that much data, I'd hate to just scrap what it had already completed and start over every time, if some of the data could be saved.

So we've decided to use an SQLite database, at least for the initial development. Thanks to everyone who answered or voted.

+4  A: 

BerkeleyDB might be good for you. It indexes based on a string rather than a number, but you could format your number as hex. It's supposed to be pretty much as fast as it gets for disk-based key/value lookup.
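For what it's worth, the Berkeley DB C API takes keys as DBT byte buffers, so the raw bytes of the 64-bit integer can serve as the key directly. A rough sketch (error handling omitted, file name arbitrary):

    #include <db.h>       // Berkeley DB C API
    #include <stdint.h>
    #include <string.h>

    // Open (or create) a B-tree keyed by the raw bytes of a 64-bit integer
    // and store one variable-length record. Error checks omitted for brevity.
    int store_record(uint64_t id, const void* buf, size_t len)
    {
        DB* dbp;
        if (db_create(&dbp, NULL, 0) != 0)
            return -1;
        if (dbp->open(dbp, NULL, "records.db", NULL, DB_BTREE, DB_CREATE, 0664) != 0)
            return -1;

        DBT key, data;
        memset(&key, 0, sizeof key);
        memset(&data, 0, sizeof data);
        key.data  = &id;          key.size  = sizeof id;
        data.data = (void*)buf;   data.size = (u_int32_t)len;

        int ret = dbp->put(dbp, NULL, &key, &data, 0);
        dbp->close(dbp, 0);
        return ret;
    }

Raw integer keys won't sort numerically in the B-tree, which only matters for range scans; for exact-key lookup it makes no difference, or a custom comparison function can fix the ordering.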

U62
I believe BerkeleyDB allows you to define comparison functions to support any kind of indexing, not just strings.
Ferruccio
+5  A: 

I doubt you will find a library that meets your requirements exactly, so you'll have to decide which 'features' are really important to you and then decide whether an existing DB solution comes close enough.

Billions of records is a large dataset by any stretch. What rate are records generated at? How long do they persist? Does the access pattern change over time?

Are updates always with the same amount of data as the original?

I would suggest proving definitively that a DB solution isn't going to work before starting to roll your own, particularly if integrity of the data is paramount (and it usually is...) Maintaining that volume of data on disk reliably can definitely be a challenge. Do you need any kind of transaction semantics when changing the data? Is the client multithreaded?

Rob Walker
+6  A: 

Take a look at STXXL.

stxxl::map<> looks like it does exactly what you need.

Ferruccio
No, it won't work for his data. First reason: STXXL wants all records to have exactly the same size (PODs). Second: it doesn't allow you to save the data structure to disk and load it later, see here http://sourceforge.net/projects/stxxl/forums/forum/446474/topic/3537658
+1  A: 

You will probably have to roll your own. I'd probably stick it in a couple of MySQL tables and lazy-load a fixed-size map (LRU). If you really want to avoid a DB, place the <256-byte (or whatever length) records in fixed-record random-access files and store the larger records as individual files.
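A rough sketch of the fixed-record part of that idea, using only standard iostreams and assuming keys can be used directly as slot numbers (the slot size and layout are arbitrary):

    #include <stdint.h>
    #include <fstream>
    #include <vector>

    const std::streamoff SLOT_SIZE = 256;   // one fixed-size slot per key

    // Write a record (at most SLOT_SIZE - 4 bytes) into its slot.
    // The fstream must be opened with ios::in | ios::out | ios::binary.
    void write_slot(std::fstream& file, uint64_t key, const std::vector<char>& rec)
    {
        file.seekp(static_cast<std::streamoff>(key) * SLOT_SIZE);
        uint32_t len = static_cast<uint32_t>(rec.size());
        file.write(reinterpret_cast<const char*>(&len), sizeof len);
        if (len)
            file.write(&rec[0], len);
        file.flush();
    }

    // Read a record back from its slot.
    std::vector<char> read_slot(std::fstream& file, uint64_t key)
    {
        file.seekg(static_cast<std::streamoff>(key) * SLOT_SIZE);
        uint32_t len = 0;
        file.read(reinterpret_cast<char*>(&len), sizeof len);
        std::vector<char> rec(len);
        if (len)
            file.read(&rec[0], len);
        return rec;
    }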

Ray Tayek
A: 

Depending on the performance characteristics you need, the answer is different. But given just the information in the problem description, I think a DB is overkill, and could actually be counterproductive.

Saving each entry as a file whose name is its key (i.e. the key '1' corresponds to the file '1.dat' on disk) immediately after it's generated is a simple solution that avoids several problems. Assuming you have control over what filesystem the software is going to run on, if you pick a filesystem with good integrity, your data should have good integrity. In particular:

  - You could write lots of code for grouping entries together in one file and then have to worry about resizing, or you could just let the filesystem handle that for you (it's designed to deal with files changing size).
  - You could worry about writing them into that file in a threadsafe manner, or you could just let the filesystem handle that for you (filesystems are designed to have different processes writing to different files simultaneously).
  - You could worry about files getting saved partially to disk and write code to check for it, or you could let the filesystem handle that for you (journaling and atomic writes).
  - You could worry about scheduling writes of changes together for speed, or you could let the filesystem handle this for you (write caching).

Basically, a good filesystem and OS should handle all of this for you, and adding a database on top of it that tries to duplicate all of this functionality just creates more complexity and more potential for bugs. If you need to index the data by different fields, then a database may make sense, but in your description you said you only need to index the data by the same integer key every time.
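A minimal sketch of the file-per-key approach, using only the standard library (the data/ directory and .dat suffix are arbitrary choices):

    #include <stdint.h>
    #include <fstream>
    #include <iterator>
    #include <sstream>
    #include <string>
    #include <vector>

    // Map a 64-bit key to a file name, e.g. 42 -> "data/42.dat".
    std::string path_for(uint64_t key)
    {
        std::ostringstream oss;
        oss << "data/" << key << ".dat";
        return oss.str();
    }

    // Save one record; the filesystem handles allocation, resizing and caching.
    void save_record(uint64_t key, const std::vector<char>& rec)
    {
        std::ofstream out(path_for(key).c_str(), std::ios::binary | std::ios::trunc);
        if (!rec.empty())
            out.write(&rec[0], static_cast<std::streamsize>(rec.size()));
    }

    // Load one record back; returns an empty vector if the file does not exist.
    std::vector<char> load_record(uint64_t key)
    {
        std::ifstream in(path_for(key).c_str(), std::ios::binary);
        return std::vector<char>(std::istreambuf_iterator<char>(in),
                                 std::istreambuf_iterator<char>());
    }

One practical caveat: with billions of records you would want to fan the files out over subdirectories (say, by hashing the key), since most filesystems degrade with that many entries in a single directory.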

Joseph Garvin
+2  A: 

I've used Gigabase (http://www.garret.ru/gigabase.html) in several projects. It has a neat C++ interface, I've worked with millions of records without problems, and it supports rollback. It has an MIT-like license, and the author is very quick to answer questions and fix bugs.

Ismael
+2  A: 

You could use SQLite, which is an open-source database released into the public domain.

http://www.sqlite.org/

I'll quote their page:

SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.

And

Ongoing development and maintenance of SQLite is sponsored in part by SQLite Consortium members, including: Adobe, Symbian, Bloomberg, Mozilla

If you need a lightweight DB, this might just be it.
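If you go this route, the schema can be as small as an INTEGER PRIMARY KEY plus a BLOB column. A minimal sketch with the SQLite C API (error handling trimmed):

    #include <sqlite3.h>
    #include <stdint.h>

    // Open the database and create the single key->blob table if needed.
    sqlite3* open_store(const char* filename)
    {
        sqlite3* db = 0;
        sqlite3_open(filename, &db);
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS records ("
            "  id   INTEGER PRIMARY KEY,"
            "  data BLOB NOT NULL)",
            0, 0, 0);
        return db;
    }

    // Insert or replace one record keyed by its 64-bit id.
    void put_record(sqlite3* db, int64_t id, const void* buf, int len)
    {
        sqlite3_stmt* stmt = 0;
        sqlite3_prepare_v2(db,
            "INSERT OR REPLACE INTO records (id, data) VALUES (?, ?)",
            -1, &stmt, 0);
        sqlite3_bind_int64(stmt, 1, id);
        sqlite3_bind_blob(stmt, 2, buf, len, SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_finalize(stmt);
    }

Wrapping batches of inserts in a single BEGIN/COMMIT transaction makes a large difference to write throughput, since SQLite otherwise syncs to disk after every statement.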

A: 

I agree with others that BerkeleyDB, SQLite, or Gigabase should be good solutions.

But writing your own solution should not be too hard either.

I have a simple solution, but there are three prerequisites:

  1. You can at least keep a std::vector<int64> with numkey elements in memory.
  2. Your keys are or can be made continuous.
  3. Once a record has been written to the file, it has a fixed maximum size (maxsize), i.e. its size cannot be increased.

If these prerequisites are fulfilled, the straightforward solution is to store the file position (int64) of each key (int64) in the in-memory vector. For a lookup, just retrieve the file position from the vector, seek to that position, read the record size stored as the record's first entry, and then read size bytes.
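A sketch of that lookup, assuming the offsets vector has already been built (for example while writing the file) and that each record is stored as a uint32_t size followed by the payload:

    #include <stdint.h>
    #include <fstream>
    #include <vector>

    // offsets[key] holds the file position of that key's record.
    std::vector<char> read_record(std::ifstream& file,
                                  const std::vector<int64_t>& offsets,
                                  uint64_t key)
    {
        file.seekg(static_cast<std::streamoff>(offsets[key]));
        uint32_t size = 0;
        file.read(reinterpret_cast<char*>(&size), sizeof size);
        std::vector<char> record(size);
        if (size)
            file.read(&record[0], size);
        return record;
    }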

'Fraid the problem precludes your #2 requirement: keys are only used once, and some will be deleted later (I don't know which ones at first), breaking continuity. And requirement #1 may become problematic too once the number of records grows. I'd wanted to avoid using an actual database, but so far that has proven to be the best solution. The project still isn't production-ready though, and this might change before it is.
Head Geek