Data structure/algorithm for variable length record storage and lookup on disk withsearch only on primary keys

views:

1304

answers:

+2 Q:

Data structure/algorithm for variable length record storage and lookup on disk withsearch only on primary keys

I am looking for an algorithm / data structure that works well for large block based devices (eg a mechanical hard drive) which is optimised for insert, get, update and delete where searches are always done using the id of the data and where the data fields for any ID have a variable length.

The B-Tree seems to be a commonly quoted structure but mainly for fixed length records. I also expect dramatically more gets and updates than I do inserts and deletes. Can I get rid of the O(log m) lookup of the B-tree?

I am quite happy for it to be a combined system, for instance ISAM combines a B-tree and linear file storage which looks like it can be made to work with variable length records as an approach. Is there something better?

Some further constraints:

1) IDs are potentially sparse but they can be made to come in blocks of linear numbers - but in a large range (64 bit)

2) I don't want to use a DBMS, performance for my particular problem hasn't proved very good. I don't need any of the operations a full DBMS uses, I don't need search. I need something I can tweak and optimise easily. Call it academic curiosity, if it is out performed by MySQL then I'll use that but I have to try and go faster.

3) The dataset is larger than can fit in memory, the index however may well fit into memory if its as simple as key, offset. I am certainly looking at something like 1 billion entities or more in storage.

4) Ideally space should be recovered when a record is deleted. That may be via compaction but I am interested to see if there is a better way (A B-tree for instance recovers space easily).

If your IDs are numbers and not very sparse, one option would be to use a simple table of (offset, length) in one file, referencing the data in another file. This would get you O(1) lookup, and updates/inserts/deletes bound only by your free-space tracking mechanism.

bdonlan 2009-05-17 19:43:36

Even if they're not sparse, you might get O(1) by storing them in a hash map?

ChrisW 2009-05-17 20:45:08

I'm assuming the dataset is too large to fit in memory. Otherwise one can indeed just store things in a hash in memory, write an append-only log and periodically serialize a dump of all the data in the database.

bdonlan 2009-05-17 20:56:11

IMO a database engine will often have the index in RAM, even if the rest of the data is on disk.

ChrisW 2009-05-17 21:08:01

One would hope, but there's also a separate copy on disk in most cases, which is just mapped into memory for fast access. Since the OP hasn't given very many constraints about his problem it's hard comment further though.

bdonlan 2009-05-17 21:28:12

Best might be to use a commercial database engine.

You might get rid of any O(log m) lookup of a B-tree by storing the index, i.e. the {"logical ID" maps to "physical location"} value pairs, in a hash map (hashing on the logical ID) ... or, even, storing the index in a contiguous vector (with the logical ID used as an index into the vector of offset values), as bdonlan suggested, if the ID values aren't sparse.

An impotant implementation detail might be what the API you use to access the index: whether you store it in RAM (which the O/S backs with the system page file) and access it in-process using pointers, and/or store it on disk (which the O/S caches in the file system cache) and access it using file I/O APIs.

ChrisW 2009-05-17 20:57:46

One thing to note is that if the map is in cache, O(lg m) is really not all that expensive...

bdonlan 2009-05-17 21:28:39

It's still expensive: given even only one million records, a binary search takes 20 operations ... which is 20 times slower than one operation, which is significant if e.g. they're disk operations.

ChrisW 2009-05-17 21:42:27

20 operations? The number of operations in a B-tree is log_k n with k as the order of the tree. log _k n is of course O(lg m). In normal B-tree k will be much larger than 2. A B-tree is not a binary tree.

dmeister 2009-05-18 07:00:19

Ok, thanks for posting: now I see what a B-tree is, which I didn't before.

ChrisW 2009-05-18 09:10:51

If a database is to heavy weight for you, consider a key-value store.

If you really what to implement it yourself, use a disk-based hash table or a B-tree. To avoid the problems with the variable-length values, store the values in an separate file and use the B-tree as index for the data file. Space-reclaimation after deletion of values will be tricky, but it is possible (e.g. by a bitset for freed space in the data file).

dmeister 2009-05-18 07:03:30

+5 A:

The easy way: Use something like Berkeley DB. It provides a key-value store for arbitrary byte strings, and does all the hard work for you. It even provides 'secondary databases' for indexing, if you want it.

The do-it-yourself way: Use Protocol Buffers (or the binary format of your choice) to define B-Tree node and data item structures. Use an append-only file for your database. To write a new record or modify an existing record, you simply write the record itself to the end of the file, then write any modified B-Tree nodes (eg, the record's parent node, its parent node, and so forth up to the root). Then, write the location of the new root of the tree to the header block at the beginning of the file. To read the file, you simply find the most recent root node and read the B-Tree like you would in any other file. This approach has several advantages:

Since written data is never modified, readers don't need to take locks, and get a 'snapshot' view of the DB based on the root node at the time they started reading.
By adding 'previous version' fields to your nodes and records, you get the ability to access previous versions of the DB essentially for free.
It's really easy to implement and debug compared to most on-disk file formats that support modification.
Compacting the database consists of simply reading out the latest version of the data and B-Tree and writing it to a new file.

Nick Johnson 2009-05-18 10:23:00

ansaurus

tags:

views:

answers:

Data structure/algorithm for variable length record storage and lookup on disk withsearch only on primary keys

related questions