views: 448

answers: 8
I'm wondering what the most generally efficient tree structure would be for a collection that has the following requirements:

  1. The tree will hold anywhere between 0 and 2^32 - 1 items.
  2. Each item will be a simple structure, containing one 32-bit unsigned integer (the item's unique ID, which will be used as the tree value) and two pointers.
  3. Items will be inserted and removed from the tree very often; some items in the tree will remain there for the duration of the program, while others will only be in the tree very briefly before being removed.
  4. Once an item is removed, its unique ID (that 32-bit unsigned integer) will be recycled and reused for a new item.

The tree structure needs to support efficient inserts and deletions, as well as quick lookups by the unique ID. Also, finding the first available unused unique ID needs to be a fast operation.

What sort of tree would be best-suited for these requirements?

EDIT: This tree is going to be held only in memory; at no point will it be persisted to disk. I don't need to worry about hitting the disk, or disk caching, or anything of the sort. This is also why I'm not looking into using something like SQLite.

+5  A: 

Depending on how fast you need this to be you might just treat the whole thing as a single, in-memory table mmap-ed onto a file. Addressing is by direct computation. You can simply chain the free slots so you always know exactly where the next free one is. Most accesses will have a max of 1 or 2 disk accesses (depending on underlying filesystem requirements). Put a buttload of memory on the machine and you might not hit the disk at all.

I know this sounds pretty brute force, but you'd be amazed how fast it can be.
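Here's a minimal sketch of the idea (my code, with my own names like SlotTable, not part of the answer): a directly indexed slot table where the ID is the array index and the free slots are chained through the slots themselves. It uses an ordinary in-memory allocation; backing it with mmap (or MapViewOfFile on Windows) would be a drop-in change.

    #include <cstdint>
    #include <vector>

    // Direct-indexed slot table: the slot index *is* the item's 32-bit ID,
    // so lookup is a single array access. Free slots are chained through
    // the slots themselves, so allocate/release are O(1).
    struct Slot {
        uint32_t next_free; // free-list link while the slot is empty
        void*    ptr_a;     // the item's two pointers
        void*    ptr_b;
        bool     in_use;
    };

    class SlotTable {
    public:
        explicit SlotTable(uint32_t capacity)
            : slots_(capacity), free_head_(0) {
            for (uint32_t i = 0; i < capacity; ++i)
                slots_[i].next_free = i + 1; // 'capacity' acts as "none"
        }

        uint32_t allocate() {            // precondition: table not full
            uint32_t id = free_head_;
            free_head_ = slots_[id].next_free;
            slots_[id].in_use = true;
            return id;
        }

        void release(uint32_t id) {      // recycles the ID immediately
            slots_[id].in_use = false;
            slots_[id].next_free = free_head_;
            free_head_ = id;
        }

        Slot& lookup(uint32_t id) { return slots_[id]; }

    private:
        std::vector<Slot> slots_;
        uint32_t free_head_;
    };

Note the chain hands IDs back in LIFO order rather than lowest-first; if you strictly need the lowest unused ID, you'd pair this with a bitmap over the slots.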

Update in response to: "I'm not looking for a disk-persistable solution"

Well, if you truly are going to have as many as 2^32 items in this structure (times the size of each item) then you either need enough memory on the machine to hold this puppy or the kernel will start to swap things in and out of memory for you. This still translates to hitting the disk. If you let it swap, don't forget to check the size of the swap area; there's a good chance you'll have to bump it up. Using mmap (or something similar) is sort of like creating your own private swap area, and it will probably have less impact on other processes running on the same system.

I'll note that once this thing exceeds your available physical memory (whether you are using swap space or mmap or B-trees or red-black trees or extensible hashing or whatever) it becomes critical to understand your access pattern. If you are hopscotching all over the place you're going to be hitting the disk a lot. One of the primary reasons for using a structure like a B-tree (or any one of several similar structures) is that the top level of the tree (containing the index) tends to stay in memory (because most paging algorithms use LRU) and you only eat a disk access when you touch a leaf page.

Bottom line: it either fits in memory or it doesn't. If it doesn't, then your 10^-9 sec memory access turns into a 10^-3 sec disk access. I.e. 1 million times slower. TANSTAAFL!

Peter Rowell
I will look into this. I'm developing this on Windows, so would MapViewOfFile (http://msdn.microsoft.com/en-us/library/aa366761%28VS.85%29.aspx) and related functions be the equivalent of what you're talking about?
Adam Maras
From just a quick glance it looks like it might be the equivalent. Note: we're talking a largish file here: 3 * 4 * 2^32 == 48GB approx. You may want/need to tune the filesystem. Linux uses ext3 as its default, and you can adjust the page size; I have no experience doing this with NTFS. Again, your need for speed may require deeper understanding of things like locality of reference. Also, is this being accessed by multiple processes, or just one? Locking could cause a major performance hit.
Peter Rowell
@Peter: It's even worse than that. Since there's more than 4 GB of data, he's probably not going to be able to get away with 32-bit pointers. Unfortunately, the next step up is usually 64-bit pointers, which is going to expand the data quite a bit...
Jerry Coffin
@Jerry: Unless the pointers are item IDs, in which case they will still be 32-bit. Of course the bottom line on this is, as it always is: you need to take a close look at the actual problem and the actual scale of the problem. You also have to look at the need/speed/value of the problem. If its value is high enough, you just throw more money/hardware at it. It hurts me to say that since I paid $13,000 for my first 1GB of disk... and it was a bargain at the time (1986).
Peter Rowell
A minor point: I think that using this idea means it would not be necessary to store the 4-byte ID. The offset would imply the ID. That would save 16GB.
Mark Wilkins
@MarkW: Doh! You are completely correct, although I'm not sure that reducing the file size by 33% is 'a minor point'. Depending on the specs of the system involved it might make a huge difference in terms of hitting the cache instead of the disk.
Peter Rowell
I've edited my question to clarify; I'm not looking for a disk-persistable solution.
Adam Maras
+1  A: 

I would go for a red-black tree, because it rebalances itself on insertion to keep insertion, deletion, and retrieval all efficient. An AVL tree is an option, but it's slightly slower for insertions because its balancing rules are more rigid.

http://en.wikipedia.org/wiki/Red-black_tree

http://en.wikipedia.org/wiki/AVL_tree
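For illustration (my sketch, not part of the answer): std::set and std::map are typically implemented as red-black trees, so you get the O(log n) behavior without writing the tree yourself.

    #include <cstdint>
    #include <set>

    int main() {
        std::set<uint32_t> ids; // typically a red-black tree internally

        ids.insert(42);                    // O(log n) insert
        bool present = ids.count(42) != 0; // O(log n) lookup by ID
        ids.erase(42);                     // O(log n) removal

        // Caveat: an ordered set of *used* IDs doesn't directly give you
        // the first *unused* ID; that needs a scan or a second structure.
        (void)present;
        return 0;
    }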

Shoko
+1  A: 

My reflex would be to reach for a standard implementation, such as the ones in the STL. But if you have reasons to implement your own, I would typically go for a red-black tree, which performs well on all operations. Alternatively, I would try a splay tree, which can be really fast but has amortized complexity, i.e. some individual operations might take a little longer.

Stay away from AVL trees, as you need to do a lot of updates. AVL trees are good when you have a lot of lookups but few updates, as the updates can be fairly slow.

svenningsson
+2  A: 

Do you expect your tree to really hold 2^32 - 1 entries? Even at half that, I would definitely try this with SQLite. You may be able to fit it all in memory, but if you page once, a database will be faster. Databases are meant to handle huge data sets efficiently, especially when the whole set won't fit in memory at once.

If you do intend to do this yourself, look at some database code and use a B-tree. A red-black tree will be faster with smaller datasets, but with that much data your bottleneck isn't going to be processor speed but memory and hard drive speed.

All that said I can't imagine a map of pointers that large being useful. You'll be pushing the limits of modern memory just storing the map. You won't have anything left over for the map to point to.

caspin
+2  A: 

Have you considered something like a trie? Lookup is linear in key length, which in your case means essentially constant, and storage can be more compact due to nodes sharing common substrings.

Keep in mind, though, that if your data set is actually filling large amounts of your key space your bigger efficiency concern is likely to be caching and disk access, not lookups.
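To make that concrete, here's a rough sketch of mine (not part of the answer) of a fixed-stride bitwise trie over the 32-bit ID, four bits per level, so any lookup or insert touches at most eight nodes; deletion and node reclamation are omitted for brevity.

    #include <cstdint>
    #include <cstring>

    // Fixed-stride bitwise trie over 32-bit keys: 4 bits per level, depth 8.
    // Every lookup or insert touches at most 8 nodes regardless of item count.
    struct TrieNode {
        TrieNode* child[16];
        bool present; // meaningful only at the leaf level (depth 8)
        TrieNode() : present(false) { std::memset(child, 0, sizeof child); }
    };

    void insert(TrieNode* root, uint32_t key) {
        TrieNode* n = root;
        for (int shift = 28; shift >= 0; shift -= 4) {
            unsigned idx = (key >> shift) & 0xF;
            if (!n->child[idx]) n->child[idx] = new TrieNode;
            n = n->child[idx];
        }
        n->present = true; // mark the key as stored
    }

    bool contains(const TrieNode* root, uint32_t key) {
        const TrieNode* n = root;
        for (int shift = 28; shift >= 0; shift -= 4) {
            n = n->child[(key >> shift) & 0xF];
            if (!n) return false;
        }
        return n->present;
    }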

camccann
I've edited my question to clarify; I'm not looking for a disk-persistable solution. I will look into possibly using a trie, though.
Adam Maras
A: 

boost::unordered_map has amortized constant time insertions, deletions and lookups. It's the best data structure for what you described.

Its only downside is that it's, well, unordered, as the name says. And if you're REALLY unlucky it could end up being linear time, if every single hash collides. However, that can easily be avoided using boost's default boost::hash function, and hashing integers is trivial, so that worst case scenario will not happen to you.

(Note: it's not a tree but a hash table, and you asked specifically for a tree. Maybe you assumed the most efficient structure would be some sort of tree; it's not.)
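Typical usage looks like this (my sketch; std::unordered_map, where available, behaves the same):

    #include <cstdint>
    #include <boost/unordered_map.hpp>

    struct Item {
        void* ptr_a; // the item's two pointers
        void* ptr_b;
    };

    int main() {
        // Expected O(1) insert, erase, and lookup; hashing a 32-bit
        // integer is trivial, so pathological collisions won't occur here.
        boost::unordered_map<uint32_t, Item> items;

        Item item = { 0, 0 };
        items[42] = item;                 // insert under the item's ID

        boost::unordered_map<uint32_t, Item>::iterator it = items.find(42);
        if (it != items.end())
            items.erase(it);              // removal by ID
        return 0;
    }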

Andreas Bonini
A: 

Why a tree at all?

To me it seems you need a database. If you expect a lower node count, a hash table could be enough.

I'm going to warn you about the memory. If you fill up the whole tree (2^32 items) you will need 16 GB for the 32-bit values themselves and another 32 GB for the pointers, even at 32 bits each. Consider the database, if this is likely.

Kugel
A: 

Each item is represented by a 32-bit identity, which is its key, and two pointers. Are the pointers associated with the tree, or do they have to do with the identity?

If they're just part of implementing the tree, ditch them. You don't need them. Represent whether a number is there or not as a bit in a really big bitmap. Finding the lowest unused bit isn't fast, but I don't think it can be. It's only about 512 MB of main memory, which isn't that bad.
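A sketch of that bitmap (mine, not from the answer); the scan tests 64 IDs per iteration, and __builtin_ctzll is the GCC/Clang intrinsic (MSVC has _BitScanForward64):

    #include <cstdint>
    #include <vector>

    // 2^32 presence bits = 512 MB. The scan for the lowest zero bit is
    // word-at-a-time, but worst case is still O(2^32 / 64) iterations.
    class IdBitmap {
    public:
        IdBitmap() : words_((1ULL << 32) / 64, 0) {} // 2^26 words = 512 MB

        void set(uint32_t id)   { words_[id >> 6] |=  (1ULL << (id & 63)); }
        void clear(uint32_t id) { words_[id >> 6] &= ~(1ULL << (id & 63)); }
        bool test(uint32_t id) const { return (words_[id >> 6] >> (id & 63)) & 1; }

        // Lowest unused ID; only meaningful if at least one ID is free.
        uint32_t lowest_unused() const {
            for (uint64_t w = 0; w < words_.size(); ++w) {
                if (words_[w] != ~0ULL) {
                    unsigned bit = __builtin_ctzll(~words_[w]); // lowest zero bit
                    return static_cast<uint32_t>((w << 6) | bit);
                }
            }
            return 0; // all 2^32 IDs in use
        }

    private:
        std::vector<uint64_t> words_;
    };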

If the pointers are meaningful data, use an array. You're going to have to allocate space for four giganodes plus pointers to make up the map anyway, so allocate space for four giganodes plus one indicator each for whether the node is active. Use memset() to set the whole thing to zero, and keep a lowest-unused-node pointer; use that to add a node. When you delete a node, mark it as unused, and use its two pointers to maintain a doubly-linked free list. You'll have to find the next lowest unused node, and that might take a while, but again I don't see how to keep this fast. (If you just need an unused node, not the lowest one, just put the released node anywhere on the free list.)

This is likely to take about 64G or 96G of RAM, but that's less than a map solution.

David Thornley