views: 3840
answers: 6
I know that performance is never black and white; often one implementation is faster in case X and slower in case Y, etc. But in general, are B-trees faster than AVL or red-black trees? They are considerably more complex to implement than AVL trees (and maybe even red-black trees?), but are they faster (does their complexity pay off)?

Edit: I should also like to add that if they are faster than the equivalent AVL/red-black tree (in terms of nodes/content), why are they faster?

A: 

They are used in different circumstances - B-trees are used when the tree nodes need to be kept together in storage - typically because storage is a disk page, so re-balancing could be very expensive. RB trees are used when you don't have this constraint. So B-trees will probably be faster if you want to implement (say) a relational database index, while RB trees will probably be faster for (say) an in-memory search.
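
To make the layout difference concrete, here is a minimal C++ sketch; the struct names, the 4 KiB page size and the fan-out of 255 are illustrative assumptions, not taken from any particular implementation:

    #include <cstddef>
    #include <cstdint>

    // A red-black tree node holds one key plus pointers; every node is a
    // separate allocation, so a lookup chases pointers from node to node.
    struct RBNode {
        int64_t key;
        RBNode* left;
        RBNode* right;
        RBNode* parent;
        bool    red;
    };

    // A B-tree node is sized to fill one storage block (4 KiB assumed here),
    // so a single disk read fetches many keys at once and re-balancing
    // moves whole pages rather than individual pointer-sized nodes.
    constexpr std::size_t kPageSize = 4096;
    constexpr std::size_t kFanout   = 255;   // chosen so the node fits a page

    struct BTreeNode {
        uint16_t count;                  // keys currently stored
        int64_t  keys[kFanout];          // sorted, contiguous within the page
        uint64_t children[kFanout + 1];  // child page numbers on disk
    };

    static_assert(sizeof(BTreeNode) <= kPageSize, "node must fit one page");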

anon
RB trees will not be faster for in-memory search. That time is gone.
Stephan Eggermont
+1  A: 

They're totally different in their use cases, so it's not possible to make a comparison.

RB trees are typically in-memory structures used to provide fast access (ideally O(logN)) to data. A lot of C++ STL implementations use them in the set class. RB trees are always binary trees.

B-trees are typically disk-based structures, and so are inherently slower than in-memory data. Most database products use some form of B-tree (e.g. the B*-tree) to manage their external data. B-trees are rarely binary trees; the number of children a node can have is typically a large number. The management of the B-tree structure can be quite complicated when the data changes.

B-trees try to minimize the number of disk accesses so that data retrieval is reasonably deterministic. It's not uncommon to see something like four B-tree accesses necessary to look up a bit of data in a very large database.

In most cases I'd say that in-memory RB trees are faster. Because the lookup is binary it's very easy to find something. B-trees can have multiple children per node, so on each node you have to scan the node to look for the appropriate child. This is an O(N) operation. On an RB-tree it'd be O(logN) since you're doing one comparison and then branching.

Sean
A few points: generally, the keys inside a B-tree node are sorted, so searching within a node is just logarithmic. For in-memory data, the differences between a B-tree and an RB tree are constant factors, not orders of magnitude. However, RB is still usually faster, and wastes less space.
Scott Wisniewski
-1 You are neglecting the effect of caching on RAM accesses. These days RAM is about half as far away from the CPU as the disk (taking the geometric mean as "half").
starblue
starblue is wrong: RAM is slower relative to CPU than it used to be, but it's still a hell of a lot faster than disk. 8ms is a fast disk seek, but RAM access is orders of magnitude faster than that.
Nick Johnson
Just to clarify a bit, B-trees try to reduce random accesses on disk, because moving the read heads back and forth takes much more time (and shortens the drive life span) than reading a number of sequential sectors.
TrayMan
"B-trees try to reduce [seeks]"; really? How? Compared to putting a red-black tree on the disk, a B-tree makes better use of each block transfered (by having big nodes), but how does it cut down on _seeks_ (as opposed to reads)? The only way I can see this happen is if you know the path and have the nodes (blocks) on the path laid out sequentially. Using multi-block nodes isn't going to help (much) because you you do binary search on them anyways.B-trees win over RB-trees as an on-disk data structure by cutting down on _reads_ (by a factor log-2 of the block size).
Jonas Kölker
starblue is right. Modern applications use B-tree-like structures for access to RAM, not binary trees.
Stephan Eggermont
+16  A: 

Nothing prevents a B-tree implementation that works only in memory. In fact, if key comparisons are cheap, an in-memory B-tree can be faster because packing multiple keys in one node causes fewer cache misses during searches. See this link for performance comparisons. A quote: "The speed test results are interesting and show the B+ tree to be significantly faster for trees containing more than 16,000 items." (A B+ tree is just a variation on the B-tree.)
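
If you want a rough feel for that cache effect without a full B+ tree implementation, a sorted array can stand in for "many keys stored contiguously" against std::set's one-node-per-key red-black tree. This is only a crude proxy for a B+ tree, not the real thing:

    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <random>
    #include <set>
    #include <vector>

    // Crude proxy benchmark: std::set is a red-black tree (one key per
    // heap-allocated node), while a sorted std::vector keeps all keys
    // contiguous. The lookup-time gap hints at the cache-miss effect the
    // linked speed test measures for real in-memory B+ trees.
    int main() {
        constexpr std::size_t n = 1 << 20;          // about a million keys
        std::mt19937_64 rng(42);
        std::vector<std::uint64_t> keys(n);
        for (auto& k : keys) k = rng();

        std::set<std::uint64_t> rb(keys.begin(), keys.end());
        std::vector<std::uint64_t> packed(keys.begin(), keys.end());
        std::sort(packed.begin(), packed.end());

        auto time_lookups = [&](const char* label, auto&& lookup) {
            auto t0 = std::chrono::steady_clock::now();
            std::size_t hits = 0;
            for (std::size_t i = 0; i < n; ++i)
                hits += lookup(keys[i]) ? 1 : 0;
            auto t1 = std::chrono::steady_clock::now();
            std::cout << label << ": " << hits << " hits in "
                      << std::chrono::duration<double, std::milli>(t1 - t0).count()
                      << " ms\n";
        };

        time_lookups("std::set (red-black)", [&](std::uint64_t k) { return rb.count(k) > 0; });
        time_lookups("sorted vector", [&](std::uint64_t k) {
            return std::binary_search(packed.begin(), packed.end(), k);
        });
    }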

zvrba
This goes directly into my bookmarks folder.
Konrad Rudolph
answer is kinda weak, but the link is golden.
caspin
+20  A: 

Sean's post (the currently accepted one) is full of nonsense. Sorry Sean, I don't mean to be rude; I hope I can convince you that my statement is based in fact.

They're totally different in their use cases, so it's not possible to make a comparison.

They're both used for maintaining a set of totally ordered items with fast lookup, insertion and deletion. They have the same interface and the same intention.

RB trees are typically in-memory structures used to provide fast access (ideally O(logN)) to data. [...]

always O(log n)

B-trees are typically disk-based structures, and so are inherently slower than in-memory data.

Nonsense. When you store search trees on disk, you typically use B-trees. That much is true. When you store data on disk, it's slower to access than data in memory. But a red-black tree stored on disk is also slower than a red-black tree stored in memory.

You're comparing apples and oranges here. What is really interesting is a comparison of in-memory B-trees and in-memory red-black trees.

[As an aside: B-trees, as opposed to red-black trees, are theoretically efficient in the I/O-model. I have experimentally tested (and validated) the I/O-model for sorting; I'd expect it to work for B-trees as well.]

B-trees are rarely binary trees; the number of children a node can have is typically a large number.

To be clear, the size range of B-tree nodes is a parameter of the tree (in C++, you may want to use an integer value as a template parameter).
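
A minimal sketch of what that could look like (the names and the default fan-out are made up for illustration):

    #include <cstddef>

    // Node size chosen at compile time via a template parameter, as
    // suggested above. "Fanout" and the default of 64 are arbitrary.
    template <typename Key, std::size_t Fanout = 64>
    struct BTreeNode {
        std::size_t count = 0;              // keys currently in this node
        Key         keys[Fanout];           // kept sorted
        BTreeNode*  children[Fanout + 1];   // unused in leaf nodes
        bool        leaf = true;
    };

    // Two instantiations with different fan-outs:
    using SmallNode = BTreeNode<int, 32>;
    using WideNode  = BTreeNode<int, 128>;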

The management of the B-tree structure can be quite complicated when the data changes.

I remember them to be much simpler to understand (and implement) than red-black trees.

B-trees try to minimize the number of disk accesses so that data retrieval is reasonably deterministic.

That much is true.

It's not uncommon to see something like four B-tree accesses necessary to look up a bit of data in a very large database.

Got data?

In most cases I'd say that in-memory RB trees are faster.

Got data?

Because the lookup is binary it's very easy to find something. B-trees can have multiple children per node, so on each node you have to scan the node to look for the appropriate child. This is an O(N) operation.

The size of each node is a fixed parameter, so even if you do a linear scan, it's O(1). And if we do take the node size as the variable, note that you typically keep the keys within a node sorted, so the search inside a node is logarithmic in the node size.
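
For illustration, the per-node step might look like this sketch, where std::lower_bound over the sorted key array is the binary search being described (the function name is invented):

    #include <algorithm>
    #include <cstddef>

    // Within-node search: the keys of one node form a small sorted array,
    // so picking the child to descend into is a binary search over at most
    // Fanout keys, a constant amount of work per node visited.
    template <typename Key, std::size_t Fanout>
    std::size_t child_index(const Key (&keys)[Fanout], std::size_t count,
                            const Key& target) {
        const Key* pos = std::lower_bound(keys, keys + count, target);
        return static_cast<std::size_t>(pos - keys);
    }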

On an RB-tree it'd be O(logN) since you're doing one comparison and then branching.

You're comparing apples and oranges. The O(log n) is because the height of the tree is at most O(log n), just as it is for a B-tree.

Also, unless you play nasty allocation tricks with the red-black trees, it seems reasonable to conjecture that B-trees have better caching behavior (they access an array instead of pointers strewn all over the place, and have less allocation overhead, which increases memory locality even more), which might help them in the speed race.

I can point to experimental evidence that B-trees (with size parameters 32 and 64, specifically) are very competitive with red-black trees for small sizes, and outperform them hands down for even moderately large values of n. See http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html

B-trees are faster. Why? I conjecture that it's due to memory locality, better caching behavior and less pointer chasing (which are, if not the same things, overlapping to some degree).

Jonas Kölker
Although useful and making good points, I will not vote for a post with this hostile tone.
San Jacinto
This directly contradicts what a lot of algorithm books say. On the other hand, it actually makes sense. +1 for insight.
Konrad Rudolph
Algorithm books usually say something about the machine assumptions in the front matter, and those assumptions are simply no longer valid.
Stephan Eggermont
A: 

They all have the same asymptotic behavior, so performance depends more on the implementation than on which type of tree you are using. Some combination of tree structures might actually be the fastest approach, where each node of a B-tree fits exactly into a cache line and some sort of binary tree is used to search within each node. Managing the memory for the nodes yourself might also enable you to achieve even greater cache locality, but at a very high price.
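
A rough sketch of the cache-line idea, assuming a 64-byte line (line size varies by CPU, and the field layout is invented for the example):

    #include <cstddef>
    #include <cstdint>

    // One node per cache line: a 4-byte header pair plus fourteen 32-bit
    // keys is exactly 64 bytes, so each node visited during a search costs
    // at most one cache miss; the search within the node is a binary search
    // over the small sorted key array.
    constexpr std::size_t kCacheLine = 64;

    struct alignas(kCacheLine) CacheLineNode {
        std::uint32_t count;        // keys in use
        std::uint32_t first_child;  // index of this node's first child in a node pool
        std::int32_t  keys[14];     // sorted keys
    };

    static_assert(sizeof(CacheLineNode) == kCacheLine, "one node per cache line");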

Personally, I just use whatever is in the standard library for the language I am using, since it's a lot of work for a very small performance gain (if any).

On a theoretical note... RB-trees are actually very similar to B-trees, since they simulate the behavior of 2-3-4 trees. AA-trees are a similar structure, which simulates 2-3 trees instead.

Jørgen Fogh
+14  A: 
Mecki
I wish I could give +10 to this answer, it's the best here (after the update, of course).
Luís Guilherme
this really is the best answer.
caspin