tags:

views:

191

answers:

7

Is an index not similar to a dictionary? If you have the key, you can immediately access it?

Apparently indexes are sometimes stored as B-Trees... why is that?

+3  A: 

Database Indices are usually (almost always) stored as B-Trees. And all balanced tree structures have O(log n) complexity for searching.

'Dictionary' is an 'Abstract Data Type' (ADT), ie it is a functional description that does not designate an implementation. Some dictionaries could use a Hashtable for O(1) lookup, others could use a tree and achieve O(log n).

The main reason a DB uses B-trees (over any other kind of tree) is that B-trees are self-balancing and are very 'shallow' (requiring little disk I/O)

Henk Holterman
All balanced trees do. A sufficiently degenerate tree is a linked list.
Vatine
@Vatine: You're right, I'll edit that in.
Henk Holterman
+3  A: 

One of the only data structures you can access immediately with a key is a vector, which for a massive amount of data, becomes inefficient when you start inserting and removing elements. It also needs contiguous memory allocation.

A hash can be efficient but needs more space and will end up with collisions.

A B tree has a good balance between search performance and space.

Andres
+5  A: 

Dictionaries are not implicitly sorted, B-Trees are.

A B-Tree index can be used for ranged access, like this:

WHERE col1 BETWEEN value1 AND value2

or ordering, like this:

ORDER BY col1

You cannot immediately access a page in a B-Tree index: you need to traverse the child pages whose number increases logarithmically.

Some databases support dictionary-type indexes as well, namely, HASH indexes, in which case the search time is constant. But such indexes cannot be used for ranged access or ordering.

Quassnoi
A dictionary could be sorted. It just doesn't have to be.
Henk Holterman
@Henk: corrected to clarify this.
Quassnoi
@Henk: I think dictionaries are referring to hashtables with O(1) access. A dictionary could be sorted, but to do that sorting you will either have some linear structure (i.e. O(N) queries) or a tree structure (O(logN)) underneath.
Il-Bhima
@Il-Bhima: I know of Dictionary classes that are sorted.
Henk Holterman
@Henk: Ok, but how are they implemented? I would be very surprised (please tell me if this is so) to find a hash-table which can sort data without having some underlying ordered data-structure. You can define a Dictionary class interface on a tree, but that doesn't make it a dictionary in the usual sense.
Il-Bhima
@Il-Bhima: Sorted dictionaries will usually use a Tree.
Henk Holterman
Quassnoi: Sorry, I misread the child-pages part.
Henk Holterman
+1  A: 

hashindex (eg. in mysql and postgres) has constant complexity (O(1)) for search.

 CREATE INDEX ... USING HASH 
osgx
wouldn't it have constant time complexity? Linear is the worst possible, i.e. searching without an index.
Il-Bhima
@Il-Bhima, oh.. yes :) this was a some type of mind typo.
osgx
+2  A: 

If your only queries are equality tests then, its true, dictionaries are a good choice since they will do lookups in amortized O(1) time. However, if you want to extend queries to involve range checks, eg (select * from students where age > 10) then suddenly your dictionaries lose their advantage completely.. This is where tree-based indexes come in. With a tree-based index you can perform equalities and range checks in logarithmic time.

There is one problem with naive tree structures. They get unbalanced, this means that after adding certain values to the index, the tree structure become lopsided (ex like a long line) and lookups start to take O(N) time again. This can be resolved by balancing your tree. The B-Tree is one such approach which also takes advantage of systems capable of doing large blocks of I/O, and so is most appropriate for databases.

Il-Bhima
+1  A: 

You can achieve O(1) if you pre-allocate N entries an array and hash the key to this N values.

But then if you have more than N entries stored there are collision. So for each key in the array you have a list of value. So it's not exactly O(1) anymore. Scanning the list itself will be O(m) where m is the average number of collision.

Example with hash = n mod 3
0  -->  [0,a] [3,b] ...
1  -->  [1,a] [4,b] [7,b] ...
2  -->  [2,a] [5,b] ...

At a point in time, it becomes some bad that you spend more time traversing the list of value for a potential key than using another structure with O(log n) lookup time, where n is the total number of entries.

You could of course pick N so big that the array/hash would be faster than the B-Tree. But the array has a fixed pre-allocated size. So if N=1000 and you store 3 values, you've wasted 997 slots in the array.

So it's essentially a performance-space trade-off. For small set of value, array and hashing is excellent. For large set of value, B-Tree are most efficient.

ewernli
+1  A: 

Hashes are the fastest look up data structures, but have some problems:

a) are not sorted b) no matter how good the hash is, will have collisions, that becomes problematic when lots of data c) to find a hash value in a hash indexed file takes a long time, so most of the time hashes make sense only for in memory (RAM) data - which makes them not suitable for databases- that most of the time cannot fit all data in RAM

Sorted trees address these issues, and b-trees operations in particular can be implemented efficiently using files. The only drawback is they have slower lookup times as a hash structure.

No data structure is perfect in all cases, depending on estimated size of data and how you use it, one is better.

pt