Is an index not similar to a dictionary? If you have the key, you can immediately access it?
Apparently indexes are sometimes stored as B-Trees... why is that?
Is an index not similar to a dictionary? If you have the key, you can immediately access it?
Apparently indexes are sometimes stored as B-Trees... why is that?
Database Indices are usually (almost always) stored as B-Trees. And all balanced tree structures have O(log n) complexity for searching.
'Dictionary' is an 'Abstract Data Type' (ADT), ie it is a functional description that does not designate an implementation. Some dictionaries could use a Hashtable for O(1) lookup, others could use a tree and achieve O(log n).
The main reason a DB uses B-trees (over any other kind of tree) is that B-trees are self-balancing and are very 'shallow' (requiring little disk I/O)
One of the only data structures you can access immediately with a key is a vector, which for a massive amount of data, becomes inefficient when you start inserting and removing elements. It also needs contiguous memory allocation.
A hash can be efficient but needs more space and will end up with collisions.
A B tree has a good balance between search performance and space.
Dictionaries are not implicitly sorted, B-Tree
s are.
A B-Tree
index can be used for ranged access, like this:
WHERE col1 BETWEEN value1 AND value2
or ordering, like this:
ORDER BY col1
You cannot immediately access a page in a B-Tree
index: you need to traverse the child pages whose number increases logarithmically.
Some databases support dictionary-type indexes as well, namely, HASH
indexes, in which case the search time is constant. But such indexes cannot be used for ranged access or ordering.
hashindex (eg. in mysql and postgres) has constant complexity (O(1)) for search.
CREATE INDEX ... USING HASH
If your only queries are equality tests then, its true, dictionaries are a good choice since they will do lookups in amortized O(1) time. However, if you want to extend queries to involve range checks, eg (select * from students where age > 10
) then suddenly your dictionaries lose their advantage completely.. This is where tree-based indexes come in. With a tree-based index you can perform equalities and range checks in logarithmic time.
There is one problem with naive tree structures. They get unbalanced, this means that after adding certain values to the index, the tree structure become lopsided (ex like a long line) and lookups start to take O(N) time again. This can be resolved by balancing your tree. The B-Tree is one such approach which also takes advantage of systems capable of doing large blocks of I/O, and so is most appropriate for databases.
You can achieve O(1)
if you pre-allocate N entries an array and hash the key to this N values.
But then if you have more than N entries stored there are collision. So for each key in the array you have a list of value. So it's not exactly O(1)
anymore. Scanning the list itself will be O(m)
where m is the average number of collision.
Example with hash = n mod 3
0 --> [0,a] [3,b] ...
1 --> [1,a] [4,b] [7,b] ...
2 --> [2,a] [5,b] ...
At a point in time, it becomes some bad that you spend more time traversing the list of value for a potential key than using another structure with O(log n)
lookup time, where n is the total number of entries.
You could of course pick N so big that the array/hash would be faster than the B-Tree. But the array has a fixed pre-allocated size. So if N=1000 and you store 3 values, you've wasted 997 slots in the array.
So it's essentially a performance-space trade-off. For small set of value, array
and hashing is excellent. For large set of value, B-Tree
are most efficient.
Hashes are the fastest look up data structures, but have some problems:
a) are not sorted b) no matter how good the hash is, will have collisions, that becomes problematic when lots of data c) to find a hash value in a hash indexed file takes a long time, so most of the time hashes make sense only for in memory (RAM) data - which makes them not suitable for databases- that most of the time cannot fit all data in RAM
Sorted trees address these issues, and b-trees operations in particular can be implemented efficiently using files. The only drawback is they have slower lookup times as a hash structure.
No data structure is perfect in all cases, depending on estimated size of data and how you use it, one is better.