I want to better understand index organization. Imagine we have a table with two columns:

CREATE TABLE user (
  name varchar(100),
  age int
)

We would like to create an index:

CREATE INDEX IDX_MultiColIdx on user(name,age)

What would the organization of the B-Tree index look like?

In the case of a single column, say age, the organization is clear: every non-leaf node contains a set of integer keys that are used for the search. Which values do the nodes of our *IDX_MultiColIdx* B-Tree index contain?

+2  A: 

Which values do the nodes of our *IDX_MultiColIdx* B-Tree index contain?

The values of name and age, plus the row pointer (RID/ROWID or the clustered key, depending on the table organization), sorted lexicographically.
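
For illustration, here is a minimal sketch (in Python, with made-up names, ages and row pointers) of how such composite entries order lexicographically:

# Hypothetical index entries: (name, age, row_pointer).
# Python tuples compare element by element, which mirrors how a
# composite (name, age) B-Tree key orders its entries.
entries = [
    ("alice", 30, 17),
    ("alice", 25, 42),
    ("bob",   25, 3),
    ("alice", 30, 8),
]
entries.sort()          # by name first, then age, then row pointer
for name, age, rid in entries:
    print(name, age, rid)
# alice 25 42
# alice 30 8
# alice 30 17
# bob 25 3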

How exactly they are stored depends on the data type and the database system.

Usually, CHAR is stored right-padded with spaces up to its size, while VARCHAR is prepended with its length.

MyISAM and some other engines can use key compression: the matching parts of a set of keys are stored only once, and the remaining keys store only the differing parts, so that this list:

Hamblin
Hamblin, California
Hamblin (surname)
Hambling Baronets
Hambly
Hambly Arena    
Hambly Arena Fire
Hambo
Hambo Lama Itigelov
Hambok
Hambone

will be stored as:

Hamblin
[7], California
[7] (surname)
[7]g Baronets
Hambly
[6] Arena   
[6] Arena Fire
Hambo
[5] Lama Itigelov
[5]k
[5]ne

Here, [x] means "take the leading x characters from the previous key".
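
For illustration, a minimal sketch of this kind of prefix compression; the (shared_length, suffix) representation below is made up for the example, not any engine's actual on-disk format:

def compress(keys):
    """Store each key as (shared, suffix): the number of leading characters
    it has in common with the previous key, plus the differing tail."""
    out, prev = [], ""
    for key in keys:
        shared = 0
        while shared < min(len(prev), len(key)) and prev[shared] == key[shared]:
            shared += 1
        out.append((shared, key[shared:]))
        prev = key
    return out

def decompress(compressed):
    """Rebuild the full keys from the (shared, suffix) pairs."""
    keys, prev = [], ""
    for shared, suffix in compressed:
        prev = prev[:shared] + suffix
        keys.append(prev)
    return keys

keys = ["Hamblin", "Hamblin, California", "Hambly", "Hambo", "Hambone"]
packed = compress(keys)
print(packed)   # [(0, 'Hamblin'), (7, ', California'), (5, 'y'), (4, 'o'), (5, 'ne')]
assert decompress(packed) == keys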

Quassnoi
You would have to ignore the length when inserting or searching for keys, since otherwise inequalities like `... where colm > 'xyz'` would become very inefficient.
paxdiablo
@paxdiablo: what do you mean by "ignore the length"?
Quassnoi
It was to do with your "while VARCHAR is prepended with its length" comment. I mean, if you're using "<len><name><age>" as the comparison key, it will be primarily in "length of name" order not "name" order. In other words, it will sort `a,b,c,aa,ff,bbbbb` instead of `a,aa,b,bbbbb,c,ff`. It doesn't really matter where you store the length (as long as you can find it within the key), just don't use the length to sort/compare.
paxdiablo
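
For illustration, a small sketch of the ordering problem being described here: comparing on the raw `<len><name>` bytes sorts primarily by length, while the comparison the index needs ignores the length prefix (the byte layout is only an assumption for the example):

names = ["a", "b", "c", "aa", "ff", "bbbbb"]
# Naive comparison over "<len><name>" bytes: the length dominates the order.
print(sorted(names, key=lambda s: bytes([len(s)]) + s.encode()))
# ['a', 'b', 'c', 'aa', 'ff', 'bbbbb']
# Comparison on the value itself, ignoring the stored length:
print(sorted(names))
# ['a', 'aa', 'b', 'bbbbb', 'c', 'ff']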
@paxdiablo: actually, comparison functions can be a little bit more complex than `memcmp`. And how do you know where the column ends if you don't store its length?
Quassnoi
You can't know where the column ends without a length up front unless you use fixed length fields (i.e., pad out the name to its max length). You can still _use_ variable length keys, you just have to make sure the length is not factored into the index ordering (lest inequalities become a performance killer). The trade-off is between speed and space (as are most trade-offs in IT). Variable length keys complicate the search process within a node. You either have to search sequentially or maintain two parts, a set of fixed length offsets to the variable parts and the variable parts themselves.
paxdiablo
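
For illustration, a minimal sketch of the second option mentioned above: a toy node that keeps a fixed-width (start, length) slot array pointing into a buffer of variable-length keys, so the node can be binary-searched instead of scanned sequentially. The layout is an assumption for the example, not any engine's actual page format.

keys = ["Hamblin", "Hambly", "Hambo", "Hambone"]   # already in index order
buf, slots = "", []
for k in keys:
    slots.append((len(buf), len(k)))   # fixed-width (start, length) slot per key
    buf += k                           # variable-length key bytes packed together

def search(target):
    """Return the position of the first key >= target, by binary search over the slot array."""
    lo, hi = 0, len(slots)
    while lo < hi:
        mid = (lo + hi) // 2
        start, length = slots[mid]
        if buf[start:start + length] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

print(keys[search("Hambo")])   # Hambo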
Don't misunderstand. There's nothing wrong with your answer, it's just you have to ensure queries are still efficient for inequalities as well. That can be done within the context of your answer, I was just pointing out that it should be.
paxdiablo
@paxdiablo: I still don't get your point. As soon as the index is built, the order is defined by a node's position in the index. Traversal can be done without even invoking the comparison function: if a node comes later, it is greater. The comparison function will only be invoked `O(log(n))` times, to determine the start and end nodes of the range. How do you think computed indexes work? They may involve much more complex functions than comparing two length-prefixed strings, yet queries using them show almost the same performance.
Quassnoi
Traversing a tree can be done without a comparison function if you're talking about going from start to finish. I'm talking about finding a specific key which requires the comparison function to decide whether to move left or right. A tree can only have one order which should be the name/age pair in this case. What I was stating was that, if your comparison function uses name-length/name/age, the sort order will be _wrong_. You will still be able to find a specific key using equality (since you know both the name-length and name) but inequality will not be easy, since the order will be wrong.
paxdiablo
In other words, you will be able to find the start key but traversal from there will involve going all the way to the end, since it will be a mix of matching and non-matching keys in a name-length/name/age order (see my second comment). If you're looking for all keys starting with `b`, you start at `b` but you can't stop at `c` since there's a `bbbbb` later on (as per the first sorted list in that comment).
paxdiablo
@paxdiablo: if I am looking for all keys starting with `b`, the engine locates the first item starting with `b`, the first item starting with `c` and traverses all keys from the former to the latter (not returning the latter, of course). The comparison function is not ever called in between. In your example, why do you think the comparison function would sort `bbbbb` after `ff`? The comparison function is not a dumb `memcmp`. Believe it or not, the developers of the major `RDBMS` are smart enough to develop a comparison function that would sort `ff` after `bbbbb`.
Quassnoi
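
For illustration, a sketch of that two-boundary range scan over a sorted list of keys, with Python's `bisect` standing in for the two B-Tree descents; no key comparisons happen during the traversal between the boundaries:

from bisect import bisect_left

keys = ["a", "aa", "b", "bbbbb", "c", "ff"]   # leaf-level keys in index order
start = bisect_left(keys, "b")   # first key >= 'b'
end   = bisect_left(keys, "c")   # first key >= 'c' (exclusive upper bound)
print(keys[start:end])           # ['b', 'bbbbb']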
I see now that we were saying the same thing in different ways. I was stating that the comparison function had to take into account that it needed to sort based on the _name_, not the "block of memory length/name/age". You were saying that it wasn't a simple memcmp (same thing). Variable length keys are _still_ less efficient (timewise, not spacewise) because you cannot use a simple binary search on the node itself. You have to sequentially search within the node taking into account the length, or have a set of fixed length offsets at the start to indirectly binary-search.
paxdiablo
You seem to think it's better to find `b` and `c` independently and just do an in-order traversal without checking the comparisons. That's an interesting approach and may actually be faster than simply finding `b` and completing in-order traversal using the comparison function until you find a key greater. I'm not sure about that - I'd have to think about whether two O(log n) searches would be better than one where you're already traversing especially since the former still requires some sort of comparison, albeit possibly a cheaper one (e.g., record number) rather than a key comparison.
paxdiablo
I have to stand by my contention that I'd rather be time-efficient than space-efficient but everything else I think we agree on and I'm not _entirely_ certain that your key-difference method _would_ be slower, especially if it's limited to within the node. That would certainly save some space and, provided a single node didn't have a massive number of keys, the speed differences may well be negligible. So +1 for that.
paxdiablo
@paxdiablo: in most engines, fixed-length offsets are stored in the header of the data block anyway. Sequential search is only unavoidable if key compression is used (i.e., you cannot tell the key value from the record alone). You may find this article interesting: http://explainextended.com/2010/02/04/index-search-time-depends-on-the-value-being-searched/. Note, however, that uncompressed keys imply a larger table with more data blocks. The additional disk `I/O` can eliminate all benefit from the binary search. That's why key compression was implemented in the first place.
Quassnoi
@paxdiablo: Actually, finding two independent nodes is exactly how it works in most engines. And in my answer, I didn't make up any approaches: I just described how it was actually done by other people (since I believe that's what was asked).
Quassnoi
+1  A: 

I assume you're asking about the internal database implementation because you mention 'non-leaf nodes'.

The interior nodes in a b-tree do not need to store the full key; they only need to store separator keys. Prefix and suffix compression mean that interior nodes can be very dense, which reduces the height of the b-tree and therefore improves overall performance.

For example, given an index with the adjacent keys <'A very long string', 314159> and <'Not the same string', 9348>, all the interior node needs to represent is the separation between those keys, which can be done with a single character. Similarly, when the keys to be separated in the interior node share a common prefix, that prefix need only be stored once, and only the point where they diverge needs to be represented.
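
For illustration, a minimal sketch of deriving such a separator by truncating the right-hand key just past the point where it diverges from the left-hand key (one simple strategy; real engines have their own rules):

def shortest_separator(left, right):
    """Return the shortest prefix s of right such that left < s <= right."""
    assert left < right
    for i, ch in enumerate(right):
        if i >= len(left) or ch != left[i]:
            return right[:i + 1]
    return right

print(shortest_separator("A very long string", "Not the same string"))   # 'N'
print(shortest_separator("Hamblin, California", "Hambling Baronets"))    # 'Hambling'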

The leaf nodes need to store the full key values, and can be stored in a linked list for key order traversal. Leaf node pages can be compressed by using prefix compression or other techniques to further reduce the tree height.

For a good reference on this, see "Transaction Processing: Concepts and Techniques" by Gray & Reuter, and follow the references if you want more detail.

janm