views:

54

answers:

1

My question is that How database stores data and how it performs query internally.

Suppose we have following fields in our table:

  1. ID
  2. Name
  3. Age
  4. Weight
  5. Manager

and we query select * from Table1 where age>50 and weight<100

I am just curious that how it perform query internally.

What will the Node of B-Tre/B+Tree contains in this example?

+2  A: 

Hi!

The example you have chosen is one of the few cases where a single Tree can't do the job (two independent ranges).

However, the first chapter of my work-in-progress e-Book explains the inner workings of B-Tree indexes: http://use-the-index-luke.com/anatomy/

EDIT for more details why two indexes might be useful for the above example.

The above query can be supported by three possible index configurations:

  1. concatenated index on AGE and then WEIGHT (in this order).
    In case, the query would read all records WHERE AGE > 50 and then filter by WEIGHT.

  2. concatenated index on WEIGHT and then AGE (the other order).
    That goes the different way: read all records WHERE WEIGHT < 100 and then filter by AGE.

Whatever is more efficient depends on the data you have. If there are less records AGE > 50 than WEIGHT < 100 the first will be more efficient, otherwise the second. However, if you query with different values, the picture might change.

The reason that a concatenated index can't support the query well is that each index order is on one axis only. each index entry is before or after another one, but never next to it. All index entries build one chain.

A query that has two independent range queries would require two axes, not like a chain, but more like a chess board. one axis for AGE the other for WEIGHT. If that would be possible, your query would need to scan only one corner of the chess board.

However, a b-tree has only one axis, hence you must chose which criteria to use first. If you chose AGE it means that starting with AGE 50, the entire chain will be scanned until the end. Only some of the records stored at the side of the chain will also qualify for WEIGHT < 100, the other records must be read but will be discarded.

So, a long story to explain why one tree can not support a query with two independent range clauses. On the other hand, one concatenated index can do the following quite well:

WHERE age = 50 AND weight < 100
WHERE weight = 100 AND age > 50
WHERE age > 50 AND age < 70;

However, the problem arises if there are two inequality operators are used on two different columns.

So, what to do?

The third possible approach is to have two independent indexes on the two columns. That allows to have as many axes as you like (just create more indexes). However, there are a few huge problems with that. First of all, not all DB products support that. Whenever it is supported, it is a rather expansive operation. It works typically that way that each index is scanned, a bitmap index is built for each result. Those bitmap indexes are then joined to apply the AND operator. That's a lot of data munging--it is only worth the effort if each condition is not very selective for it's own, but both together are very selective.

Wan't my advice?

If your query runs in an OLTP environment: use one concatenated index. Two independent indexes are an option of last resort only. However, if you are working in an OLAP environment, you might anyways need bitmap indexes.

ps.: Indexing AGE was an exercise in my book (with solution)--especially because storing AGE is a bad practice, you should store the date of birth instead.

Markus Winand
Hi Markus,I have gone through the link of your work-in-progress e-book. You have explained the things very clearly, great job. I got idea of how indexes are stored internally but as par my original question, do we need two indexes for such kind of range queries??
ZeNo
Hi, i guess this box would be too little for my answer, so I have edited the above a 'little bit'.
Markus Winand
Thanks Markus for such neat explanation. I feel enlightened now :)
ZeNo