views:

339

answers:

2

I'm planning software that's an OLAP application at its heart (it helps analyse metering data) and is going to have some kind of star schema for its database, because the stored values will be looked at from different angles (time, source, type etc.) and the requests will be asking for aggregated data along these dimensions. The queries tend to deliver a lot of rows (up to some 100 000).

My research on this topic (see also my question here) seems to indicate that bitmap indices are a good way to search for data the way I'm planning to. However, I want to support multiple db engines, some of which do not offer bitmap indices on their tables (in particular, MySQL).

Now, I can certainly build and maintain my own bitmap index and use it to look for row ids pointing to the fact table. However, I suspect that this is going to defeat the whole purpose of the index, because the database is still going to search for row ids in a B-Tree. Could somebody with more profound theoretical background or more experience tell me if I still gain anything, like not having to do slow JOINs on the dimension tables?

I would also appreciate hints on what I have to evaluate if the answer is not straightforward.

+1  A: 

Some DB engines that do not directly support bitmap indexes still have star optimisations that can do this type of query without hitting the fact table. SQL Server, for instance has a feature called Index Intersection that does something similar by constructing bitmaps on the fly to do the resolution. Microsoft claims that the performance of this is comparable to bitmap indexes. See This posting for a bit of fan-out on this topic.

I'm not sure off the top of my head if MySQL does this, but Postgresql certainly does. IIRC some of the variants (Greenplum, I think) also directly support bitmap indexes and there was some talk of incorporating it in the main DB engine. I don't recall if this has been done yet.

I think you will find that most modern DBMS platforms offer star query optimisations of one sort or another, so you probably don't need to re-invent the wheel. You may find one or two that cannot do this, but you always have the option of just not supporting them.

ConcernedOfTunbridgeWells
+1  A: 

I've had good luck with bitmap indices when manipulating a lot of data in memory using custom data structures, but they are kind of awkward to implement over a third-party database that doesn't have a good (postgresql-like) API for extending their index structures.

In general since you will be searching through a B-Tree index anyways you won't gain anything if my experience is any guide.

So, no.

If your application is inherently OLAP in nature and you have a small number of dimensions which naturally group into ordered ranges, and you really need to change the asymptotics of your problem, you might consider building a 'sum table' like structure then you can query it for any hierarchical answer with 2^d operations, and you can amortize that if you are doing a number of related queries.

An example in 2d with coordinates x and y, where you are interested in the sum over a range from (x1,y1) to (x2,y2).

Stored separately you'd have to sum a number of entries proportional to the area.

Using a sumtable, for each position (x,y) don't store the value of that position, but instead store the sum the region from (0,0) to (x,y).

Then you can answer any range query by asking:

sum(x2,y2) - sum(x1,y2) - sum(x2,y1) + sum(x1,y1)

a constant amount of overhead (well, logarithmic in the dataset size, assuming you have an index on x and y and are storing it in SQL)

This of course breaks down if you have complicated attributes that don't break down into ranges, but can handle simple lexicographical indexes, dates, etc.

Edward Kmett