views: 80
answers: 4

I am developing a little data warehouse system with a web interface where people can do filtered searches. There are currently about 50 columns that people may wish to filter on, and about 2.5 million rows. A table scan is painfully slow. The trouble is that the queries I'm getting have no common prefixes.

Right now I'm using sqlite3, which will only use an index if the columns required are the leftmost columns in that index. This seems to mean I'd need a lot of indexes. A quick glance at MySQL suggests it would also require many indexes for this kind of query.
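
To illustrate the leftmost-prefix rule, a minimal sketch with hypothetical table and column names (warehouse, a, b, c):

    -- A three-column index on the big table.
    CREATE INDEX idx_abc ON warehouse (a, b, c);

    -- These can use idx_abc: the filters cover a leftmost prefix.
    SELECT * FROM warehouse WHERE a = 1;
    SELECT * FROM warehouse WHERE a = 1 AND b = 2;

    -- This cannot: b and c alone are not a leftmost prefix of the
    -- index, so SQLite falls back to a full table scan.
    SELECT * FROM warehouse WHERE b = 2 AND c = 3;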

My question is what indexing implementations are available for different database systems which can handle this kind of query on arbitrary combinations of columns?

I've prototyped my own indexing scheme: I store extra tables which list the integer primary keys in my big table where each value for each column occurs, and I keep enough statistics to be able to examine the values with the smallest number of matches first. It works okay; much better than a table scan, but still a bit on the slow side, which is unsurprising for a first version in Python doing many SQL queries.
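
For reference, a sketch of what such a scheme might look like as SQLite tables; the names (idx_color, idx_stats) and the TEXT value type are assumptions, not the actual schema from the question:

    -- One inverted-index table per filterable column: it maps each
    -- distinct value to the primary keys of the big-table rows where
    -- that value occurs.
    CREATE TABLE idx_color (
        value  TEXT    NOT NULL,
        row_id INTEGER NOT NULL,  -- primary key in the big table
        PRIMARY KEY (value, row_id)
    );

    -- Per-value match counts, so the hand-rolled planner can start
    -- with the most selective column first.
    CREATE TABLE idx_stats (
        column_name TEXT    NOT NULL,
        value       TEXT    NOT NULL,
        match_count INTEGER NOT NULL,
        PRIMARY KEY (column_name, value)
    );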

A: 

One should only consider introducing "home grown" index structures, based on SQL tables, as a last resort, i.e. if there still exist [business-wise plausible] query cases not properly handled by a traditional index setup, or if the list of such indexes were to become too big, etc.

A few observations
You do not necessarily need indexes that include all of the columns that may be involved in one particular query; only the [collectively] selective ones may be required.

In other words, if a query uses, for example, columns a, b, c and d, but an index on a and b exists and, statistically, yields only a few thousand rows, it may be acceptable not to introduce indexes on a, b and c (or a, b and d, or both), if c and d are not very plausible search criteria on their own (used infrequently), and if their width is such that they would unduly burden the a+b index (or if there were other columns with a better fit for being "tacked on" to the a+b index).
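
Concretely, with the same hypothetical names as above:

    CREATE INDEX idx_ab ON warehouse (a, b);

    -- The a+b filter narrows 2.5 million rows to a few thousand via
    -- idx_ab; c and d are then checked by fetching just those rows.
    SELECT * FROM warehouse
    WHERE a = 1 AND b = 2 AND c = 3 AND d = 4;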

Aside from the obvious additional demand they put on disk storage, additional indexes, while possibly helping with SELECT (read) operations, may also become an impediment to CUD (Create/Update/Delete) operations. The context here appears to be akin to a data warehouse, where few [unscheduled] CUD operations take place, but it is good to keep this in mind.

See SQLite Optimizer for valuable insight into the way SQLite determines the way a particular query is executed.
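
In particular, EXPLAIN QUERY PLAN shows which index, if any, SQLite picks for a given query (the exact output wording varies by SQLite version):

    EXPLAIN QUERY PLAN
    SELECT * FROM warehouse WHERE a = 1 AND b = 2;
    -- e.g. "SEARCH TABLE warehouse USING INDEX idx_ab (a=? AND b=?)"
    -- versus "SCAN TABLE warehouse" when no usable index exists.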

Making a list of indexes
A tentative basis for the index scheme for this application may look like this (sketched in SQL after the list):

  • [A] A single-column index for every column in the table (save maybe the ones which are ridiculously unselective, say a "Married" column with "Y/N" values in it)
  • [B] A two- (or three-) column index for each of the likely/common query use cases
  • [C] Additional two/three-column indexes for the cases where some non-common query case involves a set of columns none of which is individually selective.
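
As a sketch, with made-up column names standing in for the real ones:

    -- [A] single-column indexes
    CREATE INDEX idx_region   ON warehouse (region);
    CREATE INDEX idx_customer ON warehouse (customer_id);

    -- [B] a two/three-column index per common query pattern
    CREATE INDEX idx_region_date ON warehouse (region, order_date);

    -- [C] a multi-column index where no single column is selective
    CREATE INDEX idx_status_kind_flag ON warehouse (status, kind, flag);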

From this basis, we can then define the actual list of indexes needed by:

  • Adding one (or a few) extra columns, in a well-thought-out order, at the end of the [B] indexes above. Typically such columns are chosen for their relatively small width (so they do not grow the index unduly) and because they have a reasonable chance of being used in combination with the columns cited before them in the index.
  • Removing the [A] indexes which are made generally redundant by one or several [B] indexes: that is, [B] indexes which start with the same column, and whose extra columns do not burden the index much (see the sketch after this list).
  • Reviewing the TREE of all possible (or all acceptable) query cases, and marking off the branches adequately served by the indexes above; then adding yet more indexes for the odd use cases not readily covered (if only with a partial index scan + main-table lookup for an acceptable number of rows).
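
For example, the first two steps might turn the sketch above into (again with made-up names):

    -- Tack a narrow, plausibly co-queried column onto a [B] index...
    DROP INDEX idx_region_date;
    CREATE INDEX idx_region_date_status
        ON warehouse (region, order_date, status);

    -- ...and drop the [A] index it subsumes (same leading column,
    -- and the extra columns do not burden the index much).
    DROP INDEX idx_region;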

In this situation, I find a hand-written tree structure a useful tool to help manage the otherwise unmanageable list of possible combinations. Assuming a maximum of 4 search criteria selected from the 50 columns indicated in the question, there are in excess of 230,000 combinations to consider (50 choose 4 alone is 230,300)... The tree helps prune this rather quickly.

mjv
This is a reasonable approach, once I have a good idea of what query types people are going to try. However, with a significant (and growing) number of columns, and with the kinds of queries that people do changing over time, I think I'd spend a lot of time updating the indexing scheme, and building indexes. A table scan currently takes over a day, so I can't react quickly to someone grumbling about slow queries.
Dickon Reed
Wow... "A table scan takes a day" Even accounting for a bit of exaggeration, this is way, way too slow. Maybe time to consider another DBMS altogether?
mjv
To be more precise: rebuilding the whole warehouse cached table from the logs takes over a day; a table scan matching on one column takes about 100 seconds. The rebuild works out at less than 50ms a row. Of course, given the main table, I should be able to build indices much more quickly than that.
Dickon Reed
A: 

Since data warehouses are typically optimized for reading data, not writing it, I would consider simply indexing all the columns. Yes, this will slow down putting data into the warehouse, but typically that happens during non-peak hours and only once a day or less often.
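
A minimal sketch of that approach, with hypothetical column names; in practice the indexes would be dropped before the daily bulk load and recreated afterwards, since loading into an unindexed table is much faster:

    CREATE INDEX idx_region   ON warehouse (region);
    CREATE INDEX idx_customer ON warehouse (customer_id);
    CREATE INDEX idx_date     ON warehouse (order_date);
    -- ...one per filterable column...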

HLGEM
That'd be fine if the database system actually used more than one index for a single table select, which I've not yet persuaded sqlite to try. Otherwise, using a single column index might get me down a few hundred thousand rows, and scanning each possible row is still slow.
Dickon Reed
Tried that, but the leftmost prefix rule means a single index with every column won't work most of the time, and one index per column will still mean examining a lot of rows.
Dickon Reed
+1  A: 

When there are too many indexes to create for a table, I usually fall back on Full Text Search. Can't say if it will fit your scenario, though.
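
For instance, with SQLite this could be prototyped via the FTS5 extension; a sketch, assuming the filterable values can be flattened into text, and using made-up column names and values:

    -- Mirror the filterable columns into a full-text index.
    CREATE VIRTUAL TABLE warehouse_fts USING fts5(region, status, customer);

    INSERT INTO warehouse_fts (rowid, region, status, customer)
    SELECT id, region, status, customer FROM warehouse;

    -- An arbitrary combination of filters becomes one MATCH query.
    SELECT rowid FROM warehouse_fts
    WHERE warehouse_fts MATCH 'region:emea AND status:open';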

Nicolas Buduroi
Interesting idea, but I _think_ the best this is going to do is reduce to the equivalent of a set of single-column indexes, which may still match a huge number of rows.
Dickon Reed
Hmm, I can see a really good indexed text search may end up being good enough. Next I need to survey the text search implementations :)
Dickon Reed
+2  A: 

There are column-oriented databases that store data on a per-column basis, where every column is effectively its own index. They are a very good fit for data warehouses as they are extremely fast to read, but fairly slow to update.

Kickfire is such an example, which is a customized MySQL engine and has held the TPC-H benchmark top crown for a number of weeks, at an impressive system cost. Note that Kickfire is an appliance, sold as a hardware box.

Infobright would be another similar example, and has a free community edition that runs on Windows and Linux.

Remus Rusanu
It's worth looking into this, to see if it addresses your problem. 50 potential indexes is just Big.
Philip Kelley
Yes, 50 indexes is big, but we are looking at more like 50! indexes which is ridiculous :)
Dickon Reed