
I have a table with hundreds of millions of rows, with a schema like the one below.

table AA {
 id integer primary key,
 prop0 boolean not null,
 prop1 boolean not null,
 prop2 smallint not null,
 ...
}

Each "property" field (prop0, prop1, ...) has a small number of distinct values, and I usually query for "id" given conditions on the property fields. I think a bitmap index would be best for this kind of query, but PostgreSQL does not seem to support bitmap indexes.

I tried a B-tree index on each field, but according to EXPLAIN these indexes are not used.

Is there a good alternative way to do this?

(I'm using PostgreSQL 9.)

A: 

Your real problem is a bad schema design, not the index. The properties should be placed in a separate table, and your current table should link to it using a many-to-many relation.

The BIT datatype might also be of use; just check the manual.
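A minimal sketch of such a normalized layout (the table and column names are only illustrative):

create table properties (
    id    serial primary key,
    name  text not null,   -- e.g. 'prop0'
    value text not null,   -- e.g. 'true' or '3'
    unique (name, value)
);

create table aa (
    id integer primary key
    -- other non-property columns
);

-- junction table holding the many-to-many relation
create table properties_aa (
    property_id integer not null references properties (id),
    aa_id       integer not null references aa (id),
    primary key (property_id, aa_id)
);

create index properties_aa_aa_id_idx on properties_aa (aa_id);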

Frank Heikens
All properties are orthogonal to each other, so if I normalize N properties there would be N tables. Are you saying I should group the N properties into m groups, make m tables holding the permutations of the group members, and then link the AA table to these m tables to increase the cardinality of each field?
tk
No, I'm saying you need the tables "properties", "aa" and "properties_aa". The last table only holds the relations between the properties and your aa table. This table will be huge, but it can be indexed. Booleans are almost impossible to index; you only have 3 options: NULL, FALSE and TRUE. The IDs in the properties_aa table are much better candidates.
Frank Heikens
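With the hypothetical properties_aa layout sketched above, finding the ids that match several property values could look roughly like this (a relational-division style query; the specific names are assumptions):

-- find ids of aa rows that have all three requested property values
select pa.aa_id
from properties_aa pa
join properties p on p.id = pa.property_id
where (p.name, p.value) in (('prop0', 'true'),
                            ('prop1', 'false'),
                            ('prop2', '3'))
group by pa.aa_id
having count(distinct p.id) = 3;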
A: 

An index is only used if it actually speeds up the query, which is not always the case. Especially with smallish tables (say, thousands of rows), a full table scan ("seq scan" in the Postgres execution plan) might indeed be a lot faster.

How many rows did the table have when you tried the statement? What did the query look like? Maybe there are other conditions that prevent the index from being used. Did you ANALYZE the table so the statistics are up to date?
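To check that, you can refresh the statistics and look at the actual plan, for example (the column names are taken from the question; the rest is only an illustration):

analyze aa;

explain analyze
select id
from aa
where prop0 = true
  and prop1 = false
  and prop2 = 3;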

a_horse_with_no_name
As I mentioned above, the number of rows is in the hundreds of millions. I usually run 'select id from AA where prop0 = true and prop1 = false and prop2 = 3 and ...'. I analyzed the table to update the statistics.
tk
Can you post the execution plan, ideally the output of EXPLAIN ANALYZE? How are the true/false values distributed? A seq scan is usually faster than an index lookup if more than ~20% of all rows are selected, so if prop0 = true yields half of all rows, there is no benefit in using the index. If you always filter on all columns, a composite index on them will probably make more sense than one index per property.
a_horse_with_no_name
A: 

Create a multicolumn index on the properties that are always or almost always queried, or several multicolumn indexes if needed.
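For example, assuming the query from the question always filters on prop0, prop1 and prop2:

create index aa_props_idx on aa (prop0, prop1, prop2);

-- a query filtering on these columns can then use the index:
select id from aa where prop0 = true and prop1 = false and prop2 = 3;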

The alternative, when you do not almost always query the same properties, is to add a tsvector column with words describing your data, maintained by a trigger. For example

prop0=true
prop1=false
prop2=4

would be

'propzero nopropone proptwo4'::tsvector

index it using GIN and then use full text search for searching:

where tsv @@ 'propzero & nopropone & proptwo4'::tsquery
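A rough sketch of how that could be wired up (the column name tsv, the trigger function name and the exact word encoding are assumptions, not a fixed recipe):

alter table aa add column tsv tsvector;

create or replace function aa_tsv_update() returns trigger as $$
begin
    -- encode each property value as a single searchable word
    new.tsv :=
        (case when new.prop0 then 'propzero' else 'nopropzero' end || ' ' ||
         case when new.prop1 then 'propone'  else 'nopropone'  end || ' ' ||
         'proptwo' || new.prop2)::tsvector;
    return new;
end;
$$ language plpgsql;

create trigger aa_tsv_trigger
    before insert or update on aa
    for each row execute procedure aa_tsv_update();

create index aa_tsv_idx on aa using gin (tsv);

-- querying:
select id from aa
where tsv @@ 'propzero & nopropone & proptwo4'::tsquery;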
Tometzky