I've got a modest table of about 10k rows that is often sorted by a column called 'name', so I added an index on that column. Now selects ordered by name are fast:
EXPLAIN ANALYZE SELECT * FROM crm_venue ORDER BY name ASC LIMIT 10;
Limit  (cost=0.00..1.22 rows=10 width=154) (actual time=0.029..0.065 rows=10 loops=1)
  ->  Index Scan using crm_venue_name on crm_venue  (cost=0.00..1317.73 rows=10768 width=154) (actual time=0.026..0.050 rows=10 loops=1)
Total runtime: 0.130 ms
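For reference, crm_venue_name is just a plain index on the name column, something like this (the exact original statement isn't reproduced here; the index name is taken from the plan above):

CREATE INDEX crm_venue_name ON crm_venue (name);  -- assumed form of the existing plain index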
If I increase the LIMIT to 60 (which is roughly what I use in the application), the total runtime doesn't increase much further.
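In other words, the statement I actually run is just the same query with a bigger limit:

EXPLAIN ANALYZE SELECT * FROM crm_venue ORDER BY name ASC LIMIT 60;  -- 60 is roughly what the application requests; output omitted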
Since I'm using a "logical delete" pattern on this table, I only consider entries where delete_date is NULL. So this is a common select I make:
SELECT * FROM crm_venue WHERE delete_date IS NULL ORDER BY name ASC LIMIT 10;
To make that query snappy as well, I created a partial index on the name column, restricted to rows where delete_date is NULL:
CREATE INDEX name_delete_date_null ON crm_venue (name) WHERE delete_date IS NULL;
Now the ordering is also fast with the logical-delete condition in place:
EXPLAIN ANALYZE SELECT * FROM crm_venue WHERE delete_date IS NULL ORDER BY name ASC LIMIT 10;
Limit  (cost=0.00..84.93 rows=10 width=154) (actual time=0.020..0.039 rows=10 loops=1)
  ->  Index Scan using name_delete_date_null on crm_venue  (cost=0.00..458.62 rows=54 width=154) (actual time=0.018..0.033 rows=10 loops=1)
Total runtime: 0.076 ms
Awesome! But this is where I get myself into trouble. The application rarely calls for just the first 10 rows. So, let's select some more rows:
EXPLAIN ANALYZE SELECT * FROM crm_venue WHERE delete_date IS NULL ORDER BY name ASC LIMIT 20;
Limit  (cost=135.81..135.86 rows=20 width=154) (actual time=18.171..18.189 rows=20 loops=1)
  ->  Sort  (cost=135.81..135.94 rows=54 width=154) (actual time=18.168..18.173 rows=20 loops=1)
        Sort Key: name
        Sort Method: top-N heapsort  Memory: 21kB
        ->  Bitmap Heap Scan on crm_venue  (cost=4.67..134.37 rows=54 width=154) (actual time=2.355..8.126 rows=10768 loops=1)
              Recheck Cond: (delete_date IS NULL)
              ->  Bitmap Index Scan on crm_venue_delete_date_null_idx  (cost=0.00..4.66 rows=54 width=0) (actual time=2.270..2.270 rows=10768 loops=1)
                    Index Cond: (delete_date IS NULL)
Total runtime: 18.278 ms
As you can see, the runtime goes from 0.1 ms to 18 ms!
Clearly there's a point at which the ordering can no longer use the index to run the sort. I noticed that as I increase the LIMIT from 20 to higher values, the query always takes around 20-25 ms.
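The statements I'm timing there are simply the same query with a larger limit, for example:

EXPLAIN ANALYZE SELECT * FROM crm_venue WHERE delete_date IS NULL ORDER BY name ASC LIMIT 100;  -- 100 is just an example value; anything above 20 lands around 20-25 ms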
Am I doing something wrong, or is this a limitation of PostgreSQL? What is the best way to set up indexes for this type of query?