tags:
views: 205
answers: 5

Consider the following PostgreSQL statement:

SELECT DISTINCT some_field 
  FROM some_table 
  WHERE some_field LIKE 'text%' 
  LIMIT 10;

Consider also that some_table contains several million records, and that some_field has a b-tree index.

Why does the query take so long to execute (several minutes)? What I mean is: why doesn't it loop through building the result set and, once it has 10 rows, return the result? The execution time appears to be the same whether or not you include the LIMIT 10.

Is this correct, or am I missing something? Is there anything I can do to get it to return the first 10 results and 'screw' the rest?

UPDATE: If you drop the DISTINCT, the results are returned virtually instantaneously. I do know, however, that many of the some_table records are fairly unique already, and certainly when I run the query without the DISTINCT declaration, the first 10 results are in fact unique. I also eliminated the WHERE clause (ruling it out as a factor). So my original question remains: why isn't it terminating as soon as 10 matches are found?

A: 

I'm suspicious it's because you don't have an ORDER BY. Without ordering, you might have to cruise a whole lot of records to get 10 that meet your criterion.

Charlie Martin
I would think not having an ORDER BY would speed things up. With an ORDER BY, the database needs to return the ten "lowest" rows, which involves sorting all rows (or clever use of an index on the sort column). Without one, it just needs to return the first ten (distinct) rows it finds.
Thilo
This isn't necessarily true. I believe that this is a new feature in Postgres 8.2 or 8.3, for example. Other DBMSs will probably differ in their support for this optimization.
Dana the Sane
I think the DISTINCT answer's right anyway. That *guarantees* you need to scan lots of rows, whereas a random order only means there's a certain probability of needing to scan lots of rows.
Charlie Martin
+7  A: 

You have a DISTINCT. This means that to find 10 distinct rows, it's necessary to scan all rows that match the predicate until 10 different values of some_field are found.

Depending on your indices, the query optimizer may decide that scanning all rows is the best way to do this.

10 distinct rows could represent 10, a million, or an infinite number of non-distinct rows.
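To illustrate, here is a sketch of the kind of plan Postgres might produce for this query. The node names (Limit, Unique, Sort, Seq Scan) are what Postgres typically shows, but the shape of the plan depends on your data and version, and the costs are elided:

```sql
EXPLAIN
SELECT DISTINCT some_field
  FROM some_table
  WHERE some_field LIKE 'text%'
  LIMIT 10;

-- Hypothetical output: the LIMIT sits on top, but the Sort/Unique
-- below it must consume the entire scan before producing rows.
--
--  Limit
--    ->  Unique
--          ->  Sort
--                Sort Key: some_field
--                ->  Seq Scan on some_table
--                      Filter: (some_field ~~ 'text%'::text)
```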

tpdi
+1  A: 

Any time an aggregation is involved, and "DISTINCT" certainly qualifies, the optimizer is going to do the aggregation before even thinking about what comes next. And aggregation means scanning the whole table (in your case involving a sort, unless there's an index).

But the most likely deal-breaker is that you are grouping on an operation on a column, rather than on a plain column value. The optimizer generally disregards a number of possible optimizations once you are operating on a column transformation of some kind. It's quite possibly not smart enough to know that the ordering of "LIKE 'text%'" and "= 'text'" is the same for grouping purposes.

And remember, you're doing an aggregation on an operation on a column.

le dorfier
A: 

How big is the table? Do you have any indexes on it? Check your query execution plan to determine whether it's doing a table scan, an index scan, or an index seek. If it's doing a table scan, you most likely don't have any indexes.

Try putting an index on the field you're filtering by and/or the field you're selecting.
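A sketch of what that index might look like (the index name is made up). Note that in a non-C locale a plain b-tree index can't be used for LIKE prefix matches; the text_pattern_ops operator class makes them indexable:

```sql
-- Hypothetical index supporting the LIKE 'text%' prefix filter.
CREATE INDEX some_table_some_field_idx
    ON some_table (some_field text_pattern_ops);
```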

DForck42
How do I check the query execution plan?
Ash
+3  A: 

Can you post the results of running EXPLAIN on the query? This will reveal what Postgres is doing to execute the query, and is generally the first step in diagnosing query performance problems.
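For example, run something like the following from psql. EXPLAIN shows the plan without executing the query; EXPLAIN ANALYZE actually runs it and reports real row counts and timings:

```sql
-- Plan only (does not execute the query):
EXPLAIN
SELECT DISTINCT some_field
  FROM some_table
  WHERE some_field LIKE 'text%'
  LIMIT 10;

-- Plan plus actual execution statistics (runs the query):
EXPLAIN ANALYZE
SELECT DISTINCT some_field
  FROM some_table
  WHERE some_field LIKE 'text%'
  LIMIT 10;
```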

It may be sorting, or constructing a hash table of the entire rowset, to eliminate the non-distinct records before returning the first row to the LIMIT operator. It makes sense that the engine should be able to read a fraction of the records, returning one new distinct value at a time until the LIMIT's quota of 10 is satisfied, but there may not be an operator implemented to make that work.

Is the index on some_field unique? If not, it would be useless in locating distinct records. If it is, then the DISTINCT clause is unnecessary, since that index already guarantees that each row is unique on some_field.

Chris Smith