ansaurus

Question

IN statement performance in PostgreSQL (and in general)

Answer 1

+6 A:

When the number of records returned by the subquery is small, and the resulting number of rows returned by the main query is also small, you'll just get fast index lookups on each. As the percentage of the data returned increases, eventually each of the two will switch to using a sequential scan instead of an indexed one, to grab the whole table in one gulp rather than piece it together. It isn't a simple fall-off in performance that's either linear or exponential; there are major discontinuities as the type of plan changes. And the number of rows at which those happen depends on the size of the tables, so no useful rules of thumb for you there either. You should build a simulation like I'm doing below and see what happens on your own data set to get an idea what the curve looks like.

Here's an example of how that works using a PostgreSQL 9.0 database loaded with the Dell Store 2 database. Once there's 1000 rows being returned by the subquery, it's doing a full table scan of the main table. And once the subquery is considering 10,000 records, that turns into a full table scan too. These were each run twice, so you're seeing the cached performance. How performance changes based on cached vs. uncached status is a whole 'nother topic altogether:

dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN 
  (SELECT customerid FROM orders WHERE orderid<2);
Nested Loop  (cost=8.27..16.56 rows=1 width=268) (actual time=0.051..0.060 rows=1 loops=1)
  ->  HashAggregate  (cost=8.27..8.28 rows=1 width=4) (actual time=0.028..0.030 rows=1 loops=1)
        ->  Index Scan using orders_pkey on orders  (cost=0.00..8.27 rows=1 width=4) (actual time=0.011..0.015 rows=1 loops=1)
              Index Cond: (orderid < 2)
  ->  Index Scan using customers_pkey on customers  (cost=0.00..8.27 rows=1 width=268) (actual time=0.013..0.016 rows=1 loops=1)
        Index Cond: (customers.customerid = orders.customerid)
Total runtime: 0.191 ms

dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN 
  (SELECT customerid FROM orders WHERE orderid<100);
Nested Loop  (cost=10.25..443.14 rows=100 width=268) (actual time=0.488..2.591 rows=98 loops=1)
  ->  HashAggregate  (cost=10.25..11.00 rows=75 width=4) (actual time=0.464..0.661 rows=98 loops=1)
        ->  Index Scan using orders_pkey on orders  (cost=0.00..10.00 rows=100 width=4) (actual time=0.019..0.218 rows=99 loops=1)
              Index Cond: (orderid < 100)
  ->  Index Scan using customers_pkey on customers  (cost=0.00..5.75 rows=1 width=268) (actual time=0.009..0.011 rows=1 loops=98)
        Index Cond: (customers.customerid = orders.customerid)
Total runtime: 2.868 ms

dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN 
  (SELECT customerid FROM orders WHERE orderid<1000);
Hash Semi Join  (cost=54.25..800.13 rows=1000 width=268) (actual time=4.574..80.319 rows=978 loops=1)
  Hash Cond: (customers.customerid = orders.customerid)
  ->  Seq Scan on customers  (cost=0.00..676.00 rows=20000 width=268) (actual time=0.007..33.665 rows=20000 loops=1)
  ->  Hash  (cost=41.75..41.75 rows=1000 width=4) (actual time=4.502..4.502 rows=999 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 24kB
        ->  Index Scan using orders_pkey on orders  (cost=0.00..41.75 rows=1000 width=4) (actual time=0.056..2.487 rows=999 loops=1)
              Index Cond: (orderid < 1000)
Total runtime: 82.024 ms

dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN 
  (SELECT customerid FROM orders WHERE orderid<10000);
Hash Join  (cost=443.68..1444.68 rows=8996 width=268) (actual time=79.576..157.159 rows=7895 loops=1)
  Hash Cond: (customers.customerid = orders.customerid)
  ->  Seq Scan on customers  (cost=0.00..676.00 rows=20000 width=268) (actual time=0.007..27.085 rows=20000 loops=1)
  ->  Hash  (cost=349.97..349.97 rows=7497 width=4) (actual time=79.532..79.532 rows=7895 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 186kB
        ->  HashAggregate  (cost=275.00..349.97 rows=7497 width=4) (actual time=45.130..62.227 rows=7895 loops=1)
              ->  Seq Scan on orders  (cost=0.00..250.00 rows=10000 width=4) (actual time=0.008..20.979 rows=9999 loops=1)
                    Filter: (orderid < 10000)
Total runtime: 167.882 ms

Greg Smith 2010-06-09 02:30:26

ansaurus

tags:

views:

answers:

IN statement performance in PostgreSQL (and in general)

related questions