ansaurus

Question

Why is the most natural query (i.e. using INNER JOIN instead of LEFT JOIN) very slow?

Answer 1

+3 A:

Is your query essentially the following (this is hard to ask as a comment):

select
    c.company_rec_id,
    c.the_company_code,
    c.company
from company c
where exists (
    select *
    from parameter p
    join mlist_detail_parameter mdp on mdp.parameter_rec_id = p.parameter_rec_id
    join mlist_detail md            on md.mlist_detail_rec_id = mdp.mlist_detail_rec_id
    join mlist m                    on m.mlist_rec_id = md.mlist_rec_id

    join parcel_application ord_app on ord_app.parcel_application_rec_id = m.parcel_application_rec_id
    join parcel ord                 on ord.parcel_rec_id = ord_app.parcel_rec_id

    join tlist t                    on t.mlist_rec_id = m.mlist_rec_id

    where
        ord.client_rec_id = c.company_rec_id
    and to_tsvector(extract_words(p.parameter)) @@ plainto_tsquery(extract_words('cadmium'))
)

[EDIT: 2010-07-06, added by Michael Buen]

"Hash Join  (cost=2152.94..2172.52 rows=232 width=71) (actual time=71.106..71.207 rows=84 loops=1)"
"  Hash Cond: ((c.company_rec_id)::text = (ord.client_rec_id)::text)"
"  ->  Seq Scan on company c  (cost=0.00..11.95 rows=295 width=71) (actual time=0.004..0.030 rows=295 loops=1)"
"  ->  Hash  (cost=2150.04..2150.04 rows=232 width=37) (actual time=71.077..71.077 rows=84 loops=1)"
"        ->  HashAggregate  (cost=2147.72..2150.04 rows=232 width=37) (actual time=71.033..71.040 rows=84 loops=1)"
"              ->  Nested Loop  (cost=1783.22..2146.09 rows=652 width=37) (actual time=51.029..70.187 rows=1918 loops=1)"
"                    ->  Hash Join  (cost=1783.22..1938.61 rows=652 width=111) (actual time=51.014..55.913 rows=1918 loops=1)"
"                          Hash Cond: ((ord_app.parcel_rec_id)::text = (ord.parcel_rec_id)::text)"
"                          ->  Hash Join  (cost=1665.76..1810.55 rows=652 width=111) (actual time=48.360..52.004 rows=1918 loops=1)"
"                                Hash Cond: ((ord_app.parcel_application_rec_id)::text = (m.parcel_application_rec_id)::text)"
"                                ->  Seq Scan on parcel_application ord_app  (cost=0.00..122.18 rows=3218 width=74) (actual time=0.003..1.485 rows=3218 loops=1)"
"                                ->  Hash  (cost=1657.61..1657.61 rows=652 width=111) (actual time=48.331..48.331 rows=1918 loops=1)"
"                                      ->  Hash Join  (cost=164.19..1657.61 rows=652 width=111) (actual time=4.755..46.122 rows=1918 loops=1)"
"                                            Hash Cond: ((md.mlist_rec_id)::text = (m.mlist_rec_id)::text)"
"                                            ->  Nested Loop  (cost=3.49..1485.51 rows=652 width=37) (actual time=1.638..40.974 rows=1918 loops=1)"
"                                                  ->  Hash Join  (cost=3.49..1163.33 rows=652 width=37) (actual time=1.590..18.090 rows=1918 loops=1)"
"                                                        Hash Cond: ((mdp.parameter_rec_id)::text = (p.parameter_rec_id)::text)"
"                                                        ->  Seq Scan on mlist_detail_parameter mdp  (cost=0.00..1013.87 rows=37187 width=74) (actual time=0.003..5.499 rows=37187 loops=1)"
"                                                        ->  Hash  (cost=3.48..3.48 rows=1 width=37) (actual time=1.568..1.568 rows=1 loops=1)"
"                                                              ->  Seq Scan on parameter p  (cost=0.00..3.48 rows=1 width=37) (actual time=1.324..1.564 rows=1 loops=1)"
"                                                                    Filter: (to_tsvector(regexp_replace((parameter)::text, '[\\(\\)\\!\\.\\/,\\-\\?]+'::text, ' '::text, 'g'::text)) @@ plainto_tsquery('cadmium'::text))"
"                                                  ->  Index Scan using pk_mlist_detail on mlist_detail md  (cost=0.00..0.48 rows=1 width=74) (actual time=0.011..0.011 rows=1 loops=1918)"
"                                                        Index Cond: ((md.mlist_detail_rec_id)::text = (mdp.mlist_detail_rec_id)::text)"
"                                            ->  Hash  (cost=115.31..115.31 rows=3631 width=74) (actual time=3.096..3.096 rows=3631 loops=1)"
"                                                  ->  Seq Scan on mlist m  (cost=0.00..115.31 rows=3631 width=74) (actual time=0.003..0.994 rows=3631 loops=1)"
"                          ->  Hash  (cost=78.87..78.87 rows=3087 width=74) (actual time=2.640..2.640 rows=3087 loops=1)"
"                                ->  Seq Scan on parcel ord  (cost=0.00..78.87 rows=3087 width=74) (actual time=0.004..0.876 rows=3087 loops=1)"
"                    ->  Index Scan using fki_tlist__mlist on tlist t  (cost=0.00..0.31 rows=1 width=37) (actual time=0.006..0.006 rows=1 loops=1918)"
"                          Index Cond: ((t.mlist_rec_id)::text = (m.mlist_rec_id)::text)"
"Total runtime: 71.373 ms"

Stephen Denne 2010-07-06 01:31:06

Hmm, your answer is also fast (0.06 to 0.08 seconds), but not as fast as the IN version (0.03 to 0.05 seconds). +1 nonetheless, your answer works and is semantically the same as my query here. However, I cannot use your query in my actual code: I refactored my original query so it can be used in two (or more) modules, and I need ord_app outside of the subquery because some modules do a sort of GROUP_CONCAT on ord_app. I wish I could +2 you for taking the time to deduce the intent of my code and writing an actual query for it :-)

Michael Buen 2010-07-06 01:53:28

@Michael, I'd be interested in the explain analyze of this query - can you edit this answer to include it?

Stephen Denne 2010-07-06 01:56:33

Answer 2

+3 A:

(as directed, I'm putting part of my comment in an answer as it solved the problem)

Convert the EXISTS expressions into IN expressions.

This works better in this instance because the query will now be effectively evaluated from the "inside out", starting with the query that contains your most limiting factor: the full text search lookup. That query is going to return a small set of rows that can be looked up directly against the primary key of the outer query (WHERE x IN (SELECT x ...)), as opposed to calling the "inner" query once per value of the outer query (or for all values in your original case, if I am reading it correctly). The EXISTS method here results in Nested Loops (one evaluation of one query for each value in another), whereas the IN method uses Hash Joins (a much more efficient execution method in many, if not most, cases).
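As a rough sketch of that conversion, here is the EXISTS query from Answer 1 rewritten with IN. The table and column names are taken from that query, and extract_words is the asker's own function. The asker's real IN query is not shown on this page, so treat this as an illustration rather than the exact query that was timed.

```sql
-- Sketch only: the EXISTS query from Answer 1, rewritten with IN.
-- The subquery no longer references the outer table, so it can be
-- evaluated on its own, starting from the full text search filter.
select
    c.company_rec_id,
    c.the_company_code,
    c.company
from company c
where c.company_rec_id in (
    select ord.client_rec_id
    from parameter p
    join mlist_detail_parameter mdp on mdp.parameter_rec_id = p.parameter_rec_id
    join mlist_detail md            on md.mlist_detail_rec_id = mdp.mlist_detail_rec_id
    join mlist m                    on m.mlist_rec_id = md.mlist_rec_id
    join parcel_application ord_app on ord_app.parcel_application_rec_id = m.parcel_application_rec_id
    join parcel ord                 on ord.parcel_rec_id = ord_app.parcel_rec_id
    join tlist t                    on t.mlist_rec_id = m.mlist_rec_id
    where to_tsvector(extract_words(p.parameter)) @@ plainto_tsquery(extract_words('cadmium'))
);
```

Whether the planner actually picks Hash Joins for this form depends on the PostgreSQL version and table statistics, so it is worth confirming with EXPLAIN ANALYZE.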

Notice that with the EXISTS method, there are four Nested Loops, each running at least 3,000 times. That cost adds up. While it's not a direct comparison, you can treat Nested Loops like you would FOR loops in application code: each additional level of nesting raises your big-O estimate by a power of n: O(n) to O(n^2) to O(n^3), etc.

A Hash Join works more like a map lookup: one input is loaded into a hash table, then the other input is stepped through once, probing the table for each row. This is roughly linear (O(n)). Think of nested Hash Joins as additive, so it would go O(n) to O(2n) to O(3n), etc.

Yeah, yeah, I know it's not quite the same thing, but the point is that having multiple nested loops usually indicates a slow query plan and comparing the two big-O style makes it easier to recognize, I believe.
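One way to spot this in a plan is to run EXPLAIN ANALYZE and look at the loops= figure on the inner side of each Nested Loop. The sketch below borrows two tables and their join condition from Answer 1 purely for illustration. enable_nestloop is a standard PostgreSQL planner setting; switching it off here is only a diagnostic for comparing plans in the current session, not a suggested fix.

```sql
-- Diagnostic sketch: compare join strategies for the same query.

-- 1. Plan chosen by default; check "loops=" under any Nested Loop node.
explain analyze
select count(*)
from mlist_detail_parameter mdp
join mlist_detail md on md.mlist_detail_rec_id = mdp.mlist_detail_rec_id;

-- 2. Discourage nested loops for this session only, then compare timings.
set enable_nestloop = off;

explain analyze
select count(*)
from mlist_detail_parameter mdp
join mlist_detail md on md.mlist_detail_rec_id = mdp.mlist_detail_rec_id;

-- 3. Restore the default planner behaviour.
reset enable_nestloop;
```

If the second plan is markedly faster, that suggests the nested loops were the expensive part, and rewriting the query (for example with IN) is likely to help.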

Nested Loops and EXISTS are not evil, per se, but for most cases where there is a base filter condition that ultimately affects everything (for example, the full text search in the question), an IN expression (or, in some cases, a proper JOIN) yields a much more efficient plan.

Matthew Wood 2010-07-06 14:14:36
