I spent over an hour today puzzling myself over a query plan that I couldn't understand. The query was an UDPATE
and it just wouldn't run AT ALL. Totally deadlocked: pg_locks
showed it wasn't waiting for anything either. Now, I don't consider myself the best or worst query plan reader guy, but I find this one exceptionally difficult. I'm wondering how does one read these? Is there a methodology that the Pg aces follow in order to pinpoint the error?
I plan on asking another question as to how to work around this issue, but right now I'm speaking specifically on how to read these types of plans. Please do not point to any generic tutorial unless it specifically address this problem, highlighted below the query plan.
QUERY PLAN
--------------------------------------------------------------------------------------------
Nested Loop Anti Join (cost=47680.88..169413.12 rows=1 width=77)
Join Filter: ((co.fkey_style = v.chrome_styleid) AND (co.name = o.name))
-> Nested Loop (cost=5301.58..31738.10 rows=1 width=81)
-> Hash Join (cost=5301.58..29722.32 rows=229 width=40)
Hash Cond: ((io.lot_id = iv.lot_id) AND ((io.vin)::text = (iv.vin)::text))
-> Seq Scan on options io (cost=0.00..20223.32 rows=23004 width=36)
Filter: (name IS NULL)
-> Hash (cost=4547.33..4547.33 rows=36150 width=24)
-> Seq Scan on vehicles iv (cost=0.00..4547.33 rows=36150 width=24)
Filter: (date_sold IS NULL)
-> Index Scan using options_pkey on options co (cost=0.00..8.79 rows=1 width=49)
Index Cond: ((co.fkey_style = iv.chrome_styleid) AND (co.code = io.code))
-> Hash Join (cost=42379.30..137424.09 rows=16729 width=26)
Hash Cond: ((v.lot_id = o.lot_id) AND ((v.vin)::text = (o.vin)::text))
-> Seq Scan on vehicles v (cost=0.00..4547.33 rows=65233 width=24)
-> Hash (cost=20223.32..20223.32 rows=931332 width=44)
-> Seq Scan on options o (cost=0.00..20223.32 rows=931332 width=44)
(17 rows)
The issue with this query plan - I believe I understand - is probably best said by RhodiumToad
(he is definitely better at this, so I'll bet on his explanation being better) of irc://irc.freenode.net/#postgresql
:
oh, that plan is potentially disasterous the problem with that plan is that it's running a hugely expensive hashjoin for each row the problem is the rows=1 estimate from the other join and the planner thinks it's ok to put a hugely expensive query in the inner path of a nestloop where the outer path is estimated to return only one row. since, obviously, by the planner's estimate the expensive part will only be run once but this has an obvious tendency to really mess up in practice the problem is that the planner believes its own estimates ideally, the planner needs to know the difference between "estimated to return 1 row" and "not possible to return more than 1 row" but it's not at all clear how to incorporate that into the existing code
He goes on to say:
it can affect any join, but usually joins against subqueries are the most likely
Now when I read this plan the first thing I noticed was the Nested Loop Anti Join
, this had a cost of 169,413
(I'll stick to upper bounds). This Anti-Join breaks down to the result of a Nested Loop
at cost of 31,738
, and the result of a Hash Join
at a cost of 137,424
. Now, the 137,424
, is much greater than 31,738
so I knew the problem was the Hash Join. Then I proceed to EXPLAIN ANALYZE
the Hash Join segment outside of the query. It executed in 7 secs. I made sure there was indexes on (lot_id, vin), and (co.code, and v.code) -- there was. I disabled seq_scan
and hashjoin
individually and notice a speed increase of less than 2 seconds. Not near enough to account for why it wasn't progressing after an hour.
But, after all this I'm totally wrong! Yes, it was the slower part of the query, but because the rows="1"
bit (I presume it was on the Nested Loop Anti Join
). Is there a tutorial that will help me identify these types of issues. Here it is a bug (lack of ability) in the planner mis-estimating the amount of rows? How am I supposed to read into this to come to the same conclusion RhodiumToad
did?
Is it simply rows="1"
that is supposed to trigger me figuring this out?
I did run VACUUM FULL ANALYZE
on all of the tables involved, and this is Postgresql 8.4.