A while ago I posted a message about optimizing a query in MySQL. I have since ported the data and query to PostgreSQL, but PostgreSQL has the same problem. The solution in MySQL was to use STRAIGHT_JOIN, which forces the optimizer to join the tables in the order written rather than reordering them. PostgreSQL offers no such keyword.

Update Revised

I have isolated the part of the query that fixes the problem (d.month_ref_id = 1):

select
  d.*
from
  daily d
join month_ref m on m.id = d.month_ref_id 
join year_ref y on y.id = m.year_ref_id
where
  m.category_id = '001' and
  d.month_ref_id = 1 

However, I can't hard-code a month reference to 1. The query that produces a full table scan is:

select
  d.*
from
  daily d
join month_ref m on m.id = d.month_ref_id 
join year_ref y on y.id = m.year_ref_id
where
  m.category_id = '001'

The index on daily.month_ref_id is:

CREATE INDEX daily_month_ref_idx
  ON climate.daily
  USING btree
  (month_ref_id);

Why is the query performing a full table scan and what can be done to avoid it?

Thank you!

+1  A: 

I don't know what other variations of the query you've tried, but the JOIN on City seems a little strange - have you tried replacing that with a WHERE clause? Also, the relationships between the various tables are currently in the WHERE clause - these are probably best implemented as an INNER JOIN.

Disclaimer: I don't know PostgreSQL specifically.

EDIT: Here's a link that describes changing WHERE clauses to JOINs to influence join order, and discusses the join_collapse_limit to force the optimizer to use your specified join order. http://www.postgresql.org/docs/8.2/static/explicit-joins.html
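A minimal sketch of what the linked page describes, assuming the `daily`, `month_ref`, and `year_ref` tables from the question. `join_collapse_limit` is a documented PostgreSQL setting, but the join order chosen here is only a guess at what might help and has not been run against this data:

```sql
-- Stop the planner from reordering explicit JOINs in this session.
SET join_collapse_limit = 1;

-- With the limit at 1, tables are joined in the order they are written:
-- the filtered month_ref rows first, then year_ref, then daily.
SELECT d.*
FROM month_ref m
JOIN year_ref y ON y.id = m.year_ref_id
JOIN daily d ON d.month_ref_id = m.id
WHERE m.category_id = '001';

-- Return to the server default afterwards.
RESET join_collapse_limit;
```

Comparing the EXPLAIN output before and after the SET shows whether the written order is actually better than the planner's own choice.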

EDIT2: Another alternative is to nest SELECT statements, which may also force the optimizer to construct the query in the (reverse) nesting order you specify.

mdma
What about moving the WHERE clauses into joins? See my edit - using joins seems to be the way to go to provide a hint about join order.
mdma
+2  A: 
  1. Even though it might not make much of a performance difference, I would use Join clauses to join the tables instead of cross joins and the Where clause.
  2. You are calling a function on column values in the Where clause, which generally prevents an ordinary index on those columns from being used and leads to a table scan. This is true of most databases, not just the one you are using.
  3. Why the Left Join on City? Do you know for a fact that the given Id will exist (in this case 10663)? If so, you should use an inner join.
  4. You might be able to give the compiler hints about how to formulate the query using parentheses (I'm not sure if Postgres will honor them).
Select  avg(d.amount) AS amount,  y.year
From (station s
        Left Join city c -- You want to cross join on city? Why not use an Inner join?
            On c.id = 10663
                And 6371.009 
                  * SQRT( 
                        POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) 
                        + (
                            COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) 
                            * POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2)
                            )
                        ) <= 50)
    Join station_district sd
        On sd.Id = s.station_district_id
    Join year_ref y
        On y.station_district_id = sd.id
    Join month_ref m
        On m.year_ref_id = y.id
    Join daily d
        On d.month_ref_id = m.id
Where s.elevation Between 0 And 2000 
    And y.year Between 1980 And 2000
    And m.month = 12
    And m.category_id = '001'
    And d.daily_flag_id <> 'M'
Group By y.year

Since you are not using the station, station_district nor city table in the results, you might be able to move those to an exists statement:

Select  avg(d.amount) AS amount,  y.year
From year_ref y
    Join month_ref m
        On m.year_ref_id = y.id
    Join daily d
        On d.month_ref_id = m.id
Where y.year Between 1980 And 2000
    And m.month = 12
    And m.category_id = '001'
    And d.daily_flag_id <> 'M'
    And Exists  (
                Select 1
                From station s1
                    Join city c1
                        On c1.id = 10663
                Where 6371.009 
                      * SQRT( 
                            POW(RADIANS(c1.latitude_decimal - s1.latitude_decimal), 2) 
                            + (
                                COS(RADIANS(c1.latitude_decimal + s1.latitude_decimal) / 2) 
                                * POW(RADIANS(c1.longitude_decimal - s1.longitude_decimal), 2)
                                )
                            ) <= 50
                    And S1.station_district_id = y.station_district_id
                )
Group By y.year
Thomas
@Thomas: I don't mind doing a full table scan on the station or city tables, they only have a few thousand items. It's the `MONTH_REF` and `DAILY` tables that must avoid FTS.
Dave Jarvis
@Dave Jarvis - I had a typo in the query in that I had the station_district table in the inner query. I've corrected that. Btw, another solution would be a derived table (akin to a subquery in the From clause) that isolates the mathematical calculation and forces the compiler to execute separately.
Thomas
@Dave Jarvis - If the month_ref and daily tables are the largest and will benefit the most from filtering, then you might try putting those into a derived table with their criteria.
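A rough sketch of the derived-table idea, using the table and column names from the queries above. The station and city filter is left out to keep the example short, and the alias `md` is made up for illustration. PostgreSQL may still flatten the subquery into the outer query, so this is something to try rather than a guaranteed fix:

```sql
SELECT avg(md.amount) AS amount, y.year
FROM year_ref y
JOIN (
    -- Filter the two large tables together before joining to year_ref.
    SELECT m.year_ref_id, d.amount
    FROM month_ref m
    JOIN daily d ON d.month_ref_id = m.id
    WHERE m.month = 12
      AND m.category_id = '001'
      AND d.daily_flag_id <> 'M'
) AS md ON md.year_ref_id = y.id
WHERE y.year BETWEEN 1980 AND 2000
GROUP BY y.year;
```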
Thomas
@Thomas: The query is parameterized, and there will be more and more parameters added. So m.month = 12 is really `m.month = $P{SelectedMonth}`. I don't think I can filter?
Dave Jarvis
@Thomas: I have run the EXPLAIN against what you suggested. Still getting a full table scan on `DAILY`. I do appreciate your help. Gives me somewhere to look.
Dave Jarvis
@Thomas: I made the city an inner join. Helped a lot!
Dave Jarvis
@Dave Jarvis - If you comment out the Exists statement, do you still get a table scan? If so, then either there isn't an index on daily_flag_id or, given the criteria, the compiler thinks that doing a table scan is more efficient for some reason. RE: Query parameter, I do not see why you couldn't use your parameter. The real question is whether you can avoid a scan under optimal circumstances using static values.
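One way to check why the planner prefers the scan, sketched with the simplified query from the question. It assumes `daily` and `month_ref` are reachable on the current search path. `enable_seqscan` is a standard PostgreSQL planner setting, meant here for diagnosis only and not as a permanent fix:

```sql
-- Refresh planner statistics so row estimates reflect the ported data.
ANALYZE month_ref;
ANALYZE daily;

-- Show the chosen plan with estimated and actual row counts.
EXPLAIN ANALYZE
SELECT d.*
FROM daily d
JOIN month_ref m ON m.id = d.month_ref_id
WHERE m.category_id = '001';

-- Diagnosis only: discourage sequential scans, then compare the timings.
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT d.*
FROM daily d
JOIN month_ref m ON m.id = d.month_ref_id
WHERE m.category_id = '001';

RESET enable_seqscan;
```

If the index plan is slower than the sequential scan, the planner's choice was reasonable for that filter. If it is faster, the row estimates in the first plan are the place to look.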
Thomas
@Thomas: There isn't an index on the `daily_flag_id` -- most of the rows have an empty value. It is only once the data set (some 68000 rows) is returned that it should be scanning through to find the rows.
Dave Jarvis
@Thomas: Thanks for all the help; I'm going to rewrite the table structure to be more PostgreSQL-friendly.
Dave Jarvis