High-level: Can I make this order by / group by on a sum any faster? (PG 8.4, fwiw, on a non-tiny table... think O(millions of rows).)

Suppose I had a table like this:

                                 Table "public.summary"
   Column    |       Type        |                      Modifiers
-------------+-------------------+------------------------------------------------------
 ts          | integer           | not null default nextval('summary_ts_seq'::regclass)
 field1      | character varying | not null
 otherfield  | character varying | not null
 country     | character varying | not null
 lookups     | integer           | not null


Indexes:
    "summary_pk" PRIMARY KEY, btree (ts, field1, otherfield, country)
    "ix_summary_country" btree (country)
    "ix_summary_field1" btree (field1)
    "ix_summary_otherfield" btree (otherfield)
    "ix_summary_ts" btree (ts)

And the query I want is:

select summary.field1,
    summary.country,
    summary.ts,
    sum(summary.lookups) as lookups
from summary
where summary.country = 'za' and
    summary.ts = 1275177600
group by summary.field1, summary.country, summary.ts
order by summary.ts, lookups desc, summary.field1
limit 100;

(English: the top 100 field1's at a particular (ts, country), where 'topness' is the sum of lookups over all matching rows, regardless of the value of otherfield)

Is there anything I can really do to speed this up? Algorithmically this seems to be a full table scan kind of thing, but I might be missing something.

+1  A: 

In order to be able to suggest anything, you should post the execution plan of the query.

And "OMG Ponies" is right: limit 100 will limit the overall result to 100 rows, it will not work on individual groups!

There is a nice article in the Postgres Wiki that explains how to post a question related to a slow query:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

a_horse_with_no_name
I have corrected the question to reflect OMG Ponies' point. They are correct, but this is the right query and what I want; I have updated the text to match.
Gregg Lind
+2  A: 

Any query plan for this query will have to scan every row that matches the WHERE conditions, rolling them up by the grouping conditions - that is, the amount of work is proportional to the number of input rows to the group by, not the number of result rows.

The most efficient query plan possible for a query like this is a single index scan. This ought to be possible if you build an index on (country, ts) in that order; with that index, every possible query of this form resolves to a contiguous range over the index. This will still require an in-memory sort, though - it may be possible to avoid this with a different index.
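
For concreteness, a minimal sketch of that index (the index name here is my own choice):

create index ix_summary_country_ts on summary (country, ts);

With it, the rows matching country = 'za' and ts = 1275177600 form one contiguous range in the index, so only those rows need to be read; the sort on the summed lookups still happens afterwards.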

As others have said, though, posting an execution plan is your best option.

Nick Johnson
+1  A: 

An index on (country, ts) is your best bet (as Nick Johnson suggests), and additionally you may want to raise work_mem if it's not set very high. You can SET it at runtime if needed (and if you make it very high, doing it per-session rather than globally is recommended). That will help keep your sorts in memory rather than spilling to disk (if that's happening).
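
For example (the value is only a placeholder; size it for your own hardware and workload):

set work_mem = '256MB';   -- session-level, takes effect immediately
-- ... run the query ...
reset work_mem;           -- back to the configured default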

For real help, we'll need to see the output of EXPLAIN ANALYZE; posting it on explain.depesz.com makes it very readable.
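
That is just the query from the question prefixed with EXPLAIN ANALYZE:

explain analyze
select summary.field1,
    summary.country,
    summary.ts,
    sum(summary.lookups) as lookups
from summary
where summary.country = 'za' and
    summary.ts = 1275177600
group by summary.field1, summary.country, summary.ts
order by summary.ts, lookups desc, summary.field1
limit 100;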

rfusca