tags:

views:

218

answers:

3

I have a table containing the runtimes for generators on different sites, and I want to select the most recent entry for each site. Each generator is run once or twice a week.

I have a query that will do this, but I wonder if it's the best option. I can't help thinking that using WHERE x IN (SELECT ...) is lazy and not the best way to formulate the query - any query.

The table is as follows:

CREATE TABLE generator_logs (
    id integer NOT NULL,
    site_id character varying(4) NOT NULL,
    start timestamp without time zone NOT NULL,
    "end" timestamp without time zone NOT NULL,
    duration integer NOT NULL
);

And the query:

SELECT id, site_id, start, "end", duration 
FROM generator_logs
WHERE start IN (SELECT MAX(start) AS start 
                FROM generator_logs 
                GROUP BY site_id) 
ORDER BY start DESC

There isn't a huge amount of data, so I'm not worried about optimizing the query. However, I do have to do similar things on tables with 10s of millions of rows, (big tables as far as I'm concerned!) and there optimisation is more important.

So is there a better query for this, and are inline queries generally a bad idea?

A: 

In MYSQL it could be problematic because Last i Checked it was unable to optimise subqueries effectively ( Ie: by query-rewriting )

Many DBMS's have Genetic Query planners which will do the same thing regardless of your input queries structure.

MYSQL will in some cases for that situation create a temp table, other times not, and depending on the circumstances, indexing, condtions, subqueries can still be rather quick.

Some complain that subqueries are hard to read, but they're perfectly fine if you fork them into local variables.

$maxids = 'SELECT MAX(start) AS start FROM generator_logs GROUP BY site_id';
$q ="     
    SELECT id, site_id, start, \"end\", duration 
       FROM generator_logs
       WHERE start IN ($maxids) 
       ORDER BY start DESC
";
Kent Fredric
+4  A: 

Should your query not be correlated? i.e.:

SELECT id, site_id, start, "end", duration 
FROM generator_logs g1
WHERE start = (SELECT MAX(g2.start) AS start 
               FROM generator_logs  g2
               WHERE g2.site_id = g1.site_id) 
ORDER BY start DESC

Otherwise you will potentially pick up non-latest logs whose start value happens to match the latest start for a different site.

Or alternatively:

SELECT id, site_id, start, "end", duration 
FROM generator_logs g1
WHERE (site_id, start) IN (SELECT site_id, MAX(g2.start) AS start 
                           FROM generator_logs  g2
                           GROUP BY site_id)
ORDER BY start DESC
Tony Andrews
+1  A: 

I would use joins as they perform much better then "IN" clause:

select gl.id, gl.site_id, gl.start, gl."end", gl.duration 
from 
    generator_logs gl
    inner join (
     select max(start) as start, site_id
        from generator_logs 
        group by site_id
    ) gl2
     on gl.site_id = gl2.site_id
     and gl.start = gl2.start

Also as Tony pointed out you were missing correlation in your original query

kristof