tags:

views:

59

answers:

2

Here's a puzzler for you:

I'm keeping stats of cluster computing stuff in a MySQL table named 'jobs'. Each job row has a host the job executed on (not unique), a job execution time in seconds, and a unique integer as the PK so I can order the completed jobs simply by ordering the PK.

As of right now, using average and group by, I can find the average execution time in seconds for each host over all of the jobs completed. Instead of averaging all the execution times per host, I want the average time of the last five jobs per host.

There's all sorts of examples for operations and group by, and lots of examples for operations with limit, but is there any way of combining the two in a fairly straightforward MySQL query?

EDIT: In the event that I'm not clear about it, I want the average five execution times for host 1, and the average five execution times for host 2, etc.

+1  A: 

My initial reaction was to use LIMIT to restrict the average to 5 results, which led me to suggest:

select a.host, avg(a.execution_time) from (select id, execution_time, host from jobs order by id desc limit 5) a group by a.host;

But it is clear that this limits the average to the most recent 5 jobs, and not the most recent 5 jobs per host.

It seems difficult to use LIMIT to restrict the average, without using some kind of stored procedure. This led me to consider assigning each job a per-host completion order, or position, using a mysql variable.

This is untested, but the theory it illustrates should be a good starting point:

First, we should assign each job a position based on its host:

select
  host, 
  execution_time,
  @current_pos := if (@current_host = host, @current_pos, 0) + 1 as position,
  @current_host := host
from
  (select @current_host := null, @current_pos := 0) set_pos,
  jobs
order by
  host,
  id desc;

After establishing the position, just select the aggregate function, restricting results to the top 5 positions:

select
  jt.host,
  avg(jt.execution_time)
from
  (
  select
    host, 
    execution_time,
    @current_pos := if (@current_host = host, @current_pos, 0) + 1 as position,
    @current_host := host
  from
    (select @current_host := null, @current_pos := 0) set_pos,
    jobs
  order by
    host,
    id desc
  ) jt
where
  jt.position <= 5
group
  by host;

Please let me know if this works for you, or if there are more aspects I have not considered. This is an intriguing problem.

Brandon Horsley
A small problem; you're running into the same problem I had, the nested query will result in only five hosts being returned.
Rob
Yes, I see the problem was more complicated than I initially thought. I have updated my solution to attempt to address this.
Brandon Horsley
Nice! Works perfect for me.
Rob
+1  A: 

I want the average five execution times for host 1, and the average five execution times for host 2, etc.

Oh... In that case, use:

SELECT x.host, AVG(x.execution_time)
  FROM (SELECT j.pk,
               j.host,
               j.execution_time,
               CASE 
                 WHEN @host != j.host THEN @rownum := 1 
                 ELSE @rownum := @rownum + 1 
               END AS rank,
               @host := j.host
          FROM JOBS j
          JOIN (SELECT @rownum := 0; @host := '') r
      ORDER BY j.host, j.execution_time DESC) x
 WHERE x.rank <= 5
GROUP BY x.host

MySQL doesn't have any ranking/analytical/windowing functionality, but does support variables so you can get the same functionality at ROW_NUMBER() OVER (PARTITION BY host ORDER BY execution_time DESC).

Previously:

      SELECT AVG(j.execution_time) AS avg_last_five_jobs
        FROM JOBS j
        JOIN (SELECT t.pk
                    FROM JOBS t
             ORDER BY t.pk DESC
                    LIMIT 5) x ON x.pk = j.pk
OMG Ponies
I've had no luck with this, either. I only get one result back, and it's not the average.
Rob
@Rob: Updated in light of clarification
OMG Ponies
Hmm...I had a couple problems with the above solution so far, I'll probably play with it some more, though.
Rob