ansaurus

Question

MySQL Query - Using Aggregate and Group By to generate separate results

Answer 1

A:

See whether this makes it clearer for you:

SELECT archive_asset_id, AVG(actual_percent)
FROM (SELECT id, archive_asset_id, asset_title,
             MAX(view_percent) AS actual_percent
      FROM log_embed_video GROUP BY id) T
GROUP BY archive_asset_id;

It returns:

+------------------+---------------------+
| archive_asset_id | AVG(actual_percent) |
+------------------+---------------------+
|            83386 |               36.75 | 
+------------------+---------------------+

A few notes

This will not perform well on 100M records.
You might also want to normalize your data to improve performance (which it will in this case; moving the final per-id rows into their own table makes much more sense to me).
The expression COUNT(DISTINCT id * 1000000 + archive_asset_id) caught my eye as bizarre; are you sure you don't simply mean COUNT(*) or COUNT(id)?

EDIT:

For the second one

SELECT archive_asset_id, actual_percent, COUNT(*)
FROM (SELECT id, archive_asset_id, asset_title,
             MAX(view_percent) AS actual_percent
      FROM log_embed_video GROUP BY id) T
GROUP BY archive_asset_id, actual_percent;

+------------------+----------------+----------+
| archive_asset_id | actual_percent | count(*) |
+------------------+----------------+----------+
|            83386 |             13 |        1 | 
|            83386 |             17 |        2 | 
|            83386 |            100 |        1 | 
+------------------+----------------+----------+
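
Both queries above can be checked against a small data set. The following is only a sketch: the sample rows and the asset title are hypothetical (chosen so that the per-id maxima are 13, 17, 17 and 100), and SQLite via Python's sqlite3 module stands in for MySQL. The inner query here groups by id and archive_asset_id instead of id alone, assuming every id belongs to exactly one asset, so that it does not rely on non-aggregated columns.

```python
import sqlite3

# Hypothetical sample rows: (id, archive_asset_id, asset_title, view_percent).
# Each id is assumed to be one viewing session that logs several progress rows.
ROWS = [
    (1, 83386, "Demo", 5), (1, 83386, "Demo", 13),
    (2, 83386, "Demo", 10), (2, 83386, "Demo", 17),
    (3, 83386, "Demo", 17),
    (4, 83386, "Demo", 50), (4, 83386, "Demo", 100),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_embed_video ("
    "id INTEGER, archive_asset_id INTEGER, asset_title TEXT, view_percent INTEGER)"
)
conn.executemany("INSERT INTO log_embed_video VALUES (?, ?, ?, ?)", ROWS)

# Inner query: one row per id holding that id's highest view_percent.
PER_ID = (
    "SELECT id, archive_asset_id, MAX(view_percent) AS actual_percent "
    "FROM log_embed_video GROUP BY id, archive_asset_id"
)

# First query: average of the per-id maxima for each asset.
average = conn.execute(
    f"SELECT archive_asset_id, AVG(actual_percent) FROM ({PER_ID}) T "
    "GROUP BY archive_asset_id"
).fetchall()

# Second query: how many ids reached each maximum percentage.
distribution = conn.execute(
    f"SELECT archive_asset_id, actual_percent, COUNT(*) FROM ({PER_ID}) T "
    "GROUP BY archive_asset_id, actual_percent "
    "ORDER BY archive_asset_id, actual_percent"
).fetchall()

print(average)       # [(83386, 36.75)]
print(distribution)  # [(83386, 13, 1), (83386, 17, 2), (83386, 100, 1)]
```

With these rows the average is (13 + 17 + 17 + 100) / 4 = 36.75, which matches the first result table, and the counts match the second.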

Unreason 2010-09-20 15:58:28

`COUNT(DISTINCT id * 1000000 + archive_asset_id)` I assume is a kludge to get to `COUNT(DISTINCT id,archive_asset_id)`

Wrikken 2010-09-20 15:59:40

We are using an Infobright Brighthouse database, which is designed for querying large sets of data. Currently that database does not support COUNT(DISTINCT col1, col2), so that is their recommended workaround.

Scott 2010-09-20 16:09:52
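
For readers wondering why id * 1000000 + archive_asset_id can stand in for a two-column DISTINCT: it packs the pair into a single integer. A minimal Python sketch, with hypothetical pair values and a hypothetical helper name (encode), shows that the count matches as long as archive_asset_id stays below the multiplier, and that two different pairs collide once it does not.

```python
# Hypothetical (id, archive_asset_id) pairs; the values are illustrative only.
pairs = [(1, 83386), (1, 83386), (2, 83386), (2, 90000)]


def encode(row_id, asset_id, factor=1_000_000):
    """Pack the pair into one integer, as in id * 1000000 + archive_asset_id."""
    return row_id * factor + asset_id


# Counting distinct pairs directly versus counting distinct encoded values.
distinct_pairs = len(set(pairs))
distinct_encoded = len({encode(i, a) for i, a in pairs})
print(distinct_pairs, distinct_encoded)  # 3 3

# The encoding is only unique while 0 <= archive_asset_id < factor.
# Here two different pairs map to the same integer, 2500000.
collision = encode(1, 1_500_000) == encode(2, 500_000)
print(collision)  # True
```

So the workaround is safe only if every archive_asset_id is known to be smaller than 1000000.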

@Unreason - both of these modified queries work quite well. When filtering the data by date range and by a single asset id, the results are returned in less than 200 ms with over 100 million records in this table. Thank you for taking the time to answer my question!

Scott 2010-09-20 16:36:14

@Scott: ah, I see the logic for the DISTINCT. Well, that's not necessary either now. (P.S. see the FAQ re voting.)

Unreason 2010-09-20 18:34:49

Answer 2

+1 A:

The max-percentage row(s) for each id:

SELECT a.* 
FROM log_embed_video a 
LEFT JOIN log_embed_video b
ON b.id = a.id
AND b.view_percent > a.view_percent
WHERE b.id IS NULL
-- Possibly restrict by date range for better performance.
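
To see what the LEFT JOIN query returns, here is a small sketch against hypothetical rows, again using SQLite through Python's sqlite3 module as a stand-in for MySQL. A row survives only if no other row with the same id has a higher view_percent, because only then does the join leave b.id as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_embed_video ("
    "id INTEGER, archive_asset_id INTEGER, view_percent INTEGER)"
)
# Hypothetical rows: several progress rows per id.
conn.executemany(
    "INSERT INTO log_embed_video VALUES (?, ?, ?)",
    [(1, 83386, 5), (1, 83386, 13), (2, 83386, 10), (2, 83386, 17), (3, 83386, 17)],
)

# Anti-join: keep row a only when no row b with the same id beats it.
rows = conn.execute(
    "SELECT a.id, a.archive_asset_id, a.view_percent "
    "FROM log_embed_video a "
    "LEFT JOIN log_embed_video b "
    "  ON b.id = a.id AND b.view_percent > a.view_percent "
    "WHERE b.id IS NULL "
    "ORDER BY a.id"
).fetchall()

print(rows)  # [(1, 83386, 13), (2, 83386, 17), (3, 83386, 17)]
```

Note that if two rows of the same id tie for the highest view_percent, both come back, since neither is strictly greater than the other.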

Performance wise this is better:

SELECT * FROM (
    SELECT id, archive_asset_id, asset_title, view_percent, created,
        @rn := IF(id != @old_id,1,@rn + 1) as rownumber,
        @old_id := id 
    FROM log_embed_video 
    JOIN (SELECT @rn:=0,@old_id:=0) void
    ORDER BY id, view_percent DESC
) a WHERE rownumber=1;
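
The user-variable query amounts to a single sorted scan with a counter that resets whenever id changes. The Python sketch below emulates that logic on hypothetical (id, view_percent) rows; the function name top_row_per_id is made up for illustration, and this is not how MySQL executes the query internally. As far as I know, the MySQL manual does not guarantee the evaluation order of user variables within one statement, so treat the SQL version as something to verify on your own server.

```python
# Hypothetical rows: (id, view_percent), deliberately unsorted.
rows = [(2, 10), (1, 5), (1, 13), (2, 17), (3, 17)]


def top_row_per_id(rows):
    """Emulate ORDER BY id, view_percent DESC plus the @rn / @old_id counter."""
    ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
    result = []
    old_id = None  # plays the role of @old_id
    rn = 0         # plays the role of @rn
    for row_id, percent in ordered:
        # Restart numbering at 1 when a new id begins, otherwise count up.
        rn = 1 if row_id != old_id else rn + 1
        old_id = row_id
        if rn == 1:  # the outer WHERE rownumber = 1
            result.append((row_id, percent))
    return result


print(top_row_per_id(rows))  # [(1, 13), (2, 17), (3, 17)]
```

Unlike the LEFT JOIN version, this keeps exactly one row per id even when the highest view_percent is tied.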

Wrikken 2010-09-20 16:04:37

+1 because it works :) I do wonder which solution Scott will find easier to understand...

Unreason 2010-09-20 16:12:29

Readability does suffer if one isn't familiar with the LEFT JOIN trick. It also suffers a bit performance-wise. I might enter an even more unreadable answer that is a lot quicker...

Wrikken 2010-09-20 16:16:41

There, now all readability is out the window in favor of performance :P

Wrikken 2010-09-20 16:22:54

@Wrikken - thank you for submitting this query, but it does not yield the desired result. Both of the modified queries submitted by Unreason work for what I am trying to accomplish. Thank you!!!

Scott 2010-09-20 16:32:05

@Scott: those were the generic 'I want the row with the MIN/MAX value of a set, with the related data from the same row' queries. They are indeed not the full queries you'd want, but both result in the proper 'Conditioned Data' you mention and are handy in anyone's (My)SQL toolset, especially the second one for its performance (on normal InnoDB / MyISAM tables at least). But no worries, Unreason's answer does indeed work properly.

Wrikken 2010-09-20 16:36:10

@Wrikken - regarding my last comment, I did not see your second query under "Performance wise this is better:". I will evaluate this query as well. Regarding the conditioned data, your first query does indeed return those 4 records.

Scott 2010-09-20 16:40:33

Wish I had another vote to give you; it's been a while since I've seen a new solution for the MIN/MAX row. I mean, I had all the concepts, but it would not have occurred to me to try THAT to improve performance :) But seriously, my original question was sincere: I was really wondering which path would be more, let's say, natural to the OP.

Unreason 2010-09-20 18:49:17

A sincere answer to that question is: although both my solutions are valid, history teaches me that those two almost never get any votes from users, although the first one does get some points from the _'I think subqueries are bad by default'_ - crowd (of which I have been a member). The user-defined variables in the second one _really_ scare most people off ;)

Wrikken 2010-09-20 19:07:42
