There are a lot of SQL top-N questions on Stack Overflow, but I can't seem to find one that matches my situation. I would like to perform some grouping within a top-N query. My data looks like this (obviously with fake values).

MY_DATE    IP_ADDRESS
1/1/09     999.999.999.999
1/1/09     999.999.999.999
1/1/09     999.999.999.998
... a lot more rows

The date range for the table covers several months and has many thousands of rows per month. What I would like to do is have a single query tell me which 10 IP Addresses occurred the most frequently for each month. I can do this for a single month using the following:

SELECT DATE_FORMAT(MY_DATE, '%b-%y') AS "MONTH", IP_ADDRESS, COUNT(*) AS HITS
FROM MY_DATA
WHERE DATE_FORMAT(MY_DATE, '%b-%y') = 'JAN-09'
GROUP BY DATE_FORMAT(MY_DATE, '%b-%y'), IP_ADDRESS
ORDER BY HITS DESC
LIMIT 10

But what I really want is to see the top n for every month in the data set. That essentially prohibits me from using the WHERE clause I specified. Of course, when I drop it, I just get the top 10 across all months. The result I'm looking for should look like this:

MONTH    IP_ADDRESS        COUNT(*)
JAN-09   999.999.999.999   200
JAN-09   999.999.999.998   150
... ( 8 more rows of January )
FEB-09   999.999.999.999   320
FEB-09   999.999.999.998   234
... ( 8 more rows of February)
MAR-09   999.999.999.999   440
... ETC.

Can this be done in MySQL? It seems the barrier I'm hitting is that MySQL doesn't allow an ORDER BY within a query statement included in a UNION. Thanks for the help!

A: 

This is a first rough guess, but try this:

Select Month, Address
From  (Select DATE_FORMAT(MY_DATE, '%b-%y') Month, 
       IP_Address Address, Count(*) AddressCount
         From MY_DATA
       Group By DATE_FORMAT(MY_DATE, '%b-%y'), IP_Address) Z
  Join(Select DATE_FORMAT(MY_DATE, '%b-%y') Month, 
          IP_Address Address, Count(*) AddressCount
       From MY_DATA
       Group By DATE_FORMAT(MY_DATE, '%b-%y'), IP_Address) ZZ
    On ZZ.Month = Z.Month 
       And ZZ.AddressCount >= Z.AddressCount 
Group By Z.Month, Z.Address
Where Count(ZZ.AddressCount) >= 10
Charles Bretana
Sorry, but that isn't even valid SQL. Where are you doing the limiting?
Mike Farmer
... and your "ON" clause, or associated where clause, is missing resulting in a Cartesian product.
Mike Farmer
@Charles: Change the `JOIN` to `LEFT OUTER JOIN`. Change the `WHERE` of the outer query to `HAVING`. And change `>=` to `<=` in that last condition.
Bill Karwin
OK, I get the approach you are taking here. But this query doesn't return any results, even with Bill's updates. I'm not a novice here, so I get what you are trying to do. But this approach is not very efficient for our data set (hundreds of thousands of rows). Also, this query will not return the TOP n. Just the first n. Good try, but not what I'm looking for.
Mike Farmer
A: 

I just tried a query very similar to the one given by @Charles Bretana and it does work. I used a VIEW to help clarify things.

CREATE TABLE my_data (
 my_date DATE,
 ip_address CHAR(15)
);

Insert a bunch of date/IP address pairs (not shown)...

Create a view for all counts per month and IP address:

CREATE VIEW my_data_per_month as
 SELECT EXTRACT(YEAR_MONTH FROM my_date) AS month,
   ip_address, COUNT(*) AS hits
 FROM my_data
 GROUP BY month, ip_address;

SELECT * FROM my_data_per_month
ORDER BY month ASC, hits DESC;

+--------+-----------------+------+
| month  | ip_address      | hits |
+--------+-----------------+------+
| 200901 | 999.999.999.999 |    8 | 
| 200901 | 999.999.999.998 |    6 | 
| 200901 | 999.999.999.997 |    5 | 
| 200901 | 999.999.999.996 |    4 | 
| 200901 | 999.999.999.995 |    3 | 
| 200901 | 999.999.999.994 |    2 | 
| 200902 | 999.999.999.998 |    8 | 
| 200902 | 999.999.999.997 |    6 | 
| 200902 | 999.999.999.996 |    5 | 
| 200902 | 999.999.999.995 |    4 | 
| 200902 | 999.999.999.994 |    3 | 
| 200902 | 999.999.999.993 |    2 | 
| 200903 | 999.999.999.997 |    8 | 
| 200903 | 999.999.999.996 |    6 | 
| 200903 | 999.999.999.995 |    5 | 
| 200903 | 999.999.999.994 |    4 | 
| 200903 | 999.999.999.993 |    3 | 
| 200903 | 999.999.999.992 |    2 | 
+--------+-----------------+------+

Now show the top three IP addresses per month:

SELECT m1.month, m1.ip_address, m1.hits
FROM my_data_per_month m1
LEFT OUTER JOIN my_data_per_month m2
  ON (m1.month = m2.month AND m1.hits < m2.hits)
GROUP BY m1.month, m1.ip_address
HAVING COUNT(*) < 3
ORDER BY m1.month ASC, m1.hits DESC;

+--------+-----------------+------+
| month  | ip_address      | hits |
+--------+-----------------+------+
| 200901 | 999.999.999.999 |    8 | 
| 200901 | 999.999.999.998 |    6 | 
| 200901 | 999.999.999.997 |    5 | 
| 200902 | 999.999.999.998 |    8 | 
| 200902 | 999.999.999.997 |    6 | 
| 200902 | 999.999.999.996 |    5 | 
| 200903 | 999.999.999.997 |    8 | 
| 200903 | 999.999.999.996 |    6 | 
| 200903 | 999.999.999.995 |    5 | 
+--------+-----------------+------+
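
For anyone who wants to check the self-join approach without a MySQL server, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not the setup used above, and several things in it are assumptions: SQLite stands in for MySQL, strftime('%Y%m', ...) replaces EXTRACT(YEAR_MONTH ...), the SAMPLE_HITS data and the top_n_per_month helper are hypothetical names made up for illustration, and the HAVING clause counts only matched m2 rows (COUNT(m2.ip_address)) rather than COUNT(*), which gives the same rows here but makes the "fewer than n addresses have more hits" reading explicit.

```python
import sqlite3

# Hits per (month, last octet of the IP), copied from the sample output above.
SAMPLE_HITS = {
    "2009-01": {999: 8, 998: 6, 997: 5, 996: 4, 995: 3, 994: 2},
    "2009-02": {998: 8, 997: 6, 996: 5, 995: 4, 994: 3, 993: 2},
    "2009-03": {997: 8, 996: 6, 995: 5, 994: 4, 993: 3, 992: 2},
}


def top_n_per_month(n):
    """Return (month, ip_address, hits) for the n most frequent IPs per month."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_data (my_date TEXT, ip_address TEXT)")

    # Expand the counts back into one raw row per hit, as in the original table.
    rows = [
        (f"{month}-15", f"999.999.999.{octet}")
        for month, per_ip in SAMPLE_HITS.items()
        for octet, hits in per_ip.items()
        for _ in range(hits)
    ]
    conn.executemany("INSERT INTO my_data VALUES (?, ?)", rows)

    # Assumption: strftime('%Y%m', ...) is SQLite's stand-in for
    # MySQL's EXTRACT(YEAR_MONTH FROM ...).
    conn.execute("""
        CREATE VIEW my_data_per_month AS
        SELECT strftime('%Y%m', my_date) AS month, ip_address, COUNT(*) AS hits
        FROM my_data
        GROUP BY strftime('%Y%m', my_date), ip_address
    """)

    # For each row, count the rows in the same month with strictly more hits;
    # keep the row when fewer than n such rows exist.
    query = """
        SELECT m1.month, m1.ip_address, m1.hits
        FROM my_data_per_month m1
        LEFT OUTER JOIN my_data_per_month m2
          ON m1.month = m2.month AND m1.hits < m2.hits
        GROUP BY m1.month, m1.ip_address, m1.hits
        HAVING COUNT(m2.ip_address) < ?
        ORDER BY m1.month ASC, m1.hits DESC
    """
    result = conn.execute(query, (n,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in top_n_per_month(3):
        print(row)
```

Note that addresses tied on hits all pass the HAVING test together, so a month can return more than n rows when there are ties at the cutoff. The join is also quadratic in the number of addresses per month, which is the efficiency concern raised in the comments above.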
Bill Karwin