We have a table that contains website page views, like:

time      | page_id
----------|-----------------------------
1256645862| pageA
1256645889| pageB
1256647199| pageA
1256647198| pageA
1256647300| pageB
1257863235| pageA
1257863236| pageC

In our production table, there are currently about 40,000 rows. We want to generate, for each day, the count of unique pages viewed in the preceding 30, 60, and 90 days. So, in the result set, we can look up a day and see how many unique pages were accessed within the 60-day period preceding it.

We were able to get a query to work in MSSQL:

SELECT DISTINCT
 CONVERT(VARCHAR,P.NDATE,101) AS 'DATE', 
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-29,P.NDATE) AND P.NDATE) AS SUB) AS '30D',
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-59,P.NDATE) AND P.NDATE) AS SUB) AS '60D',
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE NDATE BETWEEN DATEADD(D,-89,P.NDATE) AND P.NDATE) AS SUB) AS '90D'
FROM PERFLOG P
ORDER BY 'DATE'

NOTE: because MSSQL doesn't have the FROM_UNIXTIME function, we added the NDATE column for testing, which is just the converted time. NDATE does not exist in the production table.

Converting this query to MySQL gives us the "Unknown column 'P.time'" error:

SELECT DISTINCT
 FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'DATE', 
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '30D',
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '60D',
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM (SELECT PAGE_ID FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS SUB) AS '90D'
FROM PERFLOG P
ORDER BY 'DATE'

I understand this is because MySQL does not allow a subquery nested inside a derived table to reference a column from the outer query. But, unfortunately, we are at a loss on how to convert this query to work in MySQL. For now, we simply return all DISTINCT rows from the table and post-process them in PHP. That takes about 2-3 seconds for 40K rows, and I'm worried about performance when we have hundreds of thousands of rows.

Is it possible to do in MySQL? If so, can we expect it to perform better than our PHP post-processed solution?

UPDATE: Here's the query for creating the table:

CREATE TABLE `perflog` (
    `user_id` VARBINARY(40) NOT NULL,
    `elapsed` FLOAT UNSIGNED NOT NULL,
    `page_id` VARCHAR(255) NOT NULL,
    `time` INT(10) UNSIGNED NOT NULL,
    `ip` VARBINARY(40) NOT NULL,
    `agent` VARCHAR(255) NOT NULL,
    PRIMARY KEY (`user_id`, `page_id`, `time`, `ip`, `agent`)
) ENGINE=MyISAM;

Our production table has ~40K rows thus far.

A: 

Why do you have the subquery buried in a second level like that? Try this instead:

SELECT DISTINCT
 FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'DATE', 
 (SELECT COUNT(DISTINCT page_id) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '30D',
 (SELECT COUNT(DISTINCT page_id) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '60D',
 (SELECT COUNT(DISTINCT page_id) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '90D'
FROM PERFLOG P
ORDER BY 'DATE'
longneck
Thank you for the quick reply. I tried your suggestion (after correcting the stray SUB references in the SELECT list). It is still *running* after a few minutes. I'll wait it out to see what it returns but, assuming it returns the correct data, at this point it would take too long to be practical. :(
Chad
A: 

You can try using a single SELECT.

Select only the rows between the date and 90 days prior.

Then use a CASE expression in each field to check whether the date falls within the 30-, 60-, or 90-day window. For each field, if the case is true then 1 else 0, and sum those.

Something like

SELECT SUM(CASE WHEN p.Date IN 30 PERIOD THEN 1 ELSE 0 END) Cnt30,
       SUM(CASE WHEN p.Date IN 60 PERIOD THEN 1 ELSE 0 END) Cnt60,
       SUM(CASE WHEN p.Date IN 90 PERIOD THEN 1 ELSE 0 END) Cnt90
FROM   Table p
WHERE  p.Date IN 90 PERIOD
astander
Thanks for the reply. I'm not sure how to plug my conditionals into a CASE statement like that, having never used them. My first attempts failed to pass syntax checks. I'll need to go do some more reading.
Chad
Have a look at this for case statements http://dev.mysql.com/doc/refman/5.0/en/case-statement.html
astander
A: 

Change the subselects into joins, as such:

select
  FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'DATE',
  count(distinct p30.page_id) AS '30D',
  count(distinct p60.page_id) AS '60D',
  count(distinct p90.page_id) AS '90D'
from
  perflog p
  join perflog p30 on FROM_UNIXTIME(p30.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')
  join perflog p60 on FROM_UNIXTIME(p60.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')
  join perflog p90 on FROM_UNIXTIME(p90.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')

However, that's likely to run slowly because the pile of functions kills any indices on your date columns. A better solution might be:

create temporary table perf_tmp as
select
  FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'VIEWDATE',
  page_id
from
  perflog;

create index perf_dt on perf_tmp (VIEWDATE);

select
  VIEWDATE, 
  count(distinct p30.page_id) AS '30D',
  count(distinct p60.page_id) AS '60D',
  count(distinct p90.page_id) AS '90D'
from
  perf_tmp p
  join perf_tmp p30 on p30.VIEWDATE BETWEEN DATE_SUB(P.VIEWDATE, INTERVAL 30 DAY) AND p.VIEWDATE
  join perf_tmp p60 on p60.VIEWDATE BETWEEN DATE_SUB(P.VIEWDATE, INTERVAL 60 DAY) AND p.VIEWDATE
  join perf_tmp p90 on p90.VIEWDATE BETWEEN DATE_SUB(P.VIEWDATE, INTERVAL 90 DAY) AND p.VIEWDATE;
Donnie
Thanks Donnie. The query is running now... for about 5 minutes. :( I'll wait it out and see if it returns the expected/desired data.
Chad
The likely problem is the fact that you're being forced to wrap all of the dates in function calls. This means that it can't use indices on those fields and you end up doing a ton of table scans. If you can find a way around that your perf will go way up.
Donnie
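Donnie's point about function calls defeating indexes can be seen mechanically. A small illustration, with SQLite's `EXPLAIN QUERY PLAN` standing in for MySQL's `EXPLAIN` (the index name is made up):

```python
import sqlite3

# When the indexed column is wrapped in a function, the predicate is no
# longer sargable and the optimizer falls back to a full table scan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE perflog (time INTEGER, page_id TEXT)")
conn.execute("CREATE INDEX idx_time ON perflog (time)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the plan text in their last column.
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function applied to the column: full scan, index unusable.
print(plan("SELECT * FROM perflog "
           "WHERE date(time, 'unixepoch') = '2009-10-27'"))
# Bare column compared against precomputed constants: index range search.
print(plan("SELECT * FROM perflog "
           "WHERE time BETWEEN 1256601600 AND 1256687999"))
```

The same applies in MySQL: comparing the raw `time` column against precomputed epoch bounds lets the optimizer use an index on `time`, while `FROM_UNIXTIME(time, ...)` on the left-hand side forces a scan.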
Something isn't quite right. We're still executing after ~30 minutes.
Chad
See my edit to the solution above.
Donnie
Thanks for the example. Using it, I ran into a problem with opening the temp table multiple times (once for each join), so I created four identical temp tables and altered the joins to use those. After an hour, the query is still running. :(
Chad
Add an index on page_id for each temp table as well. Other than that, I can't see any reason why it should be so slow.
Donnie
Thanks Donnie. I think there is something flawed with either what I am trying to do, or how MySQL supports it. The last run was canceled after 3 hours. Adding the index isn't making a practical difference in this case (still running at 5 minutes). If we can't make any breakthroughs in MySQL, at least I have the PHP fallback. It just bugs me that if I have the algorithm to do it in PHP or MSSQL (in ~2-3 seconds), there should be a way in MySQL. But, I can understand that MySQL and MSSQL have differences in features and limitations. What a nice way to go into the weekend.
Chad
I've had a few day-plus MySQL queries turned into a few seconds by breaking them down into temp tables. If I have time this weekend I'll play with it some locally.
Donnie
I altered your first example (no temp tables) to do comparisons on the time stamp (e.g., 'p60.time BETWEEN (p.time - 5184000) AND p.time') instead of using the FROM_UNIXTIME() function calls. Still no luck.
Chad
A: 

This is the PHP I use to solve this problem. Ideally, I would want this all done by MySQL (if it can be done faster). I only post this as further clarification of the task:

function getUniqueUsage($field = 'page_id', $since = 90){
    //we need to add 90 days onto our date range for the 90-day sum
    $sinceSeconds = strtotime('today') - (($since + 90) * 60 * 60 * 24);
    //==> omitting mySQL connection details<==
    $sql = "SELECT DISTINCT From_unixtime(time,'%Y-%m-%d') AS date, $field FROM perflog WHERE time > $sinceSeconds ORDER BY date" ;
    $sql_results = mysql_query($sql);
    $results = array();
    //all page ids per date (ending-up with only unique date keys)
    while ($row = mysql_fetch_assoc($sql_results))
    {
     $results[$row['date']][] = $row[$field];
    }
    $sums = array();
    //initialize sum array, with only unique dates (days)
    foreach (array_keys($results) as $date){
     $sums[$date] = array(0,0,0);
    }
    //calculate the 30/60/90 day unique pages for each day
    foreach (array_keys($sums) as $ref_date){
     $merges30 = array();
     $merges60 = array();
     $merges90 = array();
     $ref_time = strtotime($ref_date);
     $ref_minus_30 = strtotime("-30 Days",$ref_time);
     $ref_minus_60 = strtotime("-60 Days",$ref_time);
     $ref_minus_90 = strtotime("-90 Days",$ref_time);
     foreach ($results as $result_date => $pages){
      $compare_time = strtotime($result_date);
      if ($compare_time >= $ref_minus_30 && $compare_time <= $ref_time){
       $merges30 = array_merge($merges30, $pages);
      }
      if ($compare_time >= $ref_minus_60 && $compare_time <= $ref_time){
       $merges60 = array_merge($merges60, $pages);
      }
      if ($compare_time >= $ref_minus_90 && $compare_time <= $ref_time){
       $merges90 = array_merge($merges90, $pages);
      }
     }
     $sums[$ref_date] = array(count(array_unique($merges30)),count(array_unique($merges60)),count(array_unique($merges90)));
    }
    //truncate to only specified number of days
    return array_slice($sums,-$since, $since, true);
}

As you can see, there is a lot of unfortunate array-merging and array-uniquing.
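For comparison, the same algorithm expressed with sets (Python here purely to illustrate): one pass buckets pages by day, then each reference day unions the buckets in its windows, so there are no repeated merge/unique passes.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Same post-processing as the PHP above, but with sets doing the
# de-duplication. Sample rows mirror the question's table.
views = [(1256645862, "pageA"), (1256645889, "pageB"),
         (1256647199, "pageA"), (1256647198, "pageA"),
         (1256647300, "pageB"), (1257863235, "pageA"),
         (1257863236, "pageC")]

pages_by_day = defaultdict(set)
for ts, page in views:
    day = datetime.fromtimestamp(ts, timezone.utc).date()
    pages_by_day[day].add(page)          # per-day uniqueness for free

sums = {}
for ref in sorted(pages_by_day):
    windows = {30: set(), 60: set(), 90: set()}
    for day, pages in pages_by_day.items():
        age = (ref - day).days           # days before the reference day
        for span, bucket in windows.items():
            if 0 <= age < span:          # window = ref day + (span-1) prior
                bucket |= pages
    sums[ref] = tuple(len(windows[s]) for s in (30, 60, 90))

for ref, counts in sorted(sums.items()):
    print(ref, counts)                   # 30/60/90-day unique counts per day
```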

Chad
A: 

Note: I am writing this after reading solutions by @astander, @Donnie, @longneck.

I understand that performance is important, but why don't you store aggregates? Ten years of one row per day is only 3,650 rows, with just a few columns each.

TABLE dimDate (DateKey int (PK), Year int, Day int, DayOfWeek varchar(10), DayInEpoch....)
TABLE AggVisits (DateKey int (PK,FK), Today int, Last30 int, Last60 int, Last90 int)

This way you would run the query only once at the end of the day, for one day only. Pre-calculated aggregates are at the root of any high-performance analytic solution (cubes).

UPDATE:
You could speed up those queries by introducing another column, DayInEpoch INT (the day number since, say, 1990-01-01). Then you could remove all those date/time conversion functions.
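A sketch of that nightly job (SQLite in place of MySQL; `AggVisits` and its columns come from the answer above, while the function name and sample rows are mine):

```python
import sqlite3

# Nightly pre-aggregation: compute the four unique-page counts for one
# day and store them, so readers only ever hit the small AggVisits table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE perflog (time INTEGER, page_id TEXT)")
conn.execute("""CREATE TABLE AggVisits (
    DateKey TEXT PRIMARY KEY,
    Today INTEGER, Last30 INTEGER, Last60 INTEGER, Last90 INTEGER)""")
conn.executemany("INSERT INTO perflog VALUES (?, ?)",
                 [(1256645862, "pageA"), (1256647300, "pageB"),
                  (1257863235, "pageA"), (1257863236, "pageC")])

def aggregate_day(conn, day):
    """Run once at end of day, for that day only."""
    def uniques(days_back):
        return conn.execute(
            """SELECT COUNT(DISTINCT page_id) FROM perflog
               WHERE date(time, 'unixepoch')
                     BETWEEN date(?, ?) AND ?""",
            (day, "-%d days" % days_back, day)).fetchone()[0]
    conn.execute("INSERT OR REPLACE INTO AggVisits VALUES (?,?,?,?,?)",
                 (day, uniques(0), uniques(29), uniques(59), uniques(89)))

aggregate_day(conn, "2009-11-10")
print(conn.execute("SELECT * FROM AggVisits").fetchone())
# -> ('2009-11-10', 2, 3, 3, 3)
```

Because each run touches only one day's windows over the raw table, the cost stays flat as history grows, and lookups against AggVisits are trivial.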

Damir Sudarevic
Good question. Since I need the count of *unique* pages for 30/60/90 days, I can't store an aggregate page count for each day. I need each page associated with an individual date so I can calculate the uniques; if I sum up each day's unique page counts, I lose the uniqueness. The data in the table is also used for other purposes. The sample I provided is simplified: I also store the performance data for that single page (how long it took the user to load the page) as well as browser, IP, and user name data. Those are other reasons I need one row per access (instead of an aggregate).
Chad
Rereading, I see now that you are storing the "pre-calculated aggregates" (exactly as you typed; sorry for the oversight). That is a very interesting proposal. I like it, but I also want to minimize the number of *scheduled tasks* we maintain.
Chad