ansaurus

Question

MySQL: Rewrite MSSQL with correlated subquery in FROM clause

Answer 1

A:

why do you have the subquery buried in a second level like that? try this instead:

SELECT DISTINCT
 FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'DATE', 
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '30D',
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '60D',
 (SELECT COUNT(DISTINCT SUB.PAGE_ID) FROM perflog WHERE FROM_UNIXTIME(time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')) AS '90D'
FROM PERFLOG P
ORDER BY 'DATE'

longneck 2009-11-20 15:55:30

Thank you for the quick reply. I tried your suggestion (correcting the SELECT references to SUB):It is still *running* after a few minutes. I'll wait it out to see what it returns, but, assuming it returns the correct data, at this point it would take too long to be practical. :(

Chad 2009-11-20 16:03:35

Answer 2

A:

You can try using a single select.

Select the values between the date and 90 days prior only.

Then use a case statement in each fiels to check if the date falls between 30, 60, 90. For each field if the case is true, then 1 else 0, and count those.

Something like

SELECT  SUM(CASE WHEN p.Date IN 30 PERIOD THEN 1 ELSE 0 END) Cnt30,
     SUM(CASE WHEN p.Date IN 60 PERIOD THEN 1 ELSE 0 END) Cnt60,
     SUM(CASE WHEN p.Date IN 90 PERIOD THEN 1 ELSE 0 END) Cnt90
FROM    Table
WHERE p.Date IN 90 PERIOD

astander 2009-11-20 16:01:18

Thanks for the reply. I'm not sure how to plug my conditionals into a CASE statement like that, having never used them. My first attempts failed to pass syntax checks. I'll need to do go do some more reading.

Chad 2009-11-20 16:20:18

Have a look at this for case statements http://dev.mysql.com/doc/refman/5.0/en/case-statement.html

astander 2009-11-20 16:21:33

Answer 3

A:

Change the subselects into joins, as such:

select
  FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'DATE',
  count(distinct p30.page_id) AS '30D',
  count(distinct p60.page_id) AS '60D',
  count(distinct p90.page_id) AS '90D'
from
  perflog p
  join perflog p30 on FROM_UNIXTIME(p30.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 30 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')
  join perflog p60 on FROM_UNIXTIME(p60.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 60 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')
  join perflog p90 on FROM_UNIXTIME(p90.time,'%Y-%m-%d') BETWEEN DATE_SUB(FROM_UNIXTIME(P.time,'%Y-%m-%d'), INTERVAL 90 DAY) AND FROM_UNIXTIME(P.time,'%Y-%m-%d')

However, that's likely to run slowly because of the pile of functions killing any indicies on your date columns, a better solution might be:

create temporary table perf_tmp as
select
  FROM_UNIXTIME(P.time,'%Y-%m-%d') AS 'VIEWDATE',
  page_id
from
  perflog;

create index perf_dt on perf_tmp (VIEWDATE);

select
  VIEWDATE, 
  count(distinct p30.page_id) AS '30D',
  count(distinct p60.page_id) AS '60D',
  count(distinct p90.page_id) AS '90D'
from
  perf_tmp p
  join perf_tmp p30 on p30.VIEWDATE BETWEEN DATE_SUB(P.VIEWDATE, INTERVAL 30 DAY) AND p.VIEWDATE
  join perf_tmp p60 on p60.VIEWDATE BETWEEN DATE_SUB(P.VIEWDATE, INTERVAL 60 DAY) AND p.VIEWDATE
  join perf_tmp p90 on p90.VIEWDATE BETWEEN DATE_SUB(P.VIEWDATE, INTERVAL 90 DAY) AND p.VIEWDATE;

Donnie 2009-11-20 16:09:36

Thanks Donnie. The query is running now... for about 5 minutes. :( I'll wait it out and see if it returns the expected/desired data.

Chad 2009-11-20 16:22:56

The likely problem is the fact that you're being forced to wrap all of the dates in function calls. This means that it can't use indices on those fields and you end up doing a ton of table scans. If you can find a way around that your perf will go way up.

Donnie 2009-11-20 16:30:32

Something isn't quite right. We're still executing after ~30 minutes.

Chad 2009-11-20 17:12:27

See my edit to the solution above.

Donnie 2009-11-20 17:19:45

Thanks for the example. Using it, I ran into a problem with opening the temp table multiple times (once for each join). So, I created 4 temp tables (each identical), and altered the joins to use those.After an hour, the query is still running. :(

Chad 2009-11-20 18:40:28

Add an index to page_id for each temp table as well.Other than that, I can't see any reason why it should go so slow.

Donnie 2009-11-20 18:45:34

Thanks Donnie. I think there is something flawed with either what I am trying to do, or how MySQL supports it. The last run was canceled after 3 hours. Adding the index isn't making a practical difference in this case (still running at 5 minutes). If we can't make any breakthroughs in MySQL, at least I have the PHP fallback. It just bugs me that if I have the algorithm to do it in PHP or MSSQL (in ~2-3 seconds), there should be a way in MySQL. But, I can understand that MySQL and MSSQL have differences in features and limitations. What a nice way to go into the weekend.

Chad 2009-11-20 21:33:19

I've had a few day+ MySQL queries turned into a few seconds by breaking them down into temp tables. If I have time this weekend I'll play with it some locally.

Donnie 2009-11-21 00:29:14

I altered your first example (no temp tables) to do comparisons on the time stamp (e.g., 'p60.time BETWEEN (p.time - 5184000) AND p.time') instead of using the FROM_UNIXTIME() function calls. Still no luck.

Chad 2009-11-23 15:37:07

Answer 4

A:

This is the PHP I use to solve this problem. Ideally, I would want this all done by MySQL (if it can be done faster). I only post this as further clarification of the task:

function getUniqueUsage($field = 'page_id', $since = 90){
    //we need to add 90 days onto our date range for the 90-day sum
    $sinceSeconds = mktime(0, 0, 0, $m , $d, $y) - (($sinceDays + 90) * (60 * 60 * 24));
    //==> omitting mySQL connection details<==
    $sql = "SELECT DISTINCT From_unixtime(time,'%Y-%m-%d') AS date, $field FROM perflog WHERE time > $sinceSeconds ORDER BY date" ;
    $sql_results = mysql_query($sql);
    $results = array();
    //all page ids per date (ending-up with only unique date keys)
    while ($row = mysql_fetch_assoc($sql_results))
    {
     $results[$row['date']][] = $row[$field];
    }
    $sums = array();
    //initialize sum array, with only unique dates (days)
    foreach (array_keys($results) as $date){
     $sums[$date] = array(0,0,0);
    }
    //calculate the 30/60/90 day unique pages for each day
    foreach (array_keys($sums) as $ref_date){
     $merges30 = array();
     $merges60 = array();
     $merges90 = array();
     $ref_time = strtotime($ref_date);
     $ref_minus_30 = strtotime("-30 Days",$ref_time);
     $ref_minus_60 = strtotime("-60 Days",$ref_time);
     $ref_minus_90 = strtotime("-90 Days",$ref_time);
     foreach ($results as $result_date => $pages){
      $compare_time = strtotime($result_date);
      if ($compare_time >= $ref_minus_30 && $compare_time <= $ref_time){
       $merges30 = array_merge($merges30, $pages);
      }
      if ($compare_time >= $ref_minus_60 && $compare_time <= $ref_time){
       $merges60 = array_merge($merges60, $pages);
      }
      if ($compare_time >= $ref_minus_90 && $compare_time <= $ref_time){
       $merges90 = array_merge($merges90, $pages);
      }
     }
     $sums[$ref_date] = array(count(array_unique($merges30)),count(array_unique($merges60)),count(array_unique($merges90)));
    }
    //truncate to only specified number of days
    return array_slice($sums,-$since, $since, true);
}

As you can see, a lot of unfortunate array merge-ing and array unique-ing.

Chad 2009-11-20 17:06:52

Answer 5

A:

Note: I am writing this after reading solutions by @astander, @Donnie, @longneck.

I understand that performance is important, but why don't you store aggregates? Ten years of day-per-row is 3650 rows with only few columns each.

TABLE dimDate (DateKey int (PK), Year int, Day int, DayOfWeek varchar(10), DayInEpoch....)
TABLE AggVisits (DateKey int (PK,FK), Today int, Last30 int, Last60 int, Last90 int)

This way you would run the query only once at the end of the day, for one day only. Pre-calculated aggregates are at the root of any high-performance analytic solution (cubes).

UPDATE:
You could speed up those queries by introducing another column DayInEpoch int (day number since say 1990-01-01). Then you can remove all those date/time conversion functions.

Damir Sudarevic 2009-11-20 21:00:01

Good question. Since I need the count of *unique* pages for 30/60/90 days I can't store an aggregate page count for each day. I need each page associated with an individual date so I can calculate the uniques. If I sum-up each day's unique pages, I lose the uniqueness.The data in the table is also used for other means. The sample I provided is simplified. I also store the performance data for that single page (how long it took the user to load the page) as well as browser, IP, and user name data. Those are other reasons that I need one row per access (instead of an aggregate).

Chad 2009-11-20 21:22:53

Rereading, I see now you are storing the "pre-calculated aggregates" (exactly as you typed, sorry for the oversight). That is a very interesting proposal. I like it, but also want to minimize the number of *scheduled tasks* we maintain.

Chad 2009-11-20 21:46:56

ansaurus

tags:

views:

answers:

MySQL: Rewrite MSSQL with correlated subquery in FROM clause

related questions