views: 351
answers: 6

I feel like I've seen this question asked before, but neither the SO search nor google is helping me... maybe I just don't know how to phrase the question. I need to count the number of events (in this case, logins) per day over a given time span so that I can make a graph of website usage. The query I have so far is this:

select 
   count(userid) as numlogins, 
   count(distinct userid) as numusers, 
   convert(varchar, entryts, 101) as date 
from 
   usagelog 
group by 
   convert(varchar, entryts, 101)

This does most of what I need (I get a row per date as the output containing the total number of logins and the number of unique users on that date). The problem is that if no one logs in on a given date, there will not be a row in the dataset for that date. I want it to add in rows indicating zero logins for those dates. There are two approaches I can think of for solving this, and neither strikes me as very elegant.

  1. Add a column to the result set that lists the number of days between the start of the period and the date of the current row. When I'm building my chart output, I'll keep track of this value and if the next row is not equal to the current row plus one, insert zeros into the chart for each of the missing days.
  2. Create a "date" table that has all the dates in the period of interest and outer join against it. Sadly, the system I'm working on already has a table for this purpose that contains a row for every date far into the future... I don't like that, and I'd prefer to avoid using it, especially since that table is intended for another module of the system and would thus introduce a dependency on what I'm developing currently.

Any better solutions or hints at better search terms for google? Thanks.

A: 
WITH q(n) AS
          (
          SELECT  0
          UNION   ALL
          SELECT  n + 1
          FROM    q
          WHERE   n < 99
          ),
    qq(n) AS 
          (
          SELECT  0
          UNION   ALL
          SELECT  n + 1
          FROM    qq
          WHERE   n < 99
          ),
    dates AS
          (
          SELECT  q.n * 100 + qq.n AS ndate
          FROM    q, qq
          )
SELECT    COUNT(userid) as numlogins,
          COUNT(DISTINCT userid) as numusers,
          CAST('2000-01-01' AS DATETIME) + ndate as date
FROM      dates
LEFT JOIN
          usagelog
ON        entryts >= CAST('2000-01-01' AS DATETIME) + ndate
          AND entryts < CAST('2000-01-01' AS DATETIME) + ndate + 1
GROUP BY
          ndate

This will select up to 10,000 dates constructed on the fly, which should be enough for roughly 27 years.

SQL Server has a limit of 100 recursions per CTE, which is why each of the inner queries can return at most 100 rows.

If you need more than 10,000, just add a third CTE qqq(n) and cross-join with it in dates.
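The same technique can be sketched in SQLite (driven from Python), whose `WITH RECURSIVE` syntax differs slightly from T-SQL's; the `usagelog` table and sample rows below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usagelog (userid INTEGER, entryts TEXT)")
# Two logins on Jan 1, none on Jan 2, one on Jan 3
conn.executemany("INSERT INTO usagelog VALUES (?, ?)",
                 [(1, "2000-01-01"), (2, "2000-01-01"), (1, "2000-01-03")])

rows = conn.execute("""
    WITH RECURSIVE dates(d) AS (
        SELECT '2000-01-01'
        UNION ALL
        SELECT date(d, '+1 day') FROM dates WHERE d < '2000-01-03'
    )
    SELECT d,
           COUNT(u.userid)          AS numlogins,
           COUNT(DISTINCT u.userid) AS numusers
    FROM dates
    LEFT JOIN usagelog u ON u.entryts = d
    GROUP BY d
    ORDER BY d
""").fetchall()

print(rows)  # [('2000-01-01', 2, 2), ('2000-01-02', 0, 0), ('2000-01-03', 1, 1)]
```

Note how the `LEFT JOIN` against the generated date list produces a zero-count row for Jan 2, which a plain `GROUP BY` on `usagelog` alone would omit.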

Quassnoi
SQL Server does not have a limitation of 100 rows per CTE. I think it has a limit of 100 recursions in a CTE, but that is very different.
Tom H.
Just checked, and actually the DEFAULT limit is 100 recursions. You can set that with MAXRECURSION up to 32,767
Tom H.
Sure, you're right
Quassnoi
A: 

Create an in-memory table (a table variable), insert your date ranges into it, then outer join the logins table against it. Group by your start date; then you can perform your aggregations and calculations.

Adam Robinson
+1  A: 

The strategy I normally use is to UNION with the opposite of the query, generally a query that retrieves data for rows that don't exist.

If I wanted to get the average mark for a course, but some courses weren't taken by any students, I'd need to UNION with those not taken by anyone to display a row for every class:

SELECT AVG(mark), course FROM `marks` GROUP BY course
    UNION
SELECT NULL, course FROM courses WHERE course NOT IN
    (SELECT course FROM marks)

Your query will be more complex, but the same principle should apply. You may indeed need a table of dates for your second query.
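A runnable sketch of this UNION pattern in SQLite (via Python); the `courses`/`marks` schema and data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE courses (course TEXT);
    CREATE TABLE marks (course TEXT, mark REAL);
    INSERT INTO courses VALUES ('art'), ('math'), ('music');
    INSERT INTO marks VALUES ('math', 80), ('math', 90), ('art', 70);
""")

rows = conn.execute("""
    SELECT AVG(mark), course FROM marks GROUP BY course
    UNION
    SELECT NULL, course FROM courses
    WHERE course NOT IN (SELECT course FROM marks)
    ORDER BY course
""").fetchall()

print(rows)  # [(70.0, 'art'), (85.0, 'math'), (None, 'music')]
```

The second half of the UNION contributes a NULL-average row for 'music', the course no student has taken, so every course appears in the output.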

David Caunt
+3  A: 

Frankly, I'd do this programmatically when building the final output. You're essentially trying to read something from the database which is not there (data for days that have no data). SQL isn't really meant for that sort of thing.

If you really want to do that, though, a "date" table seems your best option. To make it a bit nicer, you could generate it on the fly, using e.g. your DB's date functions and a derived table.
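Filling the gaps programmatically when building the chart output might look like the following Python sketch; the function name and the sparse `(date, count)` rows are made up for illustration:

```python
from datetime import date, timedelta

def fill_missing_days(rows, start, end):
    """Expand sparse (date, numlogins) rows into one entry per day,
    inserting zero counts for days with no logins."""
    counts = dict(rows)
    out = []
    d = start
    while d <= end:
        out.append((d, counts.get(d, 0)))
        d += timedelta(days=1)
    return out

# Sparse query result: no row for Jan 2
rows = [(date(2000, 1, 1), 2), (date(2000, 1, 3), 1)]
filled = fill_missing_days(rows, date(2000, 1, 1), date(2000, 1, 3))
print(filled)
# [(datetime.date(2000, 1, 1), 2), (datetime.date(2000, 1, 2), 0), (datetime.date(2000, 1, 3), 1)]
```

This keeps the SQL simple at the cost of a small loop in the application layer, which is often the pragmatic trade-off for chart rendering.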

sleske
This is also a good idea :)
David Caunt
I ended up using the existing date table in my app and just tolerating the dependency it introduced. It was the fastest solution.
rmeador
A: 

  1. Create a temp table, insert the dates in the range, and do a left outer join with the usagelog.
  2. Programmatically insert the missing dates while evaluating the result set to produce the final output.

kishore
+1  A: 

I had to do exactly the same thing recently. This is how I did it in T-SQL (YMMV on speed, but I've found it performant enough over a couple of million rows of event data):

DECLARE @DaysTable TABLE ( [Year] INT, [Day] INT )

DECLARE @StartDate DATETIME
SET @StartDate = whatever

WHILE (@StartDate <= GETDATE())
BEGIN

  INSERT INTO @DaysTable ( [Year], [Day] )
  SELECT DATEPART(YEAR, @StartDate), DATEPART(DAYOFYEAR, @StartDate)

  SELECT @StartDate = DATEADD(DAY, 1, @StartDate)
END

-- This gives me a table of all days since whenever
-- (you could select @StartDate as the minimum date of your usage log)

SELECT days.Year, days.Day, events.NumEvents
FROM @DaysTable AS days
LEFT JOIN (
  SELECT
    COUNT(*) AS NumEvents,
    DATEPART(YEAR, LogDate) AS [Year],
    DATEPART(DAYOFYEAR, LogDate) AS [Day]
  FROM LogData
  GROUP BY
    DATEPART(YEAR, LogDate),
    DATEPART(DAYOFYEAR, LogDate)
) AS events ON days.Year = events.Year AND days.Day = events.Day
Keith Williams