views:

56

answers:

5

I think this is a pretty common issue, but I don't know what the process is called, so I'll describe it with an example. The concept is that I want to join a sparse dataset to a complete series, such as the days of the week, months of the year, or any ordered set (for example, for ranking). Empty positions in the sparse data will show as NULL alongside the complete series.

Let's say I run the following query in SQL Server to find out monthly sales.

SELECT
    YEAR([timestamp]),
    MONTH([timestamp]),
    COUNT(*)
FROM table1
WHERE YEAR([timestamp]) = YEAR(GETDATE())
GROUP BY
    YEAR([timestamp]),
    MONTH([timestamp])
ORDER BY
    YEAR([timestamp]) DESC,
    MONTH([timestamp]) DESC;

If, however, sales only occurred in May and August of this year, for example, then the return result would look like this:

2010    August    1234
2010    May       5678

I want my return result set to look like this:

2010    January
2010    February
2010    March
2010    April
2010    May        1234
2010    June
2010    July
2010    August     5678
2010    September
2010    October
2010    November
2010    December

The only way I know to do this is this:

SELECT
    YEAR(GETDATE()),
    month_index.month_name,
    sales_data.sales
FROM (
    SELECT 'January' as month_name, 1 as month_number
    UNION
    SELECT 'February', 2
    UNION
    SELECT 'March', 3
    UNION
    SELECT 'April', 4
    UNION
    SELECT 'May', 5
    UNION
    SELECT 'June', 6
    UNION
    SELECT 'July', 7
    UNION
    SELECT 'August', 8
    UNION
    SELECT 'September', 9
    UNION
    SELECT 'October', 10
    UNION
    SELECT 'November', 11
    UNION
    SELECT 'December', 12
) as month_index
LEFT JOIN (
    SELECT
        YEAR([timestamp]) AS year_name,
        MONTH([timestamp]) AS month_name,
        COUNT(*) AS sales
    FROM table1
    WHERE YEAR([timestamp]) = GETDATE()
    GROUP BY
        YEAR([timestamp]),
        MONTH([timestamp])
) AS sales_data
ON month_index.month_name = sales_data.month_name
ORDER BY
    month_index.month_number DESC;

Is there a better way to create complete date and alphanumeric series to join data onto? And what is this called?

Thanks!

+1  A: 

how about making a new table called Months: then populating it with data you can join to?

Randy
Thanks for the suggestion Randy. I have done that. I've created many "complete series" tables and joined on them. I was hoping there was a better way. Also, I was hoping someone could tell me what this concept is called.
Kuyenda
+2  A: 

Queries like this are one of the main reasons that many experienced DBAs or database programmers keep a calendar table in their databases.

Toby
+1 for the link to calendar table.
Gilbert Le Blanc
I have also kept a table of a through z and 0 through 999999. That was for PHP. Thanks!
Kuyenda
A calendar/date/numbers table is the only means of doing this on SQL Server 2000 or older. For SQL Server 2005+, a recursive CTE is more efficient.
OMG Ponies
The recursive CTE is nice, but the 100 loop cap is a bummer.
Kuyenda
+2  A: 

try something like this:

DECLARE @StartDate datetime
       ,@EndDate datetime
SELECT @StartDate=DATEADD(month,-6,DATEADD(month,DATEDIFF(month,0,GETDATE()),0) )
      ,@EndDate=GETDATE()

;with AllDates AS
(
    SELECT @StartDate AS DateOf
    UNION ALL
    SELECT DateAdd(month,1,DateOf)
        FROM AllDates
    WHERE DateOf<@EndDate
)
SELECT * FROM AllDates

output:

DateOf
-----------------------
2009-12-01 00:00:00.000
2010-01-01 00:00:00.000
2010-02-01 00:00:00.000
2010-03-01 00:00:00.000
2010-04-01 00:00:00.000
2010-05-01 00:00:00.000
2010-06-01 00:00:00.000
2010-07-01 00:00:00.000

(8 row(s) affected)
KM
+2  A: 

I'm with KM, that for SQL Server 2005+ you could use a recursive CTE:

WITH months AS (
  SELECT DATENAME(mm, '2010-01-01') AS month_name, 
         MONTH('2010-01-01') AS month_number, 
         CAST('2010-01-01' AS DATETIME) AS dt
  UNION ALL
  SELECT DATENAME(mm, DATEADD(mm, 1, m.dt)),
         MONTH(DATEADD(mm, 1, m.dt)),
         DATEADD(mm, 1, m.dt)
    FROM months m
   WHERE DATEADD(mm, 1, m.dt) <= '2010-12-01')
   SELECT x.month_name,
          y.*
     FROM months x
LEFT JOIN your_table y ON MONTH(y.date) = x.month_number

After all, the last time KM & chatted about this - we found a recursive CTE to be slightly more efficient than using a numbers table.

OMG Ponies
+2  A: 

I like this approach to build the months table:

SELECT 
  DATENAME(mm, date_val) AS month_name,  
  MONTH(date_val) AS month_number,  
  date_val as dt
FROM ( 
  SELECT DATEADD(mm, number, '2010-01-01') AS date_val
  FROM master.dbo.spt_values
  WHERE type = 'P'
  AND number BETWEEN 0 AND 11
) months

Based on my tests, it's faster than a CTE. I'm running SQL Server 2008 Express.

Here are the test results, using SET STATISTICS IO ON and SET STATISTICS TIME ON

CTE:

(12 row(s) affected)
Table 'Worktable'. Scan count 2, logical reads 73, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 15 ms,  elapsed time = 64 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Subquery:

(12 row(s) affected)
Table 'spt_values'. Scan count 1, logical reads 2, physical reads 2, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 4 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Although your original question is asking what this is called. I don't know a name for it; maybe something like "left outer join against a series?"

One additional part to add: when you join against the months table, or even when do your original query, it's typically recommended to avoid using a function like YEAR([timestamp]) on the left side of your WHERE clause.

So this code:

SELECT                     
    YEAR([timestamp]),                     
    MONTH([timestamp]),                     
    COUNT(*)                     
FROM table1                     
WHERE YEAR([timestamp]) = YEAR(GETDATE())                     
GROUP BY                     
    YEAR([timestamp]),                     
    MONTH([timestamp])

...will cause an index scan (assuming timestamp is indexed) because of the YEAR([timestamp]) must be evaluted for every row. On a 1m+ row table, this will mean poor performance.

So you will typically see a recommendation like this instead:

SELECT                     
    YEAR([timestamp]),                     
    MONTH([timestamp]),                     
    COUNT(*)                     
FROM #table1                     
WHERE [timestamp] >= DATEADD(YY, DATEDIFF(YY, 0, GETDATE()), 0) -- First day of this year
AND   [timestamp] < DATEADD(YY, DATEDIFF(YY, 0, GETDATE()) + 1, 0) -- First day of next year
GROUP BY                     
    YEAR([timestamp]),                     
    MONTH([timestamp])

This will use an index seek (again, assuming timestamp is an indexed column) and result in fewer logical reads and thus a faster response. This can be confirmed by checking the execution plan.

8kb
Thanks 8kb! What about creating a memory table by looping through whatever integers or other values you need and left joining on that? The memory table approach seems more flexible, and you always know what's in the table. One thing I don't like about the spt_values and recursive CTE based solutions is the limit. There's no limit on a memory table.
Kuyenda
In my experience, a Numbers table or Tally table is a typically considered a best practice so I would agree. But in terms of limitations, with a CTE you can set the MAXRECURSION level up to 32676 which seems more than adequate for dates. I used MAXRECURSION 2000 on OMG's query and was able to get 100 years worth of months. But for spt_values, it's true you're limited to 2048 rows.
8kb