Using the Stack Overflow public data dump, I've created three simple tables:
- Questions (Question_Id, View_Count, Creation_Date)
- Tags (Tag_Name)
- QuestionTags (Question_Id, Tag_Name)
The Questions table has hundreds of thousands of rows with Creation_Date spanning from a year ago to today. Looking over the data, there are two notable trends:
- Number of Questions Increase by Period - for example, there were more questions asked this month than three months ago
- Questions Views have a Long Tail - by looking at views based on weeks open, we can see that most of a question's views occur in the first week; a lesser amount in the second and third; and a long, constant tail in the following weeks
If neither of these factors came in to play, it'd be fairly trivial to estimate traffic for a given tag (or group of tags) over a month:
SELECT YEAR(Q.Creation_Date)
,MONTH(Q.Creation_Date)
,SUM( Q.View_Count / DATEDIFF(m,Q.Creation_Date,GETDATE()) )
FROM Questions Q
JOIN QuestionTags QT
ON Q.Question_Id = QT.Question_Id
WHERE QT.Tag_Name IN ('c#','.net', ... )
GROUP BY YEAR(Q.Creation_Date), MONTH(Q.Creation_Date)
ORDER BY 1,2
But because of aforementioned factors (especially the long tail), I'm not sure how to approximate views. My thoughts are to create a function that, using the long tail formula, will calculates views for a month based on the current number of views and weeks open.
Here's what I came up with to find the tail:
DECLARE @SDTE DATETIME, @EDTE DATETIME
SELECT @SDTE = '2009-01-11' -- after new years holiday
,@EDTE = CAST( MAX([Creation_Date]) AS INT )
FROM [Questions]
SELECT [DaysOpen_Count]
,AVG( [WView_Count] )
FROM
(
SELECT QT.[Tag_Name],
Q.[View_Count],
[DaysOpen_Count] = DATEDIFF(DAY, Q.[Creation_Date], @EDTE),
[WView_Count] = CAST( Q.[View_Count] / ( DATEDIFF(DAY, Q.[Creation_Date], @EDTE) / 7.0 ) AS INT )
FROM [Questions] Q
INNER JOIN [QuestionTags] QT
ON Q.[Question_Id] = QT.[Question_Id]
WHERE [Tag_Name] IN ('c#','.net',...)
AND [Creation_Date] < @EDTE
) Q
GROUP BY [DaysOpen_Count]
ORDER BY 1,2
How should I proceed to create this SQL query?
The end goal is a Stored PRocedure that inputs a CSV-string of tags and spits out the past six month's page views for those tags.
UPDATE After "earning" the tumbleweed badge, I figured it was time for a bounty!