views:

246

answers:

3

Using the Stack Overflow public data dump, I've created three simple tables:

  • Questions (Question_Id, View_Count, Creation_Date)
  • Tags (Tag_Name)
  • QuestionTags (Question_Id, Tag_Name)

The Questions table has hundreds of thousands of rows with Creation_Date spanning from a year ago to today. Looking over the data, there are two notable trends:

  • Number of Questions Increase by Period - for example, there were more questions asked this month than three months ago
  • Questions Views have a Long Tail - by looking at views based on weeks open, we can see that most of a question's views occur in the first week; a lesser amount in the second and third; and a long, constant tail in the following weeks

If neither of these factors came in to play, it'd be fairly trivial to estimate traffic for a given tag (or group of tags) over a month:

SELECT YEAR(Q.Creation_Date)
      ,MONTH(Q.Creation_Date)
      ,SUM( Q.View_Count / DATEDIFF(m,Q.Creation_Date,GETDATE()) )
  FROM Questions Q
       JOIN QuestionTags QT
         ON Q.Question_Id = QT.Question_Id
 WHERE QT.Tag_Name IN ('c#','.net', ... )
GROUP BY YEAR(Q.Creation_Date), MONTH(Q.Creation_Date)
ORDER BY 1,2

But because of aforementioned factors (especially the long tail), I'm not sure how to approximate views. My thoughts are to create a function that, using the long tail formula, will calculates views for a month based on the current number of views and weeks open.

Here's what I came up with to find the tail:

DECLARE @SDTE DATETIME, @EDTE DATETIME
SELECT @SDTE = '2009-01-11' -- after new years holiday
      ,@EDTE = CAST( MAX([Creation_Date])  AS INT )
  FROM [Questions]


SELECT [DaysOpen_Count]
      ,AVG( [WView_Count] ) 
       FROM
  (
  SELECT QT.[Tag_Name], 
      Q.[View_Count],
      [DaysOpen_Count] = DATEDIFF(DAY, Q.[Creation_Date], @EDTE),
      [WView_Count] = CAST( Q.[View_Count] / ( DATEDIFF(DAY, Q.[Creation_Date], @EDTE) / 7.0 ) AS INT )
    FROM [Questions] Q
      INNER JOIN [QuestionTags] QT
        ON Q.[Question_Id] = QT.[Question_Id]
   WHERE [Tag_Name] IN ('c#','.net',...)
     AND [Creation_Date] < @EDTE
  ) Q
GROUP BY [DaysOpen_Count]
ORDER BY 1,2

How should I proceed to create this SQL query?

The end goal is a Stored PRocedure that inputs a CSV-string of tags and spits out the past six month's page views for those tags.

UPDATE After "earning" the tumbleweed badge, I figured it was time for a bounty!

+2  A: 

Hello,

I thought of this method for estimating the tail:

for a list of tags, for each question in these tags give the 1st month after creation_date 80% of View_Count give the 2nd month after creation_date 10% of View_Count split 10% equally between the remaining months until today

of course 80%, 10% is just a choice of mine, they can be calculated more precisely based on real data. Also, the second month 10% may be eliminated. All that logic is in the: CASE WHEN diff ... part.

you obtain estimated view_count/question/month

then all you have to do is sum view_count per month and if you want a window of time add a condition on the month

I created a stored procedure that can do this, but you have to create first a temporary table #tags(Tag_name) where you put the desired tags.

CREATE PROCEDURE GetTagViews @startDate datetime, @endDate datetime
As

IF exists (SELECT null FROM sysobjects WHERE name = '#months' and type = 'U')
    DROP TABLE #MONTHS

CREATE TABLE #MONTHS
(
    month datetime
)

DECLARE @currMonth datetime
SELECT @currMonth = MIN(Creation_Date) FROM Questions

-- Populate #MONTHS with all the months from the oldest
-- question creation_date to Today
WHILE @currMonth < getdate()
BEGIN
    -- insert date starting at the beginning og the month
    INSERT INTO #MONTHS select @currMonth - day(@currMonth) + 1
    SELECT @currMonth = dateadd(m, 1, @currMonth) -- advance 1 month
END

SELECT YEAR(month) y, MONTH(month) m, SUM(curr_month_views) Views FROM (
SELECT Q1.month, Q1.diff, round(
CASE WHEN diff = dmin and diff = dmax THEN View_Count
     WHEN diff = dmin and diff < dmax THEN 0.8*View_Count
     WHEN diff = dmin+1 and diff < dmax THEN 0.1*View_Count
     WHEN diff = dmin+1 and diff = dmax THEN 0.2*View_Count
     WHEN diff >= dmin+2 THEN 0.1/(dmax - (dmin+2) + 1)*View_Count
     ELSE 0
END, 0) curr_month_views
FROM (
SELECT  Q.question_id, m.month, 
    DATEDIFF(m, Q.Creation_Date, m.month) diff, 
    Q.View_Count, dmin, dmax
FROM Questions Q,
     #MONTHS m, 
     (SELECT MIN(DATEDIFF(m, Q.Creation_Date, m.month)) [dmin], 
             MAX(DATEDIFF(m, Q.Creation_Date, m.month)) [dmax]
        FROM Questions Q,#MONTHS m
       WHERE DATEDIFF(m, Q.Creation_Date, m.month) >= 0) MINMAX
) Q1 join QuestionTags QT on Q1.question_id = QT.question_id 
     join #tags on #tags.Tag_Name = QT.Tag_Name
) b WHERE month >= @startDate - day(@startDate) + 1
      AND month <= @enddate - day(@enddate) + 1
GROUP BY Year(month), Month(month)
ORDER BY 1, 2

If I run this procedure with the following data:

Question_Id View_Count  Creation_Date                  tag_name         
----------- ----------- ------------------------------ ----------
0           42          2009-09-10 00:00:00.000        sql
1           326         2008-08-04 00:00:00.000        sql
2           377         2008-08-04 00:00:00.000        sql
3           568         2008-08-03 00:00:00.000        sql
4           839         2008-08-01 00:00:00.000        sql
5           228         2009-03-01 00:00:00.000        sql
6           178         2009-03-11 00:00:00.000        sql
7           348         2009-08-11 00:00:00.000        c#

populate #tags with 'sql'

GetTagViews '20090501', '20091001'

    y           m           Views                                    
    ----------- ----------- ---------------
    2009        5           21.000000000000
    2009        6           21.000000000000
    2009        7           21.000000000000
    2009        8           21.000000000000
    2009        9           55.000000000000

populate #tags with 'c#'

GetTagViews '20090501', '20091001'

    y           m           Views                                    
    ----------- ----------- ---------------------------------------- 
    2009        5           .000000000000
    2009        6           .000000000000
    2009        7           .000000000000
    2009        8           278.000000000000
    2009        9           35.000000000000

populate #tags with both 'sql' & 'c#'

GetTagViews '20090501', '20091001'

    y           m           Views                                    
    ----------- ----------- ----------------
    2009        5           21.000000000000
    2009        6           21.000000000000
    2009        7           21.000000000000
    2009        8           299.000000000000
    2009        9           90.000000000000

(you see that peak for (sql, c#) comparing to only (sql) for 2009-08, it's due to the c# question being asked that month.)

N.B.: the rouding of estimates my lead to a difference of some views (~1) if you sum up detailed views and compare to the original data for a given question!

najmeddine
this is very nice, thanks; i'm going to give it a shot
Alex Papadimoulis
+5  A: 

You'll need to consider an exponential Views decay curve, something similar to this - http://en.wikipedia.org/wiki/Exponential%5Fdecay

What we need here is the area under the curve upto desired time (in days).

If you do the math, you'll come to a result

Views = V/λ[1 - e^(-λt)]

t is (date created - today's date - 1)

V is the view count we have

λ can be 2ln2/T or 1.4/T

T can be a major life time like 5 days or 7 days. Lets take it 5.

We're making a lot of assumptions here cuz of the dynamic nature of SO. But I'm positive that it yields nice results.

All you have to do now is substitute the appropriate values and get views.

Arpit Tambi
This looks very close to what I was looking for; as for dynamic nature... it all averages out, especially once we drop the "freak" questions like http://stackoverflow.com/questions/84556/whats-your-favorite-programmer-cartoon
Alex Papadimoulis
You may have to fine tune the formula, especially the V and decay rate λ. Decay rate for such questions is very low (1/25 ?).
Arpit Tambi
A: 

To emulate long tails just introduce a constant. Or use logarithmic function.

your_formula(delta_t) + C

1 / (1 + log(1 + delta_t))

(coefficients are omitted)

serge_bg