views:

93

answers:

3

Lets say I have a database table called "Scrape" possibly setup like:

UserID (int)   
UserName (varchar)  
Wins (int)   
Losses (int)  
ScrapeDate (datetime)

I'm trying to be able to rank my users based on their Wins/Loss ratio. However, each week I'll be scraping for new data on the users and making another entry in the Scrape table.

How can I query a list of users sorted by wins/losses, but only taking into consideration the most recent entry (ScrapeDate)?

Also, do you think it matters that people will be hitting the site and the scrape may possibly be in the middle of completing?

For example I could have:

1 - Bob - Wins: 320 - Losses: 110 - ScrapeDate: 7/8/09  
1 - Bob - Wins: 360 - Losses: 122 - ScrapeDate: 7/17/09  
2 - Frank - Wins: 115 - Losses: 20 - ScrapeDate: 7/8/09

Where, this represents a scrape that has only updated Bob so far, and is in the process of updating Frank but has yet to be inserted. How would you handle this situation as well?

So, my question is:

  1. How would you handle querying only the most recent scrape of each user to determine the rankings
  2. Do you think the fact that the database may be in a state of updating (especially if a scrape could take up to 1 day to complete), and not all users have completely updated yet matters? If so, how would you handle this?

Thank you, and thank you for your responses you have given me on my related question:

http://stackoverflow.com/questions/1397794/when-scraping-a-lot-of-stats-from-a-webpage-how-often-should-i-insert-the-collec

A: 

The answer to part one of your question depends on the version of SQL server you are using - SQL 2005+ offers ranking functions which make this kind of query a bit simpler than in SQL 2000 and before. I'll update this with more detail if you will indicate which platform you're using.

I suspect the clearest way to handle part 2 is to display the stats for the latest complete scraping exercise, otherwise you aren't showing a time-consistent ranking (although, if your data collection exercise takes 24 hours, there's a certain amount of latitude already).

To simplify this, you could create a table to hold metadata about each scrape operation, giving each one an id, start date and completion date (at a minimum), and display those records which relate to the latest complete scrape. To make this easier, you could remove the "scrape date" from the data collection table, and replace it with a foreign key linking each data row to a row in the scrape table.

EDIT

The following code illustrates how to rank users by their latest score, regardless of whether they are time-consistent:

create table #scrape
(userName varchar(20)
,wins int
,losses int
,scrapeDate datetime
)

INSERT #scrape
      select 'Alice',100,200,'20090101'
union select 'Alice',120,210,'20090201'
union select 'Bob'  ,200,200,'20090101'
union select 'Clara',300,100,'20090101'
union select 'Clara',300,210,'20090201'
union select 'Dave' ,100,10 ,'20090101'


;with latestScrapeCTE
AS
(
        SELECT *
               ,ROW_NUMBER() OVER (PARTITION BY userName
                                   ORDER BY scrapeDate desc
                                  ) AS rn
               ,wins + losses AS totalPlayed
               ,wins - losses as winDiff
        from #scrape
)
SELECT userName
       ,wins
       ,losses
       ,scrapeDate
       ,winDiff
       ,totalPlayed
       ,RANK() OVER (ORDER BY winDiff desc
                              ,totalPlayed desc
                    ) as rankPos
FROM latestScrapeCTE
WHERE rn = 1
ORDER BY rankPos

EDIT 2

An illustration of the use of a metadata table to select the latest complete scrape:

create table #scrape_run
(runID int identity
,startDate datetime
,completedDate datetime
)

create table #scrape
(userName varchar(20)
,wins int
,losses int
,scrapeRunID int
)


INSERT #scrape_run
select '20090101', '20090102'
union select '20090201', null --null completion date indicates that the scrape is not complete

INSERT #scrape
      select 'Alice',100,200,1
union select 'Alice',120,210,2
union select 'Bob'  ,200,200,1
union select 'Clara',300,100,1
union select 'Clara',300,210,2
union select 'Dave' ,100,10 ,1


;with latestScrapeCTE
AS
(
        SELECT TOP 1 runID
                     ,startDate
        FROM #scrape_run
        WHERE completedDate IS NOT NULL
)
SELECT userName
       ,wins
       ,losses
       ,startDate     AS scrapeDate
       ,wins - losses AS winDiff
       ,wins + losses AS totalPlayed
       ,RANK() OVER (ORDER BY (wins - losses)  desc
                              ,(wins + losses) desc
                    ) as rankPos
FROM #scrape
JOIN latestScrapeCTE
ON   runID = scrapeRunID
ORDER BY rankPos
Ed Harper
Assume SQL 2005
Fatal510
A: 

Try something like:

  1. Select user id and max date of last entry for each user.
  2. Select and order records to get ranking based on above query results.

This should work, however depends on your database size.

DECLARE 
    @last_entries TABLE(id int, dte datetime)

-- insert date (dte) of last entry for each user (id)
INSERT INTO
    @last_entries (id, dte)
SELECT
    UserID,
    MAX(ScrapeDate)
FROM
    Scrape WITH (NOLOCK)
GROUP BY
    UserID

-- select ranking
SELECT
    -- optionally you can use RANK OVER() function to get rank value
    UserName,
    Wins,
    Losses
FROM
    @last_entries
    JOIN
     Scraps WITH (NOLOCK)
    ON
     UserID = id
     AND ScrapeDate = dte
ORDER BY
    Winds,
    Losses

I do not test this code, so it could not compile on first run.

Grzegorz Gierlik
+3  A: 

This is what I call the "greatest-n-per-group" problem. It comes up several times per week on StackOverflow.

I solve this type of problem using an outer join technique:

SELECT s1.*, s1.wins / s1.losses AS win_loss_ratio
FROM Scrape s1
LEFT OUTER JOIN Scrape s2
  ON (s1.username = s2.username AND s1.ScrapeDate < s2.ScrapeDate)
WHERE s2.username IS NULL
ORDER BY win_loss_ratio DESC;

This will return only one row for each username -- the row with the greatest value in the ScrapeDate column. That's what the outer join is for, to try to match s1 with some other row s2 with the same username and a greater date. If there is no such row, the outer join returns NULL for all columns of s2, and then we know s1 corresponds to the row with the greatest date for that given username.

This should also work when you have a partially-completed scrape in progress.

This technique isn't necessarily as speedy as the CTE and RANKING solutions other answers have given. You should try both and see what works better for you. The reason I prefer my solution is that it works in any flavor of SQL.

Bill Karwin