views:

121

answers:

3
+1  A: 

Based on the additional data from your comments, I have to conclude that you don't have enough information in the table to reliably make this update. You might be able to do some more analysis and find an acceptable timespan value you can group on and use for a one-time fix using something like Charles Bretana's method, but you'll have do that analysis yourself and you won't be able to count on that number or any number to continue to work long-term.

Joel Coehoorn
ok tricky thing here is the landing page isn't always 'homepage'
Nai
In that case, you really want a reliable way to signify the home page. Otherwise you have to rely on grouping by the timestamps, and that _won't_ work. You could pick something nice like 30minutes, but eventually you'll have a few sessions that run for longer. If you stretch the time out to accommodate that you'll eventually have two users sessions close enough together that they overlap in your groups.
Joel Coehoorn
Hmm... even that may not be enough. Who says users don't hit the home page more than once in a session? But I do have an idea. Give me some time and I'll see if I can come up with something.
Joel Coehoorn
You are correct Joel. But it seems I do not have a choice with the time limitations. Thanks for your time man.
Nai
Nevermind: the other method I was exploring required the first page in a session to always be the home page. That may not be true. Given that, I must conclude that you don't have enough information stored in your table to _reliably_ make this update. You _might_ be able to fudge it with the dates, but over time it would eventually break.
Joel Coehoorn
A: 

Does this look right?

UPDATE T Set SessionGuid = 
    (Select Min(UserGuid) FROM [mytable] 
     Where IPAddress=T.IPAddress 
       And Date = (Select Min(Date) From MyTable
                   Where IPAddress = T.IPAddress
                       And DateDiff(minute, Date, T.Date) 
                              Between 0 and 30)) 
FROM [mytable] T
Charles Bretana
Its returning more than 1 value with your sub query and not liking it.
Nai
@Nai, this would only happen if there are multiple rows with same IPAddress and exact same datetime. But it is what it is as they say, so I edited sql to handle that. Try Again?
Charles Bretana
It's not combining the sessions correctly on a bigger dataset. Thanks for your effort.
Nai
+1  A: 

EDIT I had a number of syntax errors, since I didn't test. (Note to self, don't submit answers without test code.) And to top it off, once I fixed the syntax, and ran against the test data, I found that my answer was wrong. The subquery testing for existence of a row within thirty minutes prior was completely bogus.

Corrected version:

; with EliminateTies (UserGuid, SessionGuid, "Date") as
    (select UserGuid
        , cast(min(cast(SessionGuid as varbinary)) as uniqueidentifier)
        , "Date"
    from Sessions
    group by UserGuid, "Date")
, SessionBoundaries (UserGuid, SessionGuid, StartDateTime, SessionNumber) as
    (select UserGuid, SessionGuid, "Date"
        , row_number() over 
            (partition by UserGuid 
            order by "Date") as SessionNumber
    from (select UserGuid, SessionGuid, "Date" from EliminateTies
        union all
        -- Add a set of records at the end of time, to bound the last
        -- of each users sessions
        select distinct UserGuid
            , null as SessionGuid
            , cast('9999-12-31' as datetime) as "Date"
        from Sessions) ET_Out
    where not exists (select *
        from EliminateTies ET_In
        where ET_Out.UserGuid = ET_In.UserGuid
        and ET_Out.SessionGuid <> ET_In.SessionGuid
        and (dateadd(minute, -30, ET_Out."Date") < ET_In."Date" 
            and ET_In."Date" <= ET_Out."Date")))
Update MT
    set SessionGuid = LowBound.SessionGuid
from SessionBoundaries LowBound
inner join SessionBoundaries HighBound
    on LowBound.UserGuid = HighBound.UserGuid
    and LowBound.SessionNumber = HighBound.SessionNumber - 1
inner join Sessions S
    on S.UserGuid = LowBound.UserGuid
    and LowBound.StartDateTime <= S."Date" 
    and S."Date" < HighBound.StartDateTime

EDIT 2 Adding explanation:

  1. ; with EliminateTies ... Define a relation called EliminateTies. This is to handle the possibility that the triples (UserGuid, SessionGuid, "Date") may contain duplicates for a given (UserGuid, "Date"). SQL Server datetime has a resolution of 1/300 of a second, so a duplicate is unlikely, but not impossible.
    1. EliminateTies will contain a row for every (UserGuid, "Date") pair group by UserGuid, "Date".
    2. The SessionGuid is picked arbitrarily from the set of SessionGuids available for that pair by the agregate function min(). As pointed out in a comment, MIN(GUID) is not allowed, so a cast to VARBINARY is made, MIN() is found, and then a cast back to a UNIQUEIDENTIFIER.
    3. Notes:
      1. I start with a semicolon because while SQL Server normally does not require a semicolon at the end of a statement, it does in the case of a statement that precedes a with ... Instead of trying to retrain myself to end all statements with semicolons, I have trained myself to start all with ... using a semicolon. The semicolon actually terminates any previous statement.
      2. I am quoting the "Date" identifier because date is a datatype in SQL Server 2008 and hence a reserved word there. OP is on 2005, so not strictly nescessary.
  2. , SessionBoundaries Another relationship being defined. This relation has all the points where a new session starts based on the rule that for a given UserGuid sessions start when there is no entry in Sessions within the last 30 minutes.
    1. The inline view from (select ... union all ...) ET_Out Adds an extra row with a date at the "end of time" for each UserGuid. This is because we want to group the rows between a high and a low boundary, and we need an upper boundary for the last session of every user.
    2. where not exists filters out the rows that are not at the start of a session boundary.
    3. Within the select clause row_number() ... will number the rows, ordered by date, so that every UserGuid has SessionNumbers 1..N+1 where N is the number of sessions. We later use the SessionNumbers to find neighboring boundaries. (Could have been better called SessionBoundaryNumber.)
  3. And finally the update. We add a from clause to allowing joining relations together. The update specifies the table alias of the table we wish to update from the from clause. (Note this is T-SQL and not an ANSI standard way to use multiple relations together in a view.)
    1. First we self join SessionBoundary by equality on UserGuid and by being neighbor boundaries based on SessionNumber. If user A had 2 sessions, there would be 3 SessionNumbers 1, 2 & 3. This join would result in (Low, High) pairs of (1, 2) & (2, 3). Now for every UserGuid we have the lower and upper boundary of every session. The upper boundary of the last session is "end of time."
    2. Next we join to the table we want to update, Sessions The join condition is that UserGuid is equal and that the date of the row in Sessions is within the range defined by [LowBound, HighBound). Remember the Boundaries are where sessions start. So the test is that LowBound <= Sessions."Date" < HighBound.
Shannon Severance
ok gimme a second
Nai
1. Got rid of the MIN(SessionGUID) -- MIN/MAX not working on GUID2. Added SessionGUID to GROUP BYIt's now saying "Invalid object name 'ElminateTies' "
Nai
Adding SessionGUID to the group by doesn't help. The point of the EliminateTies is to handle the (improbable, but possibly possible) case of two rows with the same datetime for the same UserGuid. And then yes, I had a number of typos. Thank you for the test data.
Shannon Severance
very nice. bravo!
Nai
I'm having a bit of problem following your logic through. Do you think you could add some more comments throughout just so I can learn from it?
Nai