Based on the additional data from your comments, I have to conclude that you don't have enough information in the table to reliably make this update. You might be able to do some more analysis and find an acceptable timespan value you can group on and use for a one-time fix using something like Charles Bretana's method, but you'll have do that analysis yourself and you won't be able to count on that number or any number to continue to work long-term.
+1
A:
Joel Coehoorn
2009-09-01 16:02:44
ok tricky thing here is the landing page isn't always 'homepage'
Nai
2009-09-01 16:09:29
In that case, you really want a reliable way to signify the home page. Otherwise you have to rely on grouping by the timestamps, and that _won't_ work. You could pick something nice like 30minutes, but eventually you'll have a few sessions that run for longer. If you stretch the time out to accommodate that you'll eventually have two users sessions close enough together that they overlap in your groups.
Joel Coehoorn
2009-09-01 16:25:37
Hmm... even that may not be enough. Who says users don't hit the home page more than once in a session? But I do have an idea. Give me some time and I'll see if I can come up with something.
Joel Coehoorn
2009-09-01 16:26:23
You are correct Joel. But it seems I do not have a choice with the time limitations. Thanks for your time man.
Nai
2009-09-01 16:27:58
Nevermind: the other method I was exploring required the first page in a session to always be the home page. That may not be true. Given that, I must conclude that you don't have enough information stored in your table to _reliably_ make this update. You _might_ be able to fudge it with the dates, but over time it would eventually break.
Joel Coehoorn
2009-09-01 16:36:58
A:
Does this look right?
UPDATE T Set SessionGuid =
(Select Min(UserGuid) FROM [mytable]
Where IPAddress=T.IPAddress
And Date = (Select Min(Date) From MyTable
Where IPAddress = T.IPAddress
And DateDiff(minute, Date, T.Date)
Between 0 and 30))
FROM [mytable] T
Charles Bretana
2009-09-01 16:16:42
@Nai, this would only happen if there are multiple rows with same IPAddress and exact same datetime. But it is what it is as they say, so I edited sql to handle that. Try Again?
Charles Bretana
2009-09-01 18:37:22
It's not combining the sessions correctly on a bigger dataset. Thanks for your effort.
Nai
2009-09-03 09:59:26
+1
A:
EDIT I had a number of syntax errors, since I didn't test. (Note to self, don't submit answers without test code.) And to top it off, once I fixed the syntax, and ran against the test data, I found that my answer was wrong. The subquery testing for existence of a row within thirty minutes prior was completely bogus.
Corrected version:
; with EliminateTies (UserGuid, SessionGuid, "Date") as
(select UserGuid
, cast(min(cast(SessionGuid as varbinary)) as uniqueidentifier)
, "Date"
from Sessions
group by UserGuid, "Date")
, SessionBoundaries (UserGuid, SessionGuid, StartDateTime, SessionNumber) as
(select UserGuid, SessionGuid, "Date"
, row_number() over
(partition by UserGuid
order by "Date") as SessionNumber
from (select UserGuid, SessionGuid, "Date" from EliminateTies
union all
-- Add a set of records at the end of time, to bound the last
-- of each users sessions
select distinct UserGuid
, null as SessionGuid
, cast('9999-12-31' as datetime) as "Date"
from Sessions) ET_Out
where not exists (select *
from EliminateTies ET_In
where ET_Out.UserGuid = ET_In.UserGuid
and ET_Out.SessionGuid <> ET_In.SessionGuid
and (dateadd(minute, -30, ET_Out."Date") < ET_In."Date"
and ET_In."Date" <= ET_Out."Date")))
Update MT
set SessionGuid = LowBound.SessionGuid
from SessionBoundaries LowBound
inner join SessionBoundaries HighBound
on LowBound.UserGuid = HighBound.UserGuid
and LowBound.SessionNumber = HighBound.SessionNumber - 1
inner join Sessions S
on S.UserGuid = LowBound.UserGuid
and LowBound.StartDateTime <= S."Date"
and S."Date" < HighBound.StartDateTime
EDIT 2 Adding explanation:
; with EliminateTies ...
Define a relation called EliminateTies. This is to handle the possibility that the triples(UserGuid, SessionGuid, "Date")
may contain duplicates for a given(UserGuid, "Date")
. SQL Server datetime has a resolution of 1/300 of a second, so a duplicate is unlikely, but not impossible.- EliminateTies will contain a row for every
(UserGuid, "Date")
pairgroup by UserGuid, "Date"
. - The
SessionGuid
is picked arbitrarily from the set ofSessionGuid
s available for that pair by the agregate functionmin()
. As pointed out in a comment,MIN(GUID)
is not allowed, so a cast toVARBINARY
is made,MIN()
is found, and then a cast back to aUNIQUEIDENTIFIER
. - Notes:
- I start with a semicolon because while SQL Server normally does not require a semicolon at the end of a statement, it does in the case of a statement that precedes a
with ...
Instead of trying to retrain myself to end all statements with semicolons, I have trained myself to start allwith ...
using a semicolon. The semicolon actually terminates any previous statement. - I am quoting the "Date" identifier because date is a datatype in SQL Server 2008 and hence a reserved word there. OP is on 2005, so not strictly nescessary.
- I start with a semicolon because while SQL Server normally does not require a semicolon at the end of a statement, it does in the case of a statement that precedes a
- EliminateTies will contain a row for every
, SessionBoundaries
Another relationship being defined. This relation has all the points where a new session starts based on the rule that for a given UserGuid sessions start when there is no entry in Sessions within the last 30 minutes.- The inline view
from (select ... union all ...) ET_Out
Adds an extra row with a date at the "end of time" for each UserGuid. This is because we want to group the rows between a high and a low boundary, and we need an upper boundary for the last session of every user. where not exists
filters out the rows that are not at the start of a session boundary.- Within the select clause
row_number() ...
will number the rows, ordered by date, so that everyUserGuid
hasSessionNumber
s 1..N+1 where N is the number of sessions. We later use the SessionNumbers to find neighboring boundaries. (Could have been better calledSessionBoundaryNumber
.)
- The inline view
- And finally the update. We add a
from
clause to allowing joining relations together. The update specifies the table alias of the table we wish to update from the from clause. (Note this is T-SQL and not an ANSI standard way to use multiple relations together in a view.)- First we self join
SessionBoundary
by equality onUserGuid
and by being neighbor boundaries based onSessionNumber
. If user A had 2 sessions, there would be 3SessionNumber
s 1, 2 & 3. This join would result in (Low, High) pairs of (1, 2) & (2, 3). Now for everyUserGuid
we have the lower and upper boundary of every session. The upper boundary of the last session is "end of time." - Next we join to the table we want to update,
Sessions
The join condition is thatUserGuid
is equal and that the date of the row inSessions
is within the range defined by [LowBound, HighBound). Remember the Boundaries are where sessions start. So the test is that LowBound <= Sessions."Date" < HighBound.
- First we self join
Shannon Severance
2009-09-01 17:22:25
1. Got rid of the MIN(SessionGUID) -- MIN/MAX not working on GUID2. Added SessionGUID to GROUP BYIt's now saying "Invalid object name 'ElminateTies' "
Nai
2009-09-01 18:20:38
Adding SessionGUID to the group by doesn't help. The point of the EliminateTies is to handle the (improbable, but possibly possible) case of two rows with the same datetime for the same UserGuid. And then yes, I had a number of typos. Thank you for the test data.
Shannon Severance
2009-09-01 19:34:56
I'm having a bit of problem following your logic through. Do you think you could add some more comments throughout just so I can learn from it?
Nai
2009-09-03 10:02:54