I have a table named visiting that looks like this:

id | visitor_id | visit_time 
-------------------------------------
 1 |          1 | 2009-01-06 08:45:02 
 2 |          1 | 2009-01-06 08:58:11
 3 |          1 | 2009-01-06 09:08:23 
 4 |          1 | 2009-01-06 21:55:23
 5 |          1 | 2009-01-06 22:03:35

I want to write a SQL query that counts how many times a user visits within one session (a session being a run of successive visits in which each interval between visits is less than 1 hour).

So, for the example data, I want to get the following result:

visitor_id | count
-------------------
         1 |     3
         1 |     2

BTW, I use postgresql 8.3. Thanks!

UPDATE: updated the timestamps in the example data table. Sorry for the confusion.
UPDATE: I don't much care whether the solution is a single SQL query, a stored procedure, a subquery, etc. I only care how to get it done :)

A: 

One or both of these may work. Note that both group the visits by clock hour, and both return more columns than you are asking for.

SELECT visitor_id,
       date_part('year', visit_time),
       date_part('month', visit_time),
       date_part('day', visit_time),
       date_part('hour', visit_time),
       COUNT(*)
  FROM visiting
 GROUP BY 1, 2, 3, 4, 5;


SELECT visitor_id,
       EXTRACT(EPOCH FROM visit_time)-(EXTRACT(EPOCH FROM visit_time) % 3600),
       COUNT(*)
  FROM visiting
 GROUP BY 1, 2;
Beau Simensen
I won't claim that any of these are very efficient, but I believe they should get the job done.
Beau Simensen
+1  A: 

The question is slightly ambiguous because you're assuming or requiring that the hours start at a set point. A natural query would also return a result record of (1,2) for the visits in the hour from 08:58 to 09:58. You would have to "tell" your query that the start times are, for some determinable reason, visits 1 and 4, or you'd get the natural result set:

visitor_id | count
-------------------
         1 |     3
         1 |     2 <- extra result starting at visit 2
         1 |     1 <- extra result starting at visit 3
         1 |     2
         1 |     1 <- extra result starting at visit 5

That extra logic is going to be expensive, and it is too complicated for my fragile mind this morning; somebody better than me at Postgres can probably solve this.
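
For what it's worth, the "natural result set" above (one one-hour window opening at every visit) is the easy part. A minimal sketch against the visiting table from the question, using only a self-join so it should run on 8.3; the aliases v and w and the output names window_start and visits are mine, not the asker's:

```sql
-- One row per visit: how many visits by the same visitor fall in the
-- hour that starts at that visit (the visit itself included).
SELECT v.visitor_id,
       v.visit_time AS window_start,
       count(*)     AS visits
  FROM visiting v
  JOIN visiting w
    ON w.visitor_id = v.visitor_id
   AND w.visit_time >= v.visit_time
   AND w.visit_time <  v.visit_time + interval '1 hour'
 GROUP BY v.visitor_id, v.id, v.visit_time
 ORDER BY v.visitor_id, v.visit_time;
```

On the example data this yields the counts 3, 2, 1, 2, 1 listed above. Picking out only the windows that open a session is the part that needs the extra logic.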

I would normally solve this with a sessionkey column in the table that I could group by cheaply, for performance reasons, but I think there's also a logical problem. Deriving session info from timings seems dangerous to me, because I don't believe the user will definitely be logged out after an hour's activity. Most session systems expire the session after a period of inactivity, i.e. it's very likely that a visit after 9:45 is going to be in the same session, because your hourly period is going to be reset at 9:08.

annakata
Yeah, "within an hour" is sort of difficult to account for, but based on the example given it looks like "on the hour" is what they are looking for.
Beau Simensen
well frankly I think extracting sessions from timestamps is a losing proposition, hence my sessionkey column statement
annakata
"on the hour" is not what I want, I just modified the example data.
Vincent Wang
that's not quite the problem, answer edited to clarify
annakata
A: 

This is awkward to do in a single SQL statement. The better option is to handle it in a stored procedure.

Dheer
A: 

If it were T-SQL, I would write something like:

SELECT  visitor_id, COUNT(id), 
     DATEPART(yy, visit_time), DATEPART(m, visit_time), 
     DATEPART(d, visit_time), DATEPART(hh, visit_time)
FROM visiting
GROUP BY
    visitor_id, 
    DATEPART(yy, visit_time), DATEPART(m, visit_time), 
    DATEPART(d, visit_time), DATEPART(hh, visit_time)

which gives me:

1   3 2009 1 6 8
1   2 2009 1 6 21

I do not know how, or whether, you can write this in PostgreSQL, though.

Recep
+1  A: 

The problem seems a little fuzzy.

It gets more complicated as id 3 is within an hour of id 1 and 2, but if the user had visited at 9:50 then that would have been within an hour of 2 but not 1.

You seem to be after a smoothed total - for a given visit, how many visits are within the following hour?

Perhaps you should be asking how many visits have a succeeding visit less than an hour distant? If a visit is less than an hour after the preceding one, should it 'count'?

So what you probably want is how many chains do you have where the links are less than an arbitrary amount (so the hypothetical 9:50 visit would be included in the chain that starts with id 1).
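
The chain idea can be sketched in plain SQL without window functions, so it should also work on 8.3. This is only a sketch under assumptions: it uses the visiting table from the question, treats a visit as a chain start when the same visitor has no earlier visit less than an hour before it, and assigns every visit to the latest chain start at or before it. The names session_start and visits are illustrative.

```sql
SELECT s.visitor_id,
       s.session_start,
       count(*) AS visits
  FROM (
        SELECT v.visitor_id,
               -- latest chain start at or before this visit
               (SELECT max(f.visit_time)
                  FROM visiting f
                 WHERE f.visitor_id = v.visitor_id
                   AND f.visit_time <= v.visit_time
                   -- f is a chain start: no earlier visit within the hour before it
                   AND NOT EXISTS (SELECT 1
                                     FROM visiting p
                                    WHERE p.visitor_id = f.visitor_id
                                      AND p.visit_time < f.visit_time
                                      AND p.visit_time > f.visit_time - interval '1 hour')
               ) AS session_start
          FROM visiting v
       ) s
 GROUP BY s.visitor_id, s.session_start
 ORDER BY s.visitor_id, s.session_start;
```

For the example data this groups ids 1 to 3 into one chain and ids 4 and 5 into another, giving counts of 3 and 2. The nested correlated subqueries are likely to be slow on a large table, so an index on (visitor_id, visit_time) would probably be needed.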

Unsliced
yeah, you got my point, that is just what I want. It seems difficult to do with sql, I am considering the *sessionkey* method mentioned by annakata ...
Vincent Wang
unsliced is better at description than I am :) - yeah I think it's far better to seed the database with the data you need if you can than derive the data after the fact
annakata
+1  A: 

no simple solution

I don't know of a simple way to do this in a single SQL statement.
Below are 2 ideas: one uses a loop to count visits, the other changes the way the visiting table is populated.

loop solution

However, it can be done without too much trouble with a loop.
(I have tried to get the postgresql syntax correct, but I'm no expert)

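/* find entries where there is no previous entry for */
/* the same visitor within the previous hour; each   */
/* one starts a session and counts as its 1st visit: */

create temp table temp_table as
select v1.visitor_id,
       v1.visit_time as last_time,
       1 as visits
from visiting v1
where not exists ( select 1
                   from   visiting v2
                   where  v2.visitor_id = v1.visitor_id
                   and    v2.visit_time < v1.visit_time
                   and    v1.visit_time - interval '1 hour' < v2.visit_time
                 );

/* extend every session by its next visit until no   */
/* session grows any more (plpgsql must be installed */
/* in the database, e.g. with createlang, on 8.3):   */

create or replace function count_session_visits() returns void as $$
declare
    n integer;
begin
    loop
        update temp_table t
        set    visits = t.visits + 1 ,
               last_time = v.visit_time
        from   visiting v
        where  v.visitor_id = t.visitor_id
        and    v.visit_time > t.last_time
        and    v.visit_time - interval '1 hour' < t.last_time
        and    not exists ( select 1
                            from   visiting v2
                            where  v2.visitor_id = t.visitor_id
                            and    v2.visit_time > t.last_time
                            and    v2.visit_time < v.visit_time
                          );
        get diagnostics n = row_count;
        exit when n = 0;
    end loop;
end;
$$ language plpgsql;

select count_session_visits();

/* get the result: */

select visitor_id,
       visits
from temp_table;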

The idea here is to do this:

  • get all visits where there is no prior visit inside of an hour.
    • this identifies the sessions
  • loop, getting the next visit for each of these "first visits"
    • until there are no more "next visits"
  • now you can just read off the number of visits in each session.

best solution?

I suggest:

  • add a column to the visiting table: session_id int not null
  • change the process which makes the entries so that it checks to see if the previous visit by the current visitor was less than an hour ago. If so, it sets session_id to the same as the session id for that earlier visit. If not, it generates a new session_id .
  • you could put this logic in a trigger.
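
A minimal sketch of such a trigger in PL/pgSQL, under stated assumptions: the names visiting_session_seq, assign_session_id() and visiting_session_trg are hypothetical, rows are assumed to be inserted in time order, and two concurrent inserts for the same visitor are not guarded against. The column is added as nullable here because existing rows would need to be backfilled before it can be made not null.

```sql
CREATE SEQUENCE visiting_session_seq;

-- Nullable for now; backfill old rows, then ALTER ... SET NOT NULL.
ALTER TABLE visiting ADD COLUMN session_id int;

CREATE OR REPLACE FUNCTION assign_session_id() RETURNS trigger AS $$
DECLARE
    prev RECORD;
BEGIN
    -- Most recent visit already stored for this visitor, if any.
    SELECT session_id, visit_time INTO prev
      FROM visiting
     WHERE visitor_id = NEW.visitor_id
       AND visit_time <= NEW.visit_time
     ORDER BY visit_time DESC
     LIMIT 1;

    IF FOUND AND NEW.visit_time - prev.visit_time < interval '1 hour' THEN
        -- Less than an hour since the previous visit: same session.
        NEW.session_id := prev.session_id;
    ELSE
        -- First visit, or a gap of an hour or more: new session.
        NEW.session_id := nextval('visiting_session_seq');
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER visiting_session_trg
    BEFORE INSERT ON visiting
    FOR EACH ROW EXECUTE PROCEDURE assign_session_id();
```

Doing the check at insert time keeps the reporting query trivial, at the cost of one extra lookup per insert.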

Then your original query can be solved by:

SELECT session_id, visitor_id, count(*)
FROM   visiting 
GROUP BY session_id, visitor_id

Hope this helps. If I've made mistakes (I'm sure I have), leave a comment and I'll correct it.

AJ
Because the sql query of the first method is complex and hard to maintain and extend, I prefer the second method. Thanks!
Vincent Wang
+1  A: 

PostgreSQL 8.4 will have window functions; once they are available, we won't need to create a temporary table just to simulate row numbers (for sequencing).

create table visit
(
visitor_id int not null,
visit_time timestamp not null
);




insert into visit(visitor_id, visit_time) 
values
(1, '2009-01-06 08:45:02'),
(2, '2009-02-06 08:58:11'),
(1, '2009-01-06 08:58:11'),
(1, '2009-01-06 09:08:23'),
(1, '2009-01-06 21:55:23'),
(2, '2009-02-06 08:59:11'),
(2, '2009-02-07 00:01:00'),
(1, '2009-01-06 22:03:35');




create temp table temp_visit(visitor_id int not null, sequence serial not null, visit_time timestamp not null);
insert into temp_visit(visitor_id, visit_time) select visitor_id, visit_time from visit order by visitor_id, visit_time;


select 
 reference.visitor_id, count(nullif(reference.visit_time - prev.visit_time < interval '1 hour',false))
from temp_visit reference
left join temp_visit prev 
on prev.visitor_id = reference.visitor_id and prev.sequence = reference.sequence - 1
group by reference.visitor_id;
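
For comparison, a sketch of how per-session counts might look once window functions are available. It assumes PostgreSQL 8.4 or later and the visit table created above; the names is_start, session_no, session_start and visits are only illustrative.

```sql
SELECT visitor_id,
       min(visit_time) AS session_start,
       count(*)        AS visits
  FROM (
        SELECT visitor_id, visit_time,
               -- running total of session starts = session number per visitor
               sum(is_start) OVER (PARTITION BY visitor_id
                                   ORDER BY visit_time) AS session_no
          FROM (
                SELECT visitor_id, visit_time,
                       -- 1 when there is no previous visit or the gap is an hour or more
                       CASE WHEN visit_time
                                 - lag(visit_time) OVER (PARTITION BY visitor_id
                                                         ORDER BY visit_time)
                                 < interval '1 hour'
                            THEN 0 ELSE 1
                       END AS is_start
                  FROM visit
               ) flagged
       ) numbered
 GROUP BY visitor_id, session_no
 ORDER BY visitor_id, session_start;
```

With the sample rows inserted above, this should return sessions of 3 and 2 visits for visitor 1, and 2 and 1 visits for visitor 2.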
Michael Buen