views:

61

answers:

2

Hello,

I need to determine if user has already visited a page, for tracking unique page views.

I have already implemented some HTTP header cache, but now I need to optimize the SQL queries.

The visit is unique, when:

  • pair: page_id + user_id is found in the visit table
  • or pair: page_id + session_id is found
  • or: page_id + [ip + useragent] - (this is a topic for another discussion, whether it should be only ip or ip+useragent)

So I have a table tracking user visits:

visit:
    page_id
    user_id
    session_id
    useragent
    ip
    created_at
    updated_at

Now on each user visit (which does not hit cache) I will update a row if it exists. If there are any affected rows, I will insert new visit to the table.

This are one or two queries (assuming the cache will work, mostly two queries), but the number of rows is limited somehow. Maybe it would be better to store all the visits and then clean up the database within e.g. a month?

The questions are:

  • how should be the visit table constructed (keys, indexes, relations to user and page_views table). Some of the important fields may be null (e.g. user_id), what about indexes then? Do I need a multi column primary key?
  • which would be the fastest sql query to find the unique user?
  • is this sane approach?

I use PostgreSQL and PDO (Doctrine ORM). All my sessions are stored in the same DB.

A: 

Just some random thoughts:

Can I verify that the thinking behind the unique visit types is:

  1. pageid + userid = user has logged in
  2. pageid + sessionid = user not identified but has cookies enabled
  3. pageid + ip / useragent = user not identified and no cookies enabled

For raw performance, you might consider #2 to be redundant since #3 will probably cover #2 i most conditions (or is #2 important e.g. if the user then registers and then #2 can be mapped to a #1)? (meaning that session id might still be logged, but not used in any visit determination)

IMHO IP will always be present (even if spoofed) and will be a good candidate for an Index. User agent can be hidden and will only have a limited range (not very selectable).

I would use a surrogate primary key in this instance due to the nullable fields and since none of the fields is unique by themselves.

IMHO your idea about storing ALL the visits and then trimming the duplicates via batch out is a good one to weigh up (rather than checking if exists to update vs insert new)

  • So PK = Surrogate
  • Clustering = Not sure - another query / requirement might drive this better.
  • NonClustered Index = IP Address, Page Id (assuming more distinct IP addresses than page id's)
nonnb
+1  A: 

Personally I would not put this in the request-response path. I would log the the raw data in a table (or push it on a queue) and let a background task/thread/cron job deal with that.

The queue (or the message passing table) should then just contain pageid, userip, sessionid, useragen,ip.

Absolute timings are less important now as long as the background task can keep up. since a single thread will now do the heavy lifting it will not create conflicting locks when updating the unique pageviews tables.

Peter Tillemans