views:

291

answers:

6

Take my profile for example, or any question number of views on this site, what is the process of logging the number of visits per page or object on a website, which I presumably think includes:

  • Counting registered users once (this must be reflected in the db, which pages / objects the user has visited). this will also not include unregistered users
  • IP: log the visit of each IP per page / object; this could be troublesome as you might have 2 different people checking the same website; or you really do want to track repeat visitors.
  • Cookie: this will probably result in that people with multiple computers would be counted twice
  • other method goes here ....

The question is, what is the process and best practice to count user requests?

EDIT

I've added the computer languages to the list of tags as they are of interest to me. Feel free to include any libraries, modules, and/or extensions that achieve the task.

The question could be rephrased into:

  • How does someone go about measuring the number of imprints when a user goes on a page? The question is not intended to be similar to what Google analytics does, rather it should be something similar to when you click on a stackoverflow question or profile and see the number of views.
A: 

I will try and implement pixel tracking to track views on your page/object. This method is used by google (google analytics) and other high profile media companies.

David Bonnici
I can give you more help on this if you are interested
David Bonnici
+1  A: 

Use a database to keep a record of the unique IPs (if the IP doesn't exist in the DB, create it, otherwise continue as planned) and then query the database for the number of those entities. Index this with IP and URL to store views for individual pages. You wont have to worry about tracking registered users this way, they will be totaled into the unique IP count. As far as multiple people from one IP, there's not much you can do there short of requiring an account and counting user->to->page-views similarly.

wybiral
+3  A: 

The best practice for a hit counter depends on how much traffic you expect your site to receive. As wybiral suggested, you can implement something that writes to a database after every request. This might include the IP address if you want to count unique visitors, or it could be a simple as just incrementing a running total for each page or for each (page, user) pair.

But that requires a database write for every request, even if you just want to serve a static page. Ideally speaking, a scalable web app should serve as much as possible from an in-memory cache. Database or disk I/O should be avoided as much as possible.

So the ideal set up would be to build up some representation of the server's activity in-memory and then occasionally (say every 15 minutes) write those events to the database. You could conceivably queue up thousands of requests and then store them with a single database write.

There's a tutorial describing how to do exactly this in python using Celery and Carrot: http://packages.python.org/celery/tutorials/clickcounter.html. It also includes some examples of how to set up your database tables using Django models and what code to call whenever someone accesses a page.

This tutorial will certainly be helpful to you regardless of what you choose to implement, although this level of architecture might be overkill if you don't expect thousands of hits each hour.

AndrewF
A: 

Pixel tracking will be fine, since you can have point the trackingpixel to a HttpHandler specific for that purpose. That way you can seperate the load and even use some kind of queue for highload scenarios.

Also, you can incorporate user specific information in the tracking pixel such as WHO has visited the page.

eg:

<a href="fakeimages/imba.gif?uid=123&info2=a&info3=b" style="height:1px;width:1px;" />

Then you need to handle the request going to fakeimages/*.gif with a specific HttpHandler / php redirect/controller (whatever language you're using) and process the infos.

regards

ovm
+6  A: 

The "correct" answer varies according to the situation; primarily the most desired statistic and the availability of resources to gather and process them: eg:

Server Side

Raw web server logs

All webservers have some facility to log requests. The trouble with them is that it requires a lot of processing to get meaningful data out and, for your example scenario, they won't record application specific details; like whether or not the request was associated with a registered user.

This option won't work for what you're interested in.

File based application logs

The application programmer can apply custom code to the application to record the stuff you're most interested in to a log file. This is similiar to the webserver log; except that it can be application aware and record things like the member making the request.

The programmers may also need to build scripts which extract from these logs the stuff you're most interested. This option might be suited to a high traffic site with lots of disk space and sysadmins who know how to ensure the logs get rotated and pruned from the production servers before bad things happen.

Database based application logs

The application programmer can write custom code for the application which records every request in a database. This makes it relatively easy to run reports and makes the data instantly accessible. This solution incurs more system overhead at the time of each request so better suited to lesser traffic sites, or scenarios where the data is highly valued.

Client Side

Javascript postback

This is a consideration on top of the above options. Google analytics does this.

Each page includes some javascript code which tells the client to report back to the webserver that the page was viewed. The data might be recorded in a database, or written to file.

Has an strong advantage of improving accuracy in scenarios where impressions get lost due to heavy caching/proxying between the client and server.

Cookies

Every time a request is received from someone who doesn't present a cookie then you assume they're new and record that hit as 'anonymous' and return a uniquely identifying cookie after they login. It depends on your application as to how accurate this proves. Some applications don't lend themselves to caching so it will be quite accurate; others (high traffic) encourage caching which will reduce the accuracy. Obviously it's not much use till they re-authenticate whenever they switch browsers/location.

What's most interesting to you?

Then there's the question of what statistics are important to you. For example, in some situations you're keen to know:

  • how many times a page was viewed, period,
  • how many times a page was viewed, by a known user
  • how many of your known users have viewed a specific page

Thence you typically want to break it down into periods of time to see trending. Respectively:

  • are we getting more views from random people?
  • or we getting more views from registered users?
  • or has pretty much every one who is going to see the page now seen it?

So back to your question: best practice for "number of imprints when a user goes on a page"?

It depends on your application.

My guess is that you're best off with a database backed application which records what is most interesting to your application and uses cookies to trace the member's sessions.

John Mee
+1 for a really nice answer. One thing that is good to consider when dealing with log files (especially files that can get quite large) is to write scripts that can extract the important information from a log file for a day, week, or month's worth of logs to avoid having to keep a lot of old log files around.
GWW
+1  A: 

I would suggest using a persistent key/value store like Redis. If you use a list with the list key being the serialized identifier, you can store other serialized entries and use llen to find the list size.

Example (python) after initializing your Redis store:

def intializeAndPush(serializedKey, serializedValue):
    if not redisStore.exists(serializedKey):
        redisStore.push(serializedKey, serializedValue)
    else:
        if serializedValue not in redisStore.lrange(serializedKey, 0, -1):
            redisStore.push(serializedKey, serializedValue)

def getSizeOf(serializedKey):
    if redisStore.exists(serializedKey):
        return redisStore.llen(serializedKey)
    else:
        return 0

Using this technique, you can use anything as serializedKey or serializedValue. If you want to store IPs with today's date or serialized login information, both are just as simple. Also, only unique serializedValues are stored since writes are locked on read (at least as I recall).

Scott