Web server log analyzers (e.g. Urchin) often display a number of "sessions". A session is defined as a series of page visits / clicks made by an individual within a limited, continuous time segment. Analyzers try to identify these segments from IP addresses, often supplemented by information such as user agent and OS, together with a session timeout threshold such as 15 or 30 minutes.
For certain web sites and applications, a user can be logged in and/or tracked with a cookie, which means the server can precisely know when a session begins. I'm not talking about that, but about inferring sessions heuristically ("session reconstruction") when the web server does not track them.
I could write some code e.g. in Python to try to reconstruct sessions based on the criteria mentioned above, but I'd rather not reinvent the wheel. I'm looking at log files of a size around 400K lines, so I'd have to be careful to use a scalable algorithm.
My goal here is to extract a list of unique IP addresses from a log file, and for each IP address, to have the number of sessions inferred from that log. Absolute precision and accuracy are not necessary... pretty-good estimates are ok.
Based on this description:
a new request is put in an existing session if two conditions are valid:
- the IP address and the user-agent are the same of the requests already inserted in the session,
- the request is done less than fifteen minutes after the last request inserted.
Given that, it would be simple in theory to write a Python program that builds up a dictionary (keyed by IP) of dictionaries (keyed by user-agent) whose values are pairs: (number of sessions, time of the latest request in the latest session).
But I would rather try to use an existing implementation if one's available, since I might otherwise risk spending a lot of time tuning performance.
FYI lest someone ask for sample input, here is a line of our log file (sanitized):
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status
2010-09-21 23:59:59 215.51.1.119 GET /graphics/foo.gif - 80 - 128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) http://www.mysite.org/blarg.htm 200 0 0
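For reference, here is a minimal sketch of the dictionary approach described above, in case it helps frame answers. It is not a tuned implementation. It assumes the log lines are in chronological order, that a `#Fields:` line precedes the data (as in the sample), and that a 15-minute timeout is acceptable. The names `count_sessions` and `SESSION_TIMEOUT` are my own. To keep it short it uses one flat dictionary keyed by (IP, user-agent) rather than nested dictionaries, plus a separate per-IP counter.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=15)  # assumed threshold; adjust as needed


def count_sessions(lines, timeout=SESSION_TIMEOUT):
    """Return {client_ip: inferred session count} for W3C-style log lines.

    Assumes the lines are in chronological order and that a '#Fields:'
    directive appears before the data lines.
    """
    fields = []
    last_seen = {}               # (ip, user_agent) -> time of latest request
    sessions = defaultdict(int)  # ip -> number of sessions inferred so far

    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line; only '#Fields:' matters here.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue

        row = dict(zip(fields, line.split()))
        try:
            when = datetime.strptime(
                row["date"] + " " + row["time"], "%Y-%m-%d %H:%M:%S"
            )
            key = (row["c-ip"], row["cs(User-Agent)"])
        except (KeyError, ValueError):
            continue  # skip malformed or unparseable lines

        previous = last_seen.get(key)
        # Start a new session on first sight of this (ip, user-agent) pair,
        # or when the gap since its last request reaches the timeout.
        if previous is None or when - previous >= timeout:
            sessions[key[0]] += 1
        last_seen[key] = when

    return dict(sessions)
```

This makes a single pass over the file and keeps one entry per distinct (IP, user-agent) pair, so memory grows with the number of distinct clients rather than the number of lines. It can be fed an open file object directly, since it only iterates over lines.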