I have a file with a sequence of event timestamps corresponding to the times at which someone visits a website:

02.02.2010 09:00:00
02.02.2010 09:00:00
02.02.2010 09:00:00
02.02.2010 09:00:01
02.02.2010 09:00:03
02.02.2010 09:00:05
02.02.2010 09:00:06
02.02.2010 09:00:06
02.02.2010 09:00:09
02.02.2010 09:00:11
02.02.2010 09:00:11
02.02.2010 09:00:11

etc, for several thousand rows.

I'd like to get an idea of how the web hits are distributed over time, over the week, etc. I need to know how I should scale the (future) web servers in order to guarantee service availability with a given number of nines. In particular, I need to give upper bounds on the number of almost-concurrent visits.

Are there any resources out there that explain how to do that? I'm fluent in mathematics and statistics, and I've looked at queuing theory, but it seems to assume that the rate of arrival is independent of the time of day, which is clearly wrong in my case. And NO, histograms are not the right answer, since the result depends heavily on bin width and placement.

A: 

Well, prepare yourself for a whole lot of 'what's wrong with AWStats/Webalizer/Analog-Stats/favourite-http-log-stats-viewer-of-the-month' responses...

They all do histograms, but that's because they are designed to give a broad, at-a-glance picture of visitor traffic.

I recommend that you take a look at Splunk to see if it meets your requirements.

ChrisGNZ
A: 

If you don't want to use a histogram, could you just graph the kernel density?
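
For what it's worth, here is a minimal sketch of a kernel density estimate over time of day, in plain Python. The function names `seconds_of_day` and `gaussian_kde` are made up for this example, the bandwidth of 2 seconds is arbitrary, and wrap-around at midnight is ignored. Note that the bandwidth plays much the same role as a histogram's bin width, so it still has to be chosen somehow.

```python
import math
from datetime import datetime

def seconds_of_day(stamp):
    """Parse 'dd.mm.yyyy HH:MM:SS' and return seconds since midnight."""
    t = datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S")
    return t.hour * 3600 + t.minute * 60 + t.second

def gaussian_kde(samples, bandwidth):
    """Return a function x -> estimated density (integrates to 1 over x)."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# A few rows from the question; a real run would read the whole file.
stamps = [
    "02.02.2010 09:00:00", "02.02.2010 09:00:00", "02.02.2010 09:00:01",
    "02.02.2010 09:00:03", "02.02.2010 09:00:05", "02.02.2010 09:00:06",
]
samples = [seconds_of_day(s) for s in stamps]

# Bandwidth in seconds; the value 2.0 is an arbitrary choice for illustration.
density = gaussian_kde(samples, bandwidth=2.0)
for x in range(32400, 32408):
    print(x, round(density(x), 4))
```

Multiplying the density by the total number of hits gives an estimated hit rate per second at each time of day.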

Jack
A: 

Can nearly concurrent visits be defined or approximated as those that occur in the same second? If yes, here is how I would proceed:

  1. For each second in the data calculate the number of visits. This will include some seconds with 0 visits - don't exclude them.
  2. It is probably reasonable to assume that the number of visits per second has a Poisson distribution with a rate that changes over the day, and perhaps over the week. So decide what the relevant predictors are (time of day, day of the week, month?) and use Poisson regression to model the counts. You can use splines for the continuous variables (e.g. time of day); I believe there are even some "cyclic" splines that can take into account that 23:58 is close to 00:02. Or you can cut time into smaller discrete pieces, say 10-minute intervals. If you want to be really fancy, incorporate autocorrelation and overdispersion in the model.
  3. Based on the fitted model, you can estimate whatever percentile you want.

Of course, this is pretty fancy statistically, and you have to know what you are doing, but I think it could work.
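
Steps 1 and 3 can be sketched in plain Python as follows. This is only an illustration: the helper names `counts_per_second` and `poisson_quantile` are invented here, and the regression of step 2 is replaced by a single constant rate, which is a deliberate simplification and not the model described above. In practice the rate passed to `poisson_quantile` would come from the fitted model for the time slot of interest.

```python
import math
from collections import Counter
from datetime import datetime

FMT = "%d.%m.%Y %H:%M:%S"  # format of the sample rows in the question

def counts_per_second(stamps):
    """Step 1: hits for every second from first to last hit, zeros included."""
    times = [datetime.strptime(s, FMT) for s in stamps]
    start = min(times)
    hits = Counter(int((t - start).total_seconds()) for t in times)
    return [hits.get(k, 0) for k in range(max(hits) + 1)]

def poisson_quantile(rate, p):
    """Step 3: smallest k with P(X <= k) >= p for X ~ Poisson(rate)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    k = 0
    term = math.exp(-rate)  # P(X = 0)
    cdf = term
    while cdf < p:
        k += 1
        term *= rate / k    # P(X = k) from P(X = k - 1)
        cdf += term
    return k

stamps = [
    "02.02.2010 09:00:00", "02.02.2010 09:00:00", "02.02.2010 09:00:00",
    "02.02.2010 09:00:01", "02.02.2010 09:00:03", "02.02.2010 09:00:05",
    "02.02.2010 09:00:06", "02.02.2010 09:00:06", "02.02.2010 09:00:09",
    "02.02.2010 09:00:11", "02.02.2010 09:00:11", "02.02.2010 09:00:11",
]
counts = counts_per_second(stamps)

# Assumption: one constant rate stands in for the fitted Poisson regression.
rate = sum(counts) / len(counts)
print(counts)
print(rate, poisson_quantile(rate, 0.999))
```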

Aniko
+2  A: 

You can always place a more flexible model on the arrival rate parameter. For instance, make the arrival rate a function of time, or place some time-series style model on it. Whatever makes sense for your data. The literature typically focuses on the core model because extensions are application specific.

In an extended model, you'll almost certainly want to use Bayesian methods. You are interested in the posterior predictive distribution of the object "almost concurrent events." A recent paper in JASA describes nearly your exact problem, applied to call center data:

For a quick solution, don't underestimate the power of histogram-style estimators. They are simple nonparametric estimators, and you can cross-validate tuning parameters like bin width and placement. Theoretically this is somewhat unsatisfying, but it would take a day to implement. A fully Bayesian approach will likely dominate, but at significant computational cost.
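
One way to cross-validate the bin width and placement is the standard leave-one-out risk estimate for histograms, J(h) = 2/((n-1)h) - (n+1)/((n-1)h) * sum_j p_j^2, where p_j is the fraction of points in bin j; lower is better. The sketch below uses that score; the function names, the candidate widths, and the synthetic data are all made up for illustration. One caveat: timestamps rounded to whole seconds contain ties, which push this score toward very narrow bins, so candidate widths should stay well above the timestamp resolution.

```python
import math
import random

def histogram_cv_risk(samples, bin_width, origin):
    """Leave-one-out cross-validation risk estimate for a histogram."""
    n = len(samples)
    counts = {}
    for x in samples:
        j = math.floor((x - origin) / bin_width)
        counts[j] = counts.get(j, 0) + 1
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    scale = (n - 1) * bin_width
    return 2.0 / scale - (n + 1) / scale * sum_sq

def best_binning(samples, widths, offsets=(0.0, 0.25, 0.5, 0.75)):
    """Pick the (width, origin) pair with the lowest risk estimate.

    Offsets are fractions of a bin width, so each width is tried with
    several placements of the bin edges.
    """
    candidates = [(h, f * h) for h in widths for f in offsets]
    return min(
        candidates, key=lambda c: histogram_cv_risk(samples, c[0], c[1])
    )

# Synthetic arrival times in seconds (NOT real data), for demonstration only.
rng = random.Random(0)
samples = [rng.gauss(1800.0, 300.0) for _ in range(500)]
widths = [10, 30, 60, 120, 300, 600]
chosen = best_binning(samples, widths)
print("chosen (width, origin):", chosen)
```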

Tristan
A: 

You're right, most of the theory assumes a Poisson distribution of hits, which you don't have because the rate of hits varies with time of day. However, couldn't you stratify your data into, say, one block for each hour of the day and assume that within a single hour the distribution of hits per second/minute/whatever unit is approximately Poisson? There are probably better ways (from a theoretical perspective), but this way has the advantage of being simple to implement and simple to explain to anyone with any statistical background.
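
A rough sketch of the stratification, which also checks whether the Poisson assumption is plausible within each hour: for a Poisson distribution the variance equals the mean, so the ratio variance/mean of the per-second counts should be close to 1. The function name `hourly_dispersion` is invented here, and the code assumes that every hour appearing in the log was observed in full (3600 seconds); hours with no hits at all on a given day are simply not seen.

```python
import statistics
from collections import Counter, defaultdict
from datetime import datetime

FMT = "%d.%m.%Y %H:%M:%S"  # format of the sample rows in the question

def hourly_dispersion(stamps):
    """Map hour of day -> (mean hits per second, variance / mean)."""
    per_second = Counter(datetime.strptime(s, FMT) for s in stamps)

    # Group the non-zero per-second counts by (date, hour) block.
    blocks = defaultdict(list)
    for t, c in per_second.items():
        blocks[(t.date(), t.hour)].append(c)

    # Pad each block with zeros up to 3600 seconds
    # (assumption: the whole hour was logged).
    by_hour = defaultdict(list)
    for (_day, hour), nonzero in blocks.items():
        by_hour[hour].extend(nonzero + [0] * (3600 - len(nonzero)))

    result = {}
    for hour, counts in by_hour.items():
        mean = statistics.fmean(counts)
        var = statistics.pvariance(counts, mean)
        result[hour] = (mean, var / mean)
    return result

stamps = [
    "02.02.2010 09:00:00", "02.02.2010 09:00:00", "02.02.2010 09:00:00",
    "02.02.2010 09:00:01", "02.02.2010 09:00:03", "02.02.2010 09:00:05",
    "02.02.2010 09:00:06", "02.02.2010 09:00:06", "02.02.2010 09:00:09",
    "02.02.2010 09:00:11", "02.02.2010 09:00:11", "02.02.2010 09:00:11",
]
stats = hourly_dispersion(stamps)
for hour, (mean, dispersion) in sorted(stats.items()):
    print(hour, mean, dispersion)
```

A ratio well above 1 suggests bursty (overdispersed) traffic, in which case bounds computed from a plain Poisson distribution are likely to be too optimistic.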

dsimcha
A: 

I think you could argue that your hits are distributed according to a Poisson distribution where the average and variation vary with the time of day.

To get a good idea of the peak load I'd start with just a scatterplot with the time of the hit on the horizontal axis and the time between that hit and the next hit on the vertical axis.

This should give you a good idea of the height and duration of your peaks. Then you can estimate the parameters of the Poisson distribution for a sliding window of a length similar to that duration for every moment of the day. Sort of like a moving average. The areas where mean and variance are lowest will give you a good basis for estimating expected future peak load.
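
Something along these lines would produce the points for the scatterplot and the busiest sliding window. The function names and the 3-second window are assumptions for illustration only; dividing a window's hit count by its length gives the local rate estimate.

```python
from collections import deque
from datetime import datetime

FMT = "%d.%m.%Y %H:%M:%S"  # format of the sample rows in the question

def inter_arrival_points(times):
    """(hit time, seconds until the next hit) pairs for the scatterplot."""
    ordered = sorted(times)
    return [(a, (b - a).total_seconds()) for a, b in zip(ordered, ordered[1:])]

def sliding_window_peak(times, window_seconds):
    """Largest number of hits in any window of the given length.

    The window ending at hit time t covers hits with t - window < time <= t.
    """
    window = deque()
    peak = 0
    for t in sorted(times):
        window.append(t)
        # Drop hits that are too old to share a window with t.
        while (t - window[0]).total_seconds() >= window_seconds:
            window.popleft()
        peak = max(peak, len(window))
    return peak

stamps = [
    "02.02.2010 09:00:00", "02.02.2010 09:00:00", "02.02.2010 09:00:00",
    "02.02.2010 09:00:01", "02.02.2010 09:00:03", "02.02.2010 09:00:05",
    "02.02.2010 09:00:06", "02.02.2010 09:00:06", "02.02.2010 09:00:09",
    "02.02.2010 09:00:11", "02.02.2010 09:00:11", "02.02.2010 09:00:11",
]
times = [datetime.strptime(s, FMT) for s in stamps]

gaps = [gap for _, gap in inter_arrival_points(times)]
peak = sliding_window_peak(times, window_seconds=3)  # window length is arbitrary
print(gaps)
print("peak hits in a 3 s window:", peak, "-> rate", peak / 3.0, "per second")
```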

jilles de wit