ansaurus

Question

How should I analyze web traffic in a statistically correct way?

Answer 1

A:

Well, prepare yourself for a whole lot of 'what's wrong with AWStats/Webalizer/Analog-Stats/favourite-http-log-stats-viewer-of-the-month' responses...

They all do histograms, but that's because they are designed to give a broad, at-a-glance picture of visitor traffic.

I recommend that you take a look at Splunk to see if it meets your requirements.

ChrisGNZ 2010-02-17 09:22:59

Answer 2

A:

If you don't want to use a histogram, could you just graph the kernel density?
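
As a rough illustration of the kernel density idea, here is a minimal sketch of a Gaussian kernel density estimate over hit times, using only the Python standard library. The sample data, the `bandwidth` value and the function name `gaussian_kde` are made up for the example. This simple version does not wrap around midnight.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical hit times, in hours since midnight.
hit_hours = [8.9, 9.1, 9.2, 9.25, 9.4, 13.0, 13.1, 20.5]
density = gaussian_kde(hit_hours, bandwidth=0.5)

# Evaluate on a quarter-hour grid; these points are what you would plot.
grid = [h / 4 for h in range(96)]
curve = [(h, density(h)) for h in grid]
peak_hour = max(curve, key=lambda point: point[1])[0]
```

The bandwidth plays the role that bin width plays in a histogram: smaller values show more detail but also more noise.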

Jack 2010-02-17 09:35:42

Answer 3

A:

Can nearly concurrent visits be defined or approximated as those that occur in the same second? If yes, here is how I would proceed:

For each second in the data calculate the number of visits. This will include some seconds with 0 visits - don't exclude them.
It is probably reasonable to assume that the number of visits per second has a Poisson distribution with a rate that changes over the day, and perhaps over the week. So decide which predictors are relevant (time of day, day of the week, month?) and use Poisson regression to model the counts. You can use splines for the continuous variables (e.g. time of day); I believe there are even "cyclic" splines that take into account that 11:58 PM is close to 00:02 AM. Alternatively, you can cut time into small discrete pieces, say 10-minute intervals. If you want to be really fancy, incorporate autocorrelation and overdispersion in the model.
Based on the fitted model, you can estimate whatever percentile you want.
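
The steps above can be sketched in plain Python, under simplifying assumptions. This sketch uses the discrete-interval variant rather than splines: when the only predictors are indicators for 10-minute time-of-day bins, the Poisson regression maximum-likelihood fit reduces to the mean count per second in each bin. It ignores day-of-week effects, autocorrelation and overdispersion. All function names are illustrative, and timestamps are assumed to be seconds (for example Unix time).

```python
import math
from collections import Counter

def counts_per_second(timestamps, start, end):
    """Visits in each whole second of [start, end), zeros included."""
    hits = Counter(int(t) for t in timestamps if start <= t < end)
    return [hits.get(s, 0) for s in range(start, end)]

def rate_by_bin(counts, start, bin_seconds=600):
    """Mean visits per second for each time-of-day bin.

    With bin indicators as the only predictors, this per-bin mean is
    the Poisson regression maximum-likelihood estimate of the rate.
    """
    totals, seconds = Counter(), Counter()
    for offset, c in enumerate(counts):
        b = ((start + offset) % 86400) // bin_seconds
        totals[b] += c
        seconds[b] += 1
    return {b: totals[b] / seconds[b] for b in seconds}

def poisson_quantile(rate, q):
    """Smallest k with P(X <= k) >= q for X ~ Poisson(rate)."""
    if not 0.0 < q < 1.0 or not 0.0 <= rate < 700.0:
        raise ValueError("need 0 < q < 1 and 0 <= rate < 700")
    k = 0
    pmf = math.exp(-rate)  # P(X = 0)
    cdf = pmf
    while cdf < q:
        k += 1
        pmf *= rate / k    # P(X = k) from P(X = k - 1)
        cdf += pmf
    return k

# Tiny made-up example: three hits in a four-second log, 2-second bins.
counts = counts_per_second([0.2, 0.7, 2.5], start=0, end=4)
rates = rate_by_bin(counts, start=0, bin_seconds=2)
busiest = max(rates.values())
p99 = poisson_quantile(busiest, 0.99)
```

Here `p99` is the number of same-second visits that the fitted model expects to be exceeded in at most about 1% of seconds during the busiest bin.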

Of course, this is pretty fancy statistically, and you have to know what you are doing, but I think it could work.

Aniko 2010-02-17 21:29:06

Answer 4

+2 A:

You can always place a more flexible model on the arrival rate parameter. For instance, make the arrival rate a function of time, or place some time-series style model on it. Whatever makes sense for your data. The literature typically focuses on the core model because extensions are application specific.

In an extended model, you'll almost certainly want to use Bayesian methods. You are interested in the posterior predictive distribution of the object "almost concurrent events." A recent paper in JASA describes nearly your exact problem, applied to call center data:

Bayesian Forecasting of an Inhomogeneous Poisson Process With Applications to Call Center Data

For a quick solution, don't underestimate the power of histogram-style estimators. They are simple nonparametric estimators, and you can cross-validate tuning parameters like bin width and placement. Theoretically this is somewhat unsatisfying, but it would take only a day to implement. A fully Bayesian approach will likely dominate, but at significant computational cost.
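
To make the cross-validation idea concrete, here is a minimal sketch that scores a histogram density estimate with the standard leave-one-out cross-validation risk estimate, J(h) = 2 / ((n - 1) h) - (n + 1) / ((n - 1) h) * sum_j p_j^2, where p_j is the fraction of observations in bin j. Lower is better. The function names, the `origin` parameter (bin placement) and the data are assumptions made for the example.

```python
import math
from collections import Counter

def histogram_cv_risk(samples, binwidth, origin=0.0):
    """Leave-one-out cross-validation risk estimate for a histogram."""
    n = len(samples)
    bins = Counter(math.floor((x - origin) / binwidth) for x in samples)
    sum_p2 = sum((c / n) ** 2 for c in bins.values())
    return (2.0 - (n + 1) * sum_p2) / ((n - 1) * binwidth)

def best_binwidth(samples, candidates, origin=0.0):
    """Candidate bin width with the lowest estimated risk."""
    return min(candidates, key=lambda h: histogram_cv_risk(samples, h, origin))

# Hypothetical hit times (hours): two bursts, around 0-1 h and 5-6 h.
hit_hours = [0.1, 0.2, 0.3, 0.4, 5.1, 5.2, 5.3, 5.4]
chosen = best_binwidth(hit_hours, candidates=[1.0, 10.0])
```

The same score can be compared across several values of `origin` to tune bin placement as well as width.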

Tristan 2010-02-18 00:46:11

Answer 5

A:

You're right, most of the theory assumes a Poisson distribution of hits with a constant rate, which you don't have because the rate of hits varies with time of day. However, couldn't you stratify your data into, say, one block for each hour of the day and assume that within a single hour the distribution of hits per second/minute/whatever unit is approximately Poisson? There are probably better ways (from a theoretical perspective), but this way has the advantage of being simple to implement and simple to explain to anyone with any statistical background.
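
One way to check the per-hour assumption is sketched below: group per-second hit counts by hour of day and compare the variance with the mean in each block. For a Poisson distribution the two are equal, so a ratio far from 1 suggests that the hour is not well described by a single Poisson rate. The function name and the input layout (a list of per-second counts beginning at `start` seconds since midnight) are assumptions for the example.

```python
from collections import defaultdict
from statistics import mean, pvariance

def hourly_dispersion(counts, start=0):
    """Map hour of day -> (mean count, variance / mean) of per-second counts."""
    by_hour = defaultdict(list)
    for offset, c in enumerate(counts):
        by_hour[((start + offset) % 86400) // 3600].append(c)

    result = {}
    for hour, values in sorted(by_hour.items()):
        m = mean(values)
        # The ratio is undefined for an hour with no hits at all.
        ratio = pvariance(values) / m if m > 0 else float("nan")
        result[hour] = (m, ratio)
    return result

# Made-up data: hour 0 alternates 0 and 2 hits, hour 1 is a constant 3.
counts = [0, 2] * 1800 + [3] * 3600
stats = hourly_dispersion(counts)
```

In this toy data hour 0 has a ratio of 1, as a Poisson block would, while hour 1 has a ratio of 0 and is clearly too regular to be Poisson.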

dsimcha 2010-02-18 14:42:25

Answer 6

A:

I think you could argue that your hits are distributed according to a Poisson distribution where the average and variation vary with the time of day.

To get a good idea of the peak load I'd start with just a scatterplot with the time of the hit on the horizontal axis and the time between that hit and the next hit on the vertical axis.

This should give you a good idea of the height and duration of your peaks. Then you can estimate the parameters of the Poisson distribution for a sliding window of a length similar to that duration for every moment of the day, sort of like a moving average. The areas where mean and variance are lowest will give you a good basis for estimating expected future peak load.
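
A minimal sketch of both steps, assuming hit times are numbers in seconds: the first function produces the (hit time, time to next hit) pairs for the scatterplot, and the second estimates the Poisson rate in a sliding window as the hit count divided by the window length. The function names, the window length and the data are illustrative only.

```python
from bisect import bisect_left

def interarrival_points(hit_times):
    """(time of hit, gap until the next hit) pairs for the scatterplot."""
    ts = sorted(hit_times)
    return [(a, b - a) for a, b in zip(ts, ts[1:])]

def sliding_rate(hit_times, window, step):
    """(t, hits per second) for windows [t - window/2, t + window/2)."""
    ts = sorted(hit_times)
    if not ts:
        return []
    out = []
    t = ts[0]
    while t <= ts[-1]:
        lo = bisect_left(ts, t - window / 2)
        hi = bisect_left(ts, t + window / 2)
        # Poisson rate estimate: hits in the window over window length.
        out.append((t, (hi - lo) / window))
        t += step
    return out

# Made-up hit times with a burst between 10 s and 12 s.
hits = [0, 1, 2, 10, 10.5, 11, 11.5, 12, 20]
gaps = interarrival_points(hits)
rates = sliding_rate(hits, window=4, step=1)
peak = max(rates, key=lambda point: point[1])
```

Windows near the start and end of the log extend past the data, so the estimated rate there is biased downwards.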

jilles de wit 2010-02-19 10:53:56
