We had an ISP failure for about 10 minutes one day, which unfortunately occurred during a hosted exam that was being written from multiple locations.

This resulted in the loss of postback data for each candidate's current page in progress.

I can reconstruct the flow of events from the server log. However, of 317 candidates, 175 were using a local proxy, which means they all appear to come from the same IP. I've analyzed the data from the remaining 142 (45%), and come up with some good numbers as to what happened with them.

Question: How correct is it to multiply all my numbers by 317/142 to achieve probable results for the entire set? What would be my region of (un)certainty?

Please, no guesses. I need someone who didn't fall asleep in stats class to answer.

EDIT: By "numbers" I was referring to counts of affected individuals. For example, 5/142 showed evidence of a browser crash during the session. How correct is it to extrapolate that 11/317 had browser crashes?

A:

I'm not sure exactly what measurements we are talking about, but for now let's assume that you want something like the average score. No adjustment is necessary for estimating the mean score of the population (the 317 candidates). Just use the mean of the sample (the 142 whose data you analyzed).

To find your region of uncertainty you can use the formula given in the NIST statistics handbook. You must first decide how uncertain you are willing to be. Let's assume that you want 95% confidence that the true population mean lies within the interval. Then, the confidence interval for the true population mean will be:

(sample mean) +/- 1.960*(sample standard deviation)/sqrt(sample size)
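A minimal Python sketch of that formula, assuming the normal approximation; the `scores` list is placeholder data standing in for the 142 analyzed values:

    import math
    from statistics import mean, stdev

    def mean_confidence_interval(sample, z=1.96):
        """Approximate 95% CI for the population mean (normal approximation)."""
        m = mean(sample)
        half = z * stdev(sample) / math.sqrt(len(sample))
        return m - half, m + half

    # Placeholder scores; in practice, use the 142 analyzed candidates' values.
    scores = [72, 85, 90, 66, 78]
    lo, hi = mean_confidence_interval(scores)
    print(f"mean = {mean(scores):.1f}, 95% CI: ({lo:.1f}, {hi:.1f})")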

There are further corrections you can make to take credit for having a large sample relative to the population. They will tighten the confidence interval by about 1/4, but the calculation above already makes assumptions that render it less conservative. One assumption is that the scores are approximately normally distributed. The other is that the sample is representative of the population. You mentioned that the missing data are all from candidates using the same proxy; the subset of the population behind that proxy could be very different from the rest.
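The correction alluded to is presumably the finite population correction, which scales the half-width by sqrt((N - n)/(N - 1)). A quick check of the "about 1/4" figure:

    import math

    N, n = 317, 142                     # population and sample sizes
    fpc = math.sqrt((N - n) / (N - 1))  # finite population correction factor
    print(f"FPC = {fpc:.3f}")           # ~0.744, i.e. ~25% narrower interval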

EDIT: Since we are talking about a proportion of the sample with an attribute, e.g. "browser crashed", things are a little different. We need to use a confidence interval for a proportion, and convert it back to a number of successes by multiplying by the population size. This means that our best-guess estimate of the number of crashed browsers is 5*317/142 ~= 11 as you suggested.

If we once again ignore the fact that our sample is nearly half of the population, we can use the Wilson confidence interval for a proportion. A calculator is available online to handle the formula for you. Both the formula and the calculator give upper and lower limits for the proportion in the population. To get a range for the number of crashes, multiply the upper and lower limits by (population size - sample size) and add back the number of crashes observed in the sample. We could simply multiply by the population size to get the interval, but that would ignore what we already know about our sample.

Using the procedure above gives a 95% C.I. of 7.6 to 19.0 for the total number of browser crashes in the population of 317, based on 5 crashes in the 142 sample points.
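A short Python sketch of that procedure, computing the Wilson score interval directly rather than relying on the online calculator; it reproduces the figures above:

    import math

    def wilson_interval(successes, n, z=1.96):
        """Wilson score interval for a proportion (z = 1.96 for 95% confidence)."""
        p = successes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return center - half, center + half

    N, n, crashes = 317, 142, 5           # population, sample, observed crashes
    lo, hi = wilson_interval(crashes, n)

    # Scale the proportion bounds to the 175 unobserved candidates,
    # then add back the 5 crashes already observed in the sample.
    print(f"95% CI for total crashes: {lo * (N - n) + crashes:.1f}"
          f" to {hi * (N - n) + crashes:.1f}")   # -> 7.6 to 19.0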

Theran
Awesome! Thanks.
chris