tags:
views: 95
answers: 3

I'm trying to compare 100,000 records in a local database (L) with 100,000 records in a remote database (R).

Basically I want to know whether each element in L exists in R. To determine that, I have to make a request against R for each element of L, which takes a long time (I know, there should be a better way; there isn't, that's the API I've got).

So I would like to test a small sample of L against R, and then infer with some level of confidence how many are present in the whole R. How many do I have to test to have a 99% confidence level?
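A minimal sketch of what the sampling-and-checking step could look like, in Python; the integer record ids, the in-memory sets standing in for the two databases, and the `exists_in_remote` helper are all hypothetical:

```python
import random

# Hypothetical stand-ins: in reality L is a local query result and each
# exists_in_remote() call is one slow API request against R.
L = set(range(100_000))
R = set(range(50, 100_050))  # overlaps L except for the first 50 ids

def exists_in_remote(record_id):
    # Placeholder for the per-record API request against R.
    return record_id in R

random.seed(0)
sample = random.sample(sorted(L), 300)  # sample size; see the answers below
matches = sum(exists_in_remote(r) for r in sample)
print(f"{matches}/{len(sample)} sampled records found in R")
```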

A: 

Is this a trick question? It's 99%, right? After checking each one individually you'll know with 100% certainty whether or not it's in the remote database, so if you want to check the whole database to 99% accuracy, you have to check 99% of the records (99,000).

PaulG
It is not a trick question. You can read up on inferential statistics here: http://en.wikipedia.org/wiki/Statistical_inference
Jason Punyon
That might be true if the database were randomly generated, but in this case, the chances of unchecked records being correct goes up when checked records are found to be correct.
Erik Hermansen
Good point, Erik. You can't really make a statistical statement without some model of how the two databases were generated. If you're asking yourself "Did I remember to run that job last night?", it might be enough to look at one record!
John D. Cook
Ahh, I see; it makes some sense now that I think about it. We're talking probabilities rather than certainties.
PaulG
Right...it is not the case that every request changes the population, or that the population changes frequently
juwiley
+5  A: 

If you test N records from your local database and all are in the remote database, you can estimate the probability of a local record not being in the remote database as being between 0 and 3/N. This is called the "rule of three" in statistics. I explain it here.

The only way to know that all records are in both databases is to test all of them. But if you test 100 records, for example, you can estimate that the proportion of records not in both databases is below 3%.
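As a numeric sketch of that bound: if the true proportion of mismatches is p, the chance of observing N matches out of N is (1-p)^N, and solving (1-p)^N = 0.05 gives the exact 95% upper confidence limit that 3/N approximates (3 ≈ -ln 0.05):

```python
def exact_upper_bound(n):
    # Solve (1 - p)**n == 0.05 for p: the exact 95% upper confidence
    # limit on the mismatch proportion after n matches out of n.
    return 1 - 0.05 ** (1 / n)

for n in (100, 300, 1000):
    print(n, round(exact_upper_bound(n), 5), "vs 3/n =", round(3 / n, 5))
```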

John D. Cook
So the answer using this method is 300 records. If 300/300 match, then it's 99% *probable* that all 100,000 records match.
PaulG
Thanks for the answer, John. I'm confused though... 3/N is invariant? If R had 10^100 records, I could still test 100 records and infer that at most 3% of the records are not in both DBs?
juwiley
juwiley: That's right, the sample size you need does not depend on the population size.
John D. Cook
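This invariance can be illustrated with a small simulation (a sketch; the population sizes and the 2% mismatch rate are arbitrary choices, not from the thread):

```python
import random

random.seed(1)

def sample_mismatch_rate(pop_size, true_mismatch, n):
    # Build a population in which a fixed fraction of records mismatch,
    # then estimate that fraction from a sample of n records.
    mismatched = set(range(int(pop_size * true_mismatch)))
    sample = random.sample(range(pop_size), n)
    return sum(1 for r in sample if r in mismatched) / n

# Same n = 300, wildly different population sizes: the estimates are
# equally good, because only the sample size drives the precision.
for pop in (10_000, 10_000_000):
    print(pop, sample_mismatch_rate(pop, 0.02, 300))
```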
+3  A: 

I would also suggest experimental design for estimating a proportion p.

Suppose that we are interested in estimating the proportion p of the elements in L that also exist in R, and we would like to compute a 99% C.I. with a tolerance level (lvl) of plus or minus 3%. A "conservative" estimate (taking p = 0.5, which maximizes the variance p(1-p)) of the required size of the random sample is given by:

n = (z_{a/2})^2 / (4 * lvl^2)

In R:

CI  <- .99  # confidence level
lvl <- .03  # tolerance (half-width of the interval)
qnorm(1 - (1 - CI)/2, 0, 1)^2 / (4 * lvl^2)
[1] 1843.027
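The same computation can be reproduced in Python (a sketch of the formula above; `statistics.NormalDist` requires Python 3.8+):

```python
from math import ceil
from statistics import NormalDist

CI = 0.99
lvl = 0.03

z = NormalDist().inv_cdf(1 - (1 - CI) / 2)  # z_{a/2} for a 99% C.I.
n = z**2 / (4 * lvl**2)                     # conservative (p = 0.5) sample size
print(n, "-> round up to", ceil(n))         # ≈ 1843.03 -> 1844
```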

Check here for details

gd047