tags:
views: 95
answers: 3

I'm trying to compare 100,000 records in a local database (L) with 100,000 records in a remote database (R).

Basically I want to know whether each element in L exists in R. To determine that, I have to make a request against R for each element of L, which takes a long time (I know, there should be a better way; there isn't, that's the API I've got).

So I would like to test a small sample of L against R, and then infer with some level of confidence how many are present in the whole R. How many do I have to test to have a 99% confidence level?
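A minimal sketch of what the sampling-and-checking step could look like, in Python; the integer record ids, the in-memory sets standing in for the two databases, and the `exists_in_remote` helper are all hypothetical:

```python
import random

# Hypothetical stand-ins: in reality L is a local query result and each
# exists_in_remote() call is one slow API request against R.
L = set(range(100_000))
R = set(range(50, 100_050))  # overlaps L except for the first 50 ids

def exists_in_remote(record_id):
    # Placeholder for the per-record API request against R.
    return record_id in R

random.seed(0)
sample = random.sample(sorted(L), 300)  # sample size; see the answers below
matches = sum(exists_in_remote(r) for r in sample)
print(f"{matches}/{len(sample)} sampled records found in R")
```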

A: 

Is this a trick question? It's 99%, right? After checking each one individually you'll know with 100% certainty whether or not it's in the remote database, so if you want to check the whole database to 99% accuracy, you have to check 99% of the records (99,000).

PaulG
It is not a trick question. You can read up on inferential statistics here: http://en.wikipedia.org/wiki/Statistical_inference
Jason Punyon
That might be true if the database were randomly generated, but in this case, the chances of unchecked records being correct goes up when checked records are found to be correct.
Erik Hermansen
Good point, Erik. You can't really make a statistical statement without some model of how the two databases were generated. If you're asking yourself "Did I remember to run that job last night?", it might be enough to look at one record!
John D. Cook
Ahh, I see; it makes some sense now that I think about it. We're talking probabilities rather than certainties.
PaulG
Right...it is not the case that every request changes the population, or that the population changes frequently
juwiley
+5  A: 

If you test N records from your local database and all are in the remote database, you can estimate the probability of a local record not being in the remote database as being between 0 and 3/N. This is called the "rule of three" in statistics. I explain it here.

The only way to know that all records are in both databases is to test all of them. But if you test 100 records, for example, you can estimate that the proportion of records not in both databases is below 3%.
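As a numeric sketch of that bound: if the true proportion of mismatches is p, the chance of observing N matches out of N is (1-p)^N, and solving (1-p)^N = 0.05 gives the exact 95% upper confidence limit that 3/N approximates (3 ≈ -ln 0.05):

```python
def exact_upper_bound(n):
    # Solve (1 - p)**n == 0.05 for p: the exact 95% upper confidence
    # limit on the mismatch proportion after n matches out of n.
    return 1 - 0.05 ** (1 / n)

for n in (100, 300, 1000):
    print(n, round(exact_upper_bound(n), 5), "vs 3/n =", round(3 / n, 5))
```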

John D. Cook
So the answer using this method is 300 records. If 300/300 match, then it's 99% *probable* that all 100,000 records match.
PaulG
Thanks for the answer, John. I'm confused though... 3/N is invariant? If R had 10^100 records, I could still test 100 records and infer that at most 3% of the records are not in both DBs?
juwiley
juwiley: That's right, the sample size you need does not depend on the population size.
John D. Cook
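This invariance can be illustrated with a small simulation (a sketch; the population sizes and the 2% mismatch rate are arbitrary choices, not from the thread):

```python
import random

random.seed(1)

def sample_mismatch_rate(pop_size, true_mismatch, n):
    # Build a population in which a fixed fraction of records mismatch,
    # then estimate that fraction from a sample of n records.
    mismatched = set(range(int(pop_size * true_mismatch)))
    sample = random.sample(range(pop_size), n)
    return sum(1 for r in sample if r in mismatched) / n

# Same n = 300, wildly different population sizes: the estimates are
# equally good, because only the sample size drives the precision.
for pop in (10_000, 10_000_000):
    print(pop, sample_mismatch_rate(pop, 0.02, 300))
```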
+3  A: 

I would also suggest experimental design for estimating a proportion p.

Suppose that we are interested in estimating the proportion p of the elements in L that also exist in R, and we would like to compute a 99% C.I. with a tolerance level (lvl) of plus or minus 3%. A "conservative" estimate (taking p = 0.5, which maximizes the variance p(1-p)) of the required size of the random sample is given by:

n = (z_{a/2})^2 / (4 * lvl^2)

In R:

CI  <- .99  # confidence level
lvl <- .03  # tolerance (half-width of the interval)
qnorm(1 - (1 - CI)/2, 0, 1)^2 / (4 * lvl^2)
[1] 1843.027
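The same computation can be reproduced in Python (a sketch of the formula above; `statistics.NormalDist` requires Python 3.8+):

```python
from math import ceil
from statistics import NormalDist

CI = 0.99
lvl = 0.03

z = NormalDist().inv_cdf(1 - (1 - CI) / 2)  # z_{a/2} for a 99% C.I.
n = z**2 / (4 * lvl**2)                     # conservative (p = 0.5) sample size
print(n, "-> round up to", ceil(n))         # ≈ 1843.03 -> 1844
```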

Check here for details

gd047