I need to explain to the client why dupes are showing up between 2 supposedly different exams. It's been 20 years since Prob and Stats.

I have a generated multiple-choice exam. There are 192 questions in the database, and 100 are chosen at random for each exam (no dupes within an exam).

Obviously, there is a 100% chance of there being at least 8 dupes between any two exams so generated. (Pigeonhole principle)

How do I calculate the probability of there being 25 dupes? 50 dupes? 75 dupes?

-- Edit after the fact -- I ran this through Excel, summing the probabilities from n to 100. For this particular problem, the probabilities were:

n   P(n+ dupes)
40  97.5%
52  ~50% 
61  ~0
A: 

It's probably higher than you think. I won't attempt to duplicate this article: http://en.wikipedia.org/wiki/Birthday_paradox

Chris
Please use [link text](URL) to create a clickable link.
cjm
done, hit code button rather than hyperlink :S
Chris
I looked at that, and it's great for finding the probability of a single dupe, but it's a bit tougher to come up with the dupe probability distribution.
chris
+1  A: 

Once you've created the first exam, there are 92 questions that have never been used, and 100 that have. If you now generate another exam with 100 questions in it, you are choosing from that same pool of 92 unused questions and 100 used ones. Clearly you are going to get quite a few duplicates.

Each question in the second exam has a 100/192 chance of having appeared in the first, so you would expect to get (100/192) * 100 duplicates, i.e. in any two randomly chosen exams, there will (on average) be about 52 duplicate questions.

If you want the probability that there are 25, or 75, or whatever, then you have two choices.

a) Work out the maths

b) Simulate a few runs on a computer
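For option (b), here is a minimal simulation sketch in Python. The function name `simulate_overlap`, the trial count, and the fixed seed are my own choices for illustration, not anything from the question. It draws two exams of 100 from 192 questions many times and tallies how many questions they share.

```python
import random

def simulate_overlap(pool=192, exam=100, trials=5000, seed=0):
    """Tally how often two random exams share exactly n questions.

    Returns a list `counts` where counts[n] is the number of trials
    in which the two exams had exactly n questions in common.
    """
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    questions = range(pool)
    counts = [0] * (exam + 1)
    for _ in range(trials):
        first = set(rng.sample(questions, exam))   # no dupes within an exam
        second = rng.sample(questions, exam)
        dupes = sum(1 for q in second if q in first)
        counts[dupes] += 1
    return counts

counts = simulate_overlap()
trials = sum(counts)
mean = sum(n * c for n, c in enumerate(counts)) / trials
print(f"average dupes over {trials} trials: {mean:.2f}")
for n in (25, 50, 75):
    # Estimated probability of n or more dupes
    print(n, sum(counts[n:]) / trials)
```

The average should land close to the expected value of about 52. A simulation like this only gives rough estimates for the rare cases, so very small probabilities will just show up as 0.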

Airsource Ltd
You should say that the **expected** number of duplicates is 52.
David Nehme
indeed. Corrected.
Airsource Ltd
+2  A: 

Erm, this is really really hazy for me. But there are (192 choose 100) possible exams, right?

And there are (100 choose N) ways of picking N dupes, each with (92 choose 100-N) ways of picking the rest of the questions, no?

So isn't the probability of picking N dupes just:

(100 choose N) * (92 choose 100-N) / (192 choose 100)

EDIT: So if you want the chances of N or more dupes instead of exactly N, you have to sum the top half of that fraction over every dupe count from N up to 100, and then divide by (192 choose 100).

Errrr, maybe...
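If you'd rather check this without a spreadsheet, here's a rough Python sketch of the formula above, assuming Python 3.8+ for `math.comb`. The function names `p_exact` and `p_at_least` are just ones I made up for this example.

```python
from math import comb

def p_exact(n, pool=192, exam=100):
    """Probability that two random exams share exactly n questions."""
    if n < 0 or n > exam:
        return 0.0
    # comb(a, b) is 0 when b > a, so impossible counts (n < 8 here) give 0
    return comb(exam, n) * comb(pool - exam, exam - n) / comb(pool, exam)

def p_at_least(n, pool=192, exam=100):
    """Probability that two random exams share n or more questions."""
    start = max(n, 0)
    # Sum the numerators as exact integers, then divide once at the end
    total = sum(
        comb(exam, k) * comb(pool - exam, exam - k)
        for k in range(start, exam + 1)
    )
    return total / comb(pool, exam)

for n in (25, 50, 75):
    print(n, p_at_least(n))
```

Summing the integer numerators first and dividing once avoids piling up floating-point rounding from a hundred small terms.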

That looks good! I'll wait for criticism before accepting.
chris
Looks good to me, but that's the probability of exactly N duplicates. To get the probability of at least N duplicates, which I think is what chris is interested in, one has to sum a bit.
Maciej Hehl
@Maciej: Doh, of course. Thanks
Yeah, that did it. I figured out the summation on my own.
chris