views: 1513

answers: 3

How do I take an efficient simple random sample in SQL? The database in question is running MySQL; my table is at least 200,000 rows, and I want a simple random sample of about 10,000.

The "obvious" answer is to:

SELECT * FROM table ORDER BY RAND() LIMIT 10000

For large tables, that's too slow: it calls RAND() for every row (which already puts it at O(n)), and sorts them, making it O(n lg n) at best. Is there a way to do this faster than O(n)?

A: 

This answer might help, especially the linked article:

http://stackoverflow.com/questions/211329/quick-selection-of-a-random-row-from-a-large-table-in-mysql#211388

Vinko Vrsalovic
The answer you linked has a LIMIT 1 at the end of the query. I don't see a way to extend it to 10,000 rows; changing the limit would just select 10,000 entries with adjacent IDs, from a random starting point.
ojrac
Yes, but did you see the first attempts in the linked article? Something along the lines of those early tries (which were wrong because they called rand() more than once) might fit this case. If I get a chance I'll give it a shot :)
Vinko Vrsalovic
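
For reference, the trick from that linked answer looks roughly like this (a sketch, not the exact posted query; t and its AUTO_INCREMENT column id are stand-in names, and it assumes the ids are reasonably dense):

SELECT t.*
FROM t
JOIN (SELECT FLOOR(MAX(id) * RAND()) AS rid FROM t) AS pick
  ON t.id >= pick.rid
ORDER BY t.id
LIMIT 1

Putting RAND() in a derived table makes it run once instead of once per row; as the comments above say, though, raising the LIMIT only yields consecutive ids from a random starting point.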
A: 

Maybe you could do

SELECT * FROM table LIMIT 10000 OFFSET FLOOR(RAND() * 190000)
staticsan
It looks like that would select a random slice of my data; I'm looking for something a little more complicated: 10,000 randomly distributed rows.
ojrac
Then your only option, if you want to do it in the database, is ORDER BY rand().
staticsan
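
A side note on that one-liner: MySQL won't accept an expression like RAND() inside LIMIT or OFFSET, so to actually run it you have to fix the offset first, e.g. with a prepared statement (a sketch; t and the 190000 bound are stand-ins, as above):

SET @skip := CAST(FLOOR(RAND() * 190000) AS UNSIGNED);  -- random starting row
PREPARE stmt FROM 'SELECT * FROM t LIMIT 10000 OFFSET ?';
EXECUTE stmt USING @skip;
DEALLOCATE PREPARE stmt;

It still returns one contiguous slice, as discussed above, just starting at a random spot.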
+3  A: 
I would recommend adding a unique index on the random key selection and perhaps ignoring duplicates on the insert; then you can get rid of the DISTINCT stuff, and the join will be faster.
Sam Saffron
I think the random number algorithm could use some tweaks: either a UNIQUE constraint as mentioned, or just generate 2*m numbers and SELECT DISTINCT, ORDER BY id, LIMIT m (first-come-first-served, so this reduces to the UNIQUE constraint). I like it.
ojrac
As to adding a unique index on the random key selection and then ignoring duplicates on insert, I thought this might get you back to O(m^2) behavior instead of O(m lg m) for a sort; I'm not sure how efficiently the server maintains the index when inserting random rows one at a time.
As for the suggestions to generate 2*m numbers or so, I wanted an algorithm guaranteed to work no matter what. There's always a (slim) chance that your 2*m random numbers will contain more than m duplicates, so you won't have enough for your query.
As long as you pay attention to the birthday paradox, you can easily generate a quantity of random numbers with an astronomically low chance of yielding fewer than m unique values. But, at worst, you could always generate another m keys until you've got enough unique ones. ;)
ojrac
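
(Rough arithmetic on that: with n = 200,000 ids and k random draws, the expected number of collisions is about k(k-1)/2n, so drawing k = 10,500 for a target of m = 10,000 costs only ~276 expected duplicates, and the chance of ending up with fewer than m distinct values is negligible.)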
The way I suggested it, since the chance of duplicates is astronomically low anyway, I just generate one new key at a time if necessary; it's very unlikely we'd even need one more.
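
Pulling the thread's suggestions together, a sketch of the unique-index variant (names are stand-ins: t is the base table with a dense AUTO_INCREMENT id, and the target sample size is 10,000):

CREATE TEMPORARY TABLE picks (
  id INT UNSIGNED NOT NULL,
  UNIQUE KEY (id)  -- duplicate candidates are rejected at insert time, so no DISTINCT later
);

-- One candidate per statement; INSERT IGNORE silently drops collisions.
-- Drive this in a loop from application code until the join below yields
-- enough rows (ids missing from t just cost a few extra iterations).
INSERT IGNORE INTO picks (id)
SELECT FLOOR(1 + RAND() * (SELECT MAX(id) FROM t));

SELECT t.*
FROM t
JOIN picks USING (id);

Looping one candidate at a time matches the comment above; batching candidates (the 2*m idea) works just as well, since the unique key dedupes either way.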