ansaurus

Question

Simple Random Samples from a (My)Sql database

Answer 1

A:

This answer might help, especially the linked article

http://stackoverflow.com/questions/211329/quick-selection-of-a-random-row-from-a-large-table-in-mysql#211388

Vinko Vrsalovic 2008-10-30 04:53:30

The answer you linked has a LIMIT 1 at the end of the query. I don't see a way to extend it to 10,000 rows; changing the limit would just select 10,000 entries with adjacent IDs, from a random starting point.

ojrac 2008-10-30 04:59:23

Yes, did you see the first attempts in the linked article? I think something along the lines of the first tries (which were wrong because of doing rand() more than once) might fit this case. If I get a chance I'll give it a shot :)

Vinko Vrsalovic 2008-10-30 05:26:21

Answer 2

A:

Maybe you could do

SELECT * FROM table LIMIT 10000 OFFSET FLOOR(RAND() * 190000)

staticsan 2008-10-30 05:29:34

It looks like that would select a random slice of my data; I'm looking for something a little more complicated -- 10,000 randomly-distributed rows.

ojrac 2008-10-30 05:35:35

Then your only option, if you want to do it in the database, is ORDER BY rand().

staticsan 2008-11-03 00:29:03

Answer 3

+3 A:

2008-10-31 03:59:18

I would recommend adding a unique index on the random key selection and perhaps ignoring duplicates on the insert, then you can get rid of the distinct stuff and the join will be faster.

Sam Saffron 2008-10-31 07:08:53

I think the random number algorithm could use some tweaks -- either a UNIQUE constraint as mentioned, or just generate 2*m numbers, and SELECT DISTINCT, ORDER BY id (first-come-first-serve, so this reduces to the UNIQUE constraint) LIMIT m.I like it.

ojrac 2008-10-31 15:15:37

As to adding a unique index to the random key selection and then ignoring duplicates on insert, I thought this may get you back to O(m^2) behavior instead of O(m lg m) for a sort. Not sure how efficient the server is maintaining the index when inserting random rows one at a time.

2008-10-31 16:02:28

As to suggestions to generate 2*m numbers or something, I wanted an algorithm guaranteed to work no matter what. There's always the (slim) chance that your 2*m random numbers will have more than m duplicates, so you won't have enough for your query.

2008-10-31 16:05:18

As long as you pay attention to the birthday paradox, you can easily generate a quantity of random numbers with an astronomically low chance of <m unique values. But, at worst, you could always generate another m keys until you've got enough unique ones. ;)

ojrac 2008-11-01 17:10:14

The way I suggested, since the chance of duplicates would be astronomically low anyway, I just generate one new one at a time if necessary. Very unlikely we even need one more.

2008-11-02 02:25:28

ansaurus

tags:

views:

answers:

Simple Random Samples from a (My)Sql database

related questions