tags:

views:

287

answers:

6

I've read that it isn't a good idea to run SQL queries with ORDER BY RAND() on large databases.

So here's my shot at breaking up the code. The code needs to select 10 random ids from the database, then do a second select to grab the random rows.

$sql = "SELECT id FROM table WHERE image != ''
     ORDER BY id DESC LIMIT 50;";

$result = mysql_query($sql);


while($row = mysql_fetch_array($result))
{
    foreach($row as $key => $value)
    {
     $array[] = $value;
    }
}
$rand_keys = array_rand($array, 10);

foreach($rand_keys as $value)
{

    $rand_arr[] = $array[$value];

}
$rand_list = implode("," , $rand_arr);

$sql = "SELECT image FROM table
     WHERE image != ''
     AND id IN ($rand_list)";
$result = mysql_query($sql);

Any suggestions to speed up or simplify?

+1  A: 

Five Four things:

  1. Why are you fetching 50 ids if you only need 12? (You're picking 12 ids from the last 50 - that makes sense, though is not particularly random in the general sense of the word - is that intentionally the subset of your rows you want to pick random rows from?).

  2. Have you profiled the SQL statement ORDER BY RAND() to see if it's slow for you? How large is your dataset?

  3. You don't need the WHERE image != '' in the last query, since you've already only picked out ids for which image != ''.

  4. Why are you doing array_rand($array, 10) - you say you want 12 values?

  5. You can simplify the picking out of random values like this:

$rand_arr = array_rand(array_flip($array), 12);

Dominic Rodger
I would really avoid `order by rand()` as a general habit. I've seen this grind busy MyISAM tables of only a few hundred thousand rows.
memnoch_proxy
C'mon now Dominic. #3 and #4 are obvious typo errors/overlooked ideas. My mistake. #5 was very helpful. #2 is on my mind. I'm running this code on a shared server and it brought my website to a halt (same effect with ORDER BY rand() actually.) This has made me rethink random rows entirely, ie. serving random images.
rrrfusco
A: 

I'd agree with points 1 and 2 above - if you can perform the selection of random data within your application at the same level at which the data reside, the less cruft you'll need to write to do the same thing.

Sasha
A: 

There's no particularly great way to do this elegantly.

But you can hack at it from a number of directions. If your dataset is about the right size (too big for "order by rand()" but not too big), have sequential id values, and don't generally delete much, you could always do something like this:

SELECT MIN(id) as min, MAX(id) as max FROM table

Generate some number N of random integers between "min" and "max" (inclusive). Let's call it 50. If you never delete anything from the table, N can probably be 12. If you delete, do some napkin arithmatic and figure out a good number. You can probably err on the high side.

SELECT * FROM table WHERE id IN (<your set of integers>) AND image_id = '' LIMIT 12;

Check to make sure you've got at least 12 results. If not, basically repeat and combine.

For large sets, this method should work much better than ORDER BY RAND(), especially if your ID sequence isn't very sparse.

timdev
I like this approach but it's only appropriate in a single-master setup where auto_increment is monotonic: auto_increment_increment, auto_increment_offset settings could spread those ids out.
memnoch_proxy
Good point. This only works up until you start dealing with multiple masters.
timdev
A: 

I would focus on Dominic's point 5 as a fairly low impact way to retrieve data randomly. You can sort() that list of IDs also (I believe MySQL detects this and skips sorting that list for you.)

There are other techniques for large data sets and high request rates that involve materialized views (essentially caching a table). Are trying to solve an existing performance problem on a big, busy table?

memnoch_proxy
A: 

Another option is to use a random seeded hash function and sort on that.

Retrieve the maximum and minimum ids for the table, and use PHP's rand() function to generate a random number between the max and min.

Then use that number to seed you hash function .Assume {salt} in the SQL means the random integer generated by PHP

SELECT image FROM table
WHERE image != ''
ORDER BY MOD(ABS({salt}-id), MOD({salt}, 10)), ABS({salt}-id));

You could optimize a bit performing the MOD({salt}, 10) calculation in PHP and passing the value in the query.

jonthornton
A: 

If the rowsize is not excessively large, I would profile just selecting 50 rows and keeping a random list of 12 of them in the application. Yes, that means you are throwing away 80% of the selected rows. When you are talking 80% of 50 is that really a crime? This is the kind of thing SQL is not good at.

jmucchiello