tags:

views:

1064

answers:

8

I want to show a random record from the database. I would like to be able to show X number of random records if I choose. Therefore I need to select the top X records from a randomly selected list of IDs

(There will never be more than 500 records involved to choose from, unless the earth dramatically increases in size. Currently there are 66 possibles.)

This function works, but how can I make it better?

/***************************************************/
/* RandomSite */
//****************/
//  Returns an array of random site IDs or NULL
/***************************************************/   
function RandomSite($intNumberofSites = 1) {
    $arrOutput = NULL;
    //open the database
    GetDatabaseConnection('dev');

    //inefficient
    //$strSQL = "SELECT id FROM site_info WHERE major <> 0 ORDER BY RAND() LIMIT ".$intNumberofSites.";";

    //Not wonderfully random
    //$strSQL = "SELECT id FROM site_info WHERE major <> 0 AND id >= (SELECT FLOOR( COUNT(*) * RAND()) FROM site_info ) ORDER BY id LIMIT ".$intNumberofSites.";";

    //Manual selection from available pool of candidates  ?? Can I do this better ??
    $strSQL = "SELECT id FROM site_info WHERE major <> 0;";

    if (is_numeric($intNumberofSites))
    {
     //excute my query
     $result = @mysql_query($strSQL);
     $i=-1;

     //create an array I can work with  ?? Can I do this better ??
     while ($row = mysql_fetch_array($result, MYSQL_NUM))
     {
      $arrResult[$i++] = $row[0];
     }

     //mix them up
     shuffle($arrResult);

     //take the first X number of results  ?? Can I do this better ??
     for ($i=0;$i<$intNumberofSites;$i++)
     {
      $arrOutput[$i] = $arrResult[$i];
     }
    } 

    return $arrOutput;
    }

UPDATE QUESTION: I know about the ORDER BY RAND(), I just don't want to use it because there are rumors it isn't the best at scaling and performance. I am being overly critical of my code. What I have works, ORDER BY RAND() works, but can I make it better?

MORE UPDATE There are holes in the IDs. There is not a ton of churn, but any churn that happens needs to be approved by our team, and therefore could handled to dump any caching.

Thanks for the replies!

+4  A: 

Why not use the Rand Function in an orderby in your database query? Then you don't have to get into randomizing etc in code...

Something like (I don't know if this is legal)

Select *
from site_info
Order by Rand()
LIMIT N

where N is the number of records you want...

EDIT
Have you profiled your code vs. the query solution? I think you're just pre-optimizing here.

Jason Punyon
That's commented out in his code, labeled as inefficient.
Chad Birch
Well don't I feel Sheepish...
Jason Punyon
And it totally works, but the word on the web is it doesn't scale well. I am being hyper-optimized for the sake of being the best I can be =P
MrChrister
You only have 500 records if the world dramatically increases in size...How efficient are you trying to get?
Jason Punyon
I totally am pre optimizing. This is more code golf than solution delivery
MrChrister
You get the answer because your solution wasn't the fastest (by 8/100 of second in my test) , but looked the nicest with good performance
MrChrister
+1  A: 
mysql_query("SELECT id FROM site_info WHERE major <> 0 ORDER BY RAND() LIMIT $intNumberofSites")

EDIT Damn, JPunyon was a bit quicker :)

Matajon
A: 

I would simply use the rand() function (I assume you are using MySQL)...

SELECT id, rand() as rand_idx FROM site_info WHERE major <> 0 ORDER BY rand_idx LIMIT x;
+2  A: 

If you dont want to select with order by rand().

Instead of shuffeling, use array_rand on the result:

$randKeys = array_rand($arrResult, $intNumberofSites);
$arrOutput = array_intersect_key(array_flip($randKeys), $arrResult);

edit: return array of keys not new array with key => value

OIS
But why select all rows and push them all into the array, when you need only portion of them. It is waste of resources and memory.
Matajon
Id go with the rand() in select. But if he wants to pull them all to an array, or if they are fairly static and he wants to cache them, this is another way then the shuffle, for. Which turned out not to be soo great as I first thought ... :P
OIS
I wanted to give you the points, but your solution (as you noted) was on average the slowest over 10,000 iterations. I didn't implement any caching for the test, so you might have won there. Good answer though, I learned stuff which was the point
MrChrister
A: 

Well, I don't think that ORDER BY RAND() would be that slow in a table with only 66 rows, but you can look into a few different solutions anyway.

Is the data really sparse and/or updated often (so there are big gaps in the ids)?

Assuming it's not very sparse, you could select the max id from the table, use PHP's built-in random function to pick N distinct numbers between 1 and the max id, and then attempt to fetch the rows with those ids from the table. If you get back less rows than you picked numbers, get more random numbers and try again, until you have the number of rows needed. This may not be particularly fast either.

If the data is sparse, I would set up a secondary "id-type" column that you make sure is sequential. So if there are 66 rows in the table, ensure that the new column contains the values 1-66. Whenever rows are added to or removed from the table, you will have to do some work to adjust the values in this column. Then use the same technique as above, picking random IDs in PHP, but you don't have to worry about the "missing ID? retry" case.

Chad Birch
A: 

Try this:

SELECT
  @nv := @min + (RAND() * (@max - @min)) / @lc,
  (
  SELECT
    id
  FROM  site_info
  FORCE INDEX (primary)
  WHERE id > @nv
  ORDER BY
    id
  LIMIT 1
  ),
  @max,
  @min := @nv,
  @lc := @lc - 1
FROM
  (
  SELECT @min := MIN(id)
  FROM site_info
  ) rmin,
  (
  SELECT @max := MAX(id)
  FROM site_info
  ) rmax,
  (
  SELECT @lc := 5
  ) l,
  site_info
LIMIT 5

This will select a random ID on each iteration using index, in descending order.

There is slight chance, though, that you get less results that you wanted, as it gives no second chance to the missed id's.

The more percent of rows you select, the bigger is the chance.

Quassnoi
A: 

I'm with JPunyon. Use ORDER BY RAND() LIMIT $N. I think you'll get a bigger performance hit from $arrResult having and shuffling so many (unused) entries than from using the MySQL RAND() function.

function getSites ( $numSites = 5 ) {

    // Sanitize $numSites if necessary

    $result = mysql_query("SELECT id FROM site_info WHERE major <> 0 "
                         ."ORDER BY RAND() LIMIT $numSites");

    $arrResult = array();

    while ( $row = mysql_fetch_array($result,MYSQL_NUM) ) {
        $arrResult[] = $row;
    }

    return $arrResult;
}
James Socol
I'm willing to bet that the network (MySQL server to PHP server) contributes an order of magnitude or more in transferring the unneeded items than having and shuffling them.
Jason Punyon
I'm sure you're right, but either way you'd want to transfer as little between the database and web server as possible.
James Socol
A: 

Here are the three functions I wrote and tested

My answer

/***************************************************/
/* RandomSite1 */
//****************/
//  Returns an array of random rec site IDs or NULL
/***************************************************/   
function RandomSite1($intNumberofSites = 1) {
    $arrOutput = NULL;
    GetDatabaseConnection('dev');
    $strSQL = "SELECT id FROM site_info WHERE major <> 0;";
    if (is_numeric($intNumberofSites))
    {
     $result = @mysql_query($strSQL);
     $i=-1;
     while ($row = mysql_fetch_array($result, MYSQL_NUM)) {
      $arrResult[$i++] = $row[0]; }
     //mix them up
     shuffle($arrResult);
     for ($i=0;$i<$intNumberofSites;$i++) {
      $arrOutput[$i] = $arrResult[$i]; }
    } 
    return $arrOutput;
    }

JPunyon and many others

/***************************************************/
/* RandomSite2 */
//****************/
//  Returns an array of random rec site IDs or NULL
/***************************************************/   
function RandomSite2($intNumberofSites = 1) {
    $arrOutput = NULL;
    GetDatabaseConnection('dev');
    $strSQL = "SELECT id FROM site_info WHERE major<>0 ORDER BY RAND() LIMIT ".$intNumberofSites.";";
    if (is_numeric($intNumberofSites))
    {
     $result = @mysql_query($strSQL);
     $i=0;
     while ($row = mysql_fetch_array($result, MYSQL_NUM)) {
      $arrOutput[$i++] = $row[0]; }
    } 
    return $arrOutput;
    }

OIS with a creative solution meeting the intend of my question.

/***************************************************/
/* RandomSite3 */
//****************/
//  Returns an array of random rec site IDs or NULL
/***************************************************/   
function RandomSite3($intNumberofSites = 1) {
    $arrOutput = NULL;
    GetDatabaseConnection('dev');
    $strSQL = "SELECT id FROM site_info WHERE major<>0;";
    if (is_numeric($intNumberofSites))
    {
     $result = @mysql_query($strSQL);
     $i=-1;
     while ($row = mysql_fetch_array($result, MYSQL_NUM)) {
      $arrResult[$i++] = $row[0]; }
     $randKeys = array_rand($arrResult, $intNumberofSites);
     $arrOutput = array_intersect_key($randKeys, $arrResult);
    } 
    return $arrOutput;
    }

I did a simple loop of 10,000 iterations where I pulled 2 random sites. I closed and opened a new browser for each function, and cleared the cached between run. I ran the test 3 times to get a simple average.

NOTE - The third solution failed at pulling less than 2 sites as the array_rand function has different output if it returns a set or single result. I got lazy and didn't fully implement the conditional to handle that case.

  • 1 averaged: 12.38003755 seconds
  • 2 averaged: 12.47702177 seconds
  • 3 averaged: 12.7124153 seconds
MrChrister