views:

9710

answers:

15

What is the best way to request a random row in pure SQL?

+8  A: 

Dunno how efficient this is, but I've used it before:

SELECT TOP 1 * FROM MyTable ORDER BY newid()

Because GUIDs are pretty random, the ordering means you get a random row.
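
For anyone who wants to try the idea without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module. It is an analogue, not the query above: SQLite's random() stands in for newid(), LIMIT 1 stands in for TOP 1, and the table my_table and its contents are made up for the demo.

```python
import sqlite3

# Hypothetical demo table; SQLite's random() plays the role of NEWID()/RAND().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO my_table (name) VALUES (?)",
    [(f"row-{i}",) for i in range(1, 101)],
)


def random_row(conn):
    """Return one row chosen by sorting on a per-row random key."""
    # Every row gets its own random sort key, so the whole table is
    # read and sorted before the first row is handed back.
    return conn.execute(
        "SELECT id, name FROM my_table ORDER BY random() LIMIT 1"
    ).fetchone()


# Draw repeatedly to see that different rows come back.
picked = {random_row(conn)[0] for _ in range(200)}
print(len(picked), "distinct ids seen in 200 draws")
```

The sort over every row is what makes this approach simple but potentially slow on large tables.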

Matt Hamilton
A: 
 SELECT * FROM table ORDER BY RAND() LIMIT 1
yjerem
+6  A: 

You didn't say which server you're using. In older versions of MSSQL, you can use this:

select top 1 * from mytable order by newid()

In SQL Server 2005 and up, you can use TABLESAMPLE to get a random sample that's repeatable:

SELECT FirstName, LastName
FROM Contact 
TABLESAMPLE (1 ROWS) ;
Jon Galloway
MSDN says newid() is preferred over tablesample for truly random results: http://msdn.microsoft.com/en-us/library/ms189108.aspx
Andrew Hedges
+9  A: 

See this post: SQL to Select a random row from a database table. It goes through methods for doing this in MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2 and Oracle.

Yaakov Ellis
+1  A: 

The best way is to put a random value in a new column just for that purpose and use something like this (pseudocode + SQL):

randomNo = random()
execSql("SELECT TOP 1 * FROM MyTable WHERE MyTable.Randomness > $randomNo ORDER BY MyTable.Randomness")

This is the solution employed by the MediaWiki code. Of course, there is some bias against smaller values, but they found that it was sufficient to wrap the random value around to zero when no rows are fetched.
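
A minimal sketch of the random-column idea with wrap-around, using Python's sqlite3 module. This is not MediaWiki's actual code: the table my_table, the randomness column, the index name and the helper random_row are all hypothetical, and the comparison uses >= so that a threshold equal to a stored value still matches.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, name TEXT, randomness REAL)"
)
# The index lets the database seek straight to the threshold.
conn.execute("CREATE INDEX idx_randomness ON my_table (randomness)")
conn.executemany(
    "INSERT INTO my_table (name, randomness) VALUES (?, ?)",
    [(f"row-{i}", random.random()) for i in range(1000)],
)


def random_row(conn, r=None):
    """Return the first row at or above a random threshold, wrapping around."""
    if r is None:
        r = random.random()
    row = conn.execute(
        "SELECT id, name, randomness FROM my_table "
        "WHERE randomness >= ? ORDER BY randomness LIMIT 1",
        (r,),
    ).fetchone()
    if row is None:
        # The threshold was above every stored value: wrap to the smallest.
        row = conn.execute(
            "SELECT id, name, randomness FROM my_table "
            "ORDER BY randomness LIMIT 1"
        ).fetchone()
    return row


print(random_row(conn))
```

The stored values are assigned once at insert time here; how often to refresh them is a separate decision.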

The newid() solution may require a full table scan so that each row can be assigned a new GUID, which will be much slower.

The rand() solution may not work at all (e.g. with MSSQL) because the function is evaluated just once, so every row is assigned the same "random" number.

Ishmaeel
Wrapping around when you get 0 results provides a provably random sample (not just "good enough"). This solution *almost* scales to multi-row queries (think "party shuffle"). The problem is that results tend to be selected in the same groups repeatedly. To get around this, you would need to re-distribute the random numbers you have just used. You could cheat by keeping track of randomNo and setting it to max(randomness) from results, but then p(row i on query 1 AND row i on query 2) == 0, which isn't fair. Let me do some maths, and I'll get back to you with a truly fair scheme.
alsuren
+32  A: 

Solutions like Jeremie's:

SELECT * FROM table ORDER BY RAND() LIMIT 1

work, but they need a sequential scan of the whole table (the random value associated with each row has to be calculated so that the smallest one can be determined), which can be quite slow even for medium-sized tables. My recommendation would be to use some kind of indexed numeric column (many tables have these as their primary keys), and then write something like:

SELECT * FROM table WHERE num_value >= RAND() * (SELECT MAX(num_value) FROM table) LIMIT 1

This works in constant time, regardless of the table size, if num_value is indexed. One caveat: this assumes that num_value is equally distributed in the range 0..MAX(num_value). If your dataset strongly deviates from this assumption, you will get skewed results (some rows will appear more often than others).
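
A minimal sketch of the threshold technique with Python's sqlite3 module. The items table and the random_row helper are hypothetical. Two things differ from the query above and are choices of this sketch: the random threshold is drawn once in application code rather than inside the WHERE clause, and an ORDER BY is added so that the row returned is the one closest to the threshold.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY is indexed, so the >= lookup is an index seek.
conn.execute("CREATE TABLE items (num_value INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO items (num_value, label) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 1001)],
)


def random_row(conn, rng=random):
    """Return the first row whose num_value is at or above a random threshold."""
    max_value = conn.execute("SELECT MAX(num_value) FROM items").fetchone()[0]
    if max_value is None:
        return None  # empty table
    # One threshold per call, in the range [0, max_value).
    threshold = rng.random() * max_value
    return conn.execute(
        "SELECT num_value, label FROM items "
        "WHERE num_value >= ? ORDER BY num_value LIMIT 1",
        (threshold,),
    ).fetchone()


print(random_row(conn))
```

With gaps in num_value, the row just after a large gap is picked more often, which is the skew the caveat describes.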

Cd-MaN
A: 

I have to agree with CD-MaN: Using "ORDER BY RAND()" will work nicely for small tables or when you do your SELECT only a few times.

I also use the "num_value >= RAND() * ..." technique, and if I really want to have random results I have a special "random" column in the table that I update once a day or so. That single UPDATE run will take some time (especially because you'll have to have an index on that column), but it's much faster than creating random numbers for every row each time the select is run.

BlaM
+4  A: 

Found this by googling.

Select a random row with MySQL:

SELECT column FROM table
ORDER BY RAND()
LIMIT 1

Select a random row with PostgreSQL:

SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1

Select a random row with Microsoft SQL Server:

SELECT TOP 1 column FROM table
ORDER BY NEWID()

Select a random row with IBM DB2:

SELECT column, RAND() as IDX
FROM table
ORDER BY IDX FETCH FIRST 1 ROWS ONLY

Select a random record with Oracle:

SELECT column FROM
( SELECT column FROM table
ORDER BY dbms_random.value )
WHERE rownum = 1
cnu
All of those are very costly because they generate a result set as large as the table, sort it, and then return a single row.
Bill Karwin
+1  A: 

For SQL Server 2005 and 2008, if we want a random sample of individual rows (from Books Online):

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
santiiiii
A: 

Be careful because TableSample doesn't actually return a random sample of rows. It directs your query to look at a random sample of the 8KB pages that make up your table. Then, your query is executed against the data contained in these pages. Because of how data may be grouped on these pages (insertion order, etc.), this could lead to data that isn't actually a random sample.

See: http://www.mssqltips.com/tip.asp?tip=1308

This MSDN page for TableSample includes an example of how to generate an actually random sample of data.

http://msdn.microsoft.com/en-us/library/ms189108.aspx

Sean Turner
A: 

For SQL Server

newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.

TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).

For a better performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:

If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
              / CAST (0x7fffffff AS int)

The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1.

When run against a table with 1,000,000 rows, here are my results:

SET STATISTICS TIME ON
SET STATISTICS IO ON

/* newid()
   rows returned: 10000
   logical reads: 3359
   CPU time: 3312 ms
   elapsed time = 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()

/* TABLESAMPLE
   rows returned: 9269 (varies)
   logical reads: 32
   CPU time: 0 ms
   elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)

/* Filter
   rows returned: 9994 (varies)
   logical reads: 3359
   CPU time: 641 ms
   elapsed time: 627 ms
*/    
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float) 
              / CAST (0x7fffffff AS int)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be a last resort if you have a large result set.
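
The per-row filter idea can be tried outside SQL Server as well. Below is a minimal sketch with Python's sqlite3 module, where SQLite's random() takes the place of the CHECKSUM(NEWID(), ...) expression. The numbers table and the sample_about_one_percent helper are made up for the demo, and nothing here reproduces the timings above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (number INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO numbers (number) VALUES (?)",
    [(i,) for i in range(1, 100001)],
)


def sample_about_one_percent(conn):
    """Keep each row independently with a probability of roughly 1/100."""
    # random() is re-evaluated for every row the scan visits, so the
    # filter decides row by row; no sort is needed.
    return [
        row[0]
        for row in conn.execute(
            "SELECT number FROM numbers WHERE random() % 100 = 0"
        )
    ]


sample = sample_about_one_percent(conn)
print(len(sample), "rows sampled out of 100000")
```

As with the SQL Server filter, the number of rows returned varies from run to run, and the whole table is still scanned.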

Rob Boek
A: 

Most of the solutions here aim to avoid sorting, but they still need to make a sequential scan over a table.

There is also a way to avoid the sequential scan by switching to an index scan. If you know the index value of your random row, you can get the result almost instantly. The problem is how to guess an index value.

The following solution works on PostgreSQL 8.4:

explain analyze select * from cms_refs where rec_id in 
  (select (random()*(select last_value from cms_refs_rec_id_seq))::bigint 
   from generate_series(1,10))
  limit 1;

In the above solution you guess 10 different random index values from the range 0 .. [last value of id].

The number 10 is arbitrary - you may use 100 or 1000 as it (amazingly) doesn't have a big impact on the response time.

There is also one problem: if you have sparse ids you might miss. The solution is to have a backup plan :) In this case, a plain old order by random() query. When combined it looks like this:

explain analyze select * from cms_refs where rec_id in 
    (select (random()*(select last_value from cms_refs_rec_id_seq))::bigint 
     from generate_series(1,10))
    union all (select * from cms_refs order by random() limit 1)
    limit 1;

Note the UNION ALL clause. In this case, if the first part returns any data, the second one is NEVER executed!
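
A minimal sketch of the guess-then-fall-back idea using Python's sqlite3 module instead of PostgreSQL. The fallback is written as a second query in application code rather than a UNION ALL, and the guesses are drawn in Python rather than with generate_series. The demo data and the random_row helper are hypothetical; only the table and column names are borrowed from the queries above.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cms_refs (rec_id INTEGER PRIMARY KEY, title TEXT)")
# Sparse ids (every third value) so that some guesses miss.
conn.executemany(
    "INSERT INTO cms_refs (rec_id, title) VALUES (?, ?)",
    [(i, f"ref-{i}") for i in range(3, 3001, 3)],
)


def random_row(conn, guesses=10, rng=random):
    """Probe a few random ids via the index; fall back to a full sort on a miss."""
    max_id = conn.execute("SELECT MAX(rec_id) FROM cms_refs").fetchone()[0]
    if max_id is None:
        return None  # empty table
    candidates = [rng.randint(1, max_id) for _ in range(guesses)]
    placeholders = ",".join("?" * len(candidates))
    row = conn.execute(
        "SELECT rec_id, title FROM cms_refs "
        f"WHERE rec_id IN ({placeholders}) LIMIT 1",
        candidates,
    ).fetchone()
    if row is None:
        # Every guess landed in a gap: use the slow but reliable query.
        row = conn.execute(
            "SELECT rec_id, title FROM cms_refs ORDER BY random() LIMIT 1"
        ).fetchone()
    return row


print(random_row(conn))
```

When several guesses hit, which of them comes back is left to the database, so this is a sketch of the access pattern rather than a guarantee of uniformity.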

hegemon
A: 
SELECT * FROM table WHERE num_value >= RAND() * (SELECT MAX(num_value) FROM table) LIMIT 1

This doesn't seem to me to work correctly.

This query would calculate RAND() for each and every row, select the rows for which the condition (with different RAND() values) evaluates to TRUE, and then pick 1 row by some order (by id, for example). So the lower the ID of a row, the higher the probability that it is selected. In a test table with ~20k rows, the selected rows nearly always have IDs below 300-400.
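
The effect is easy to reproduce in a small sketch with Python's sqlite3 module. The items table is hypothetical, num_value is deliberately left unindexed so that SQLite scans the table and re-evaluates random() for each row, and other engines may evaluate their random function differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# num_value has no index, so the query below is a plain table scan.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, num_value INTEGER)")
conn.executemany(
    "INSERT INTO items (id, num_value) VALUES (?, ?)",
    [(i, i) for i in range(1, 1001)],
)


def first_match(conn):
    """Return the first scanned row that passes a per-row random test."""
    # abs(random() % 1000) is a fresh value between 0 and 999 for each
    # row, so low rows get many chances to pass before high rows are seen.
    row = conn.execute(
        "SELECT num_value FROM items "
        "WHERE num_value >= abs(random() % 1000) LIMIT 1"
    ).fetchone()
    return row[0]


draws = [first_match(conn) for _ in range(500)]
print("mean of picked values:", sum(draws) / len(draws))
```

A uniform pick over 1..1000 would average around 500; the per-row version lands far below that.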

yurique
+1  A: 

In late, but got here via Google, so for the sake of posterity, I'll add an alternative solution.

Another approach is to use TOP twice, with alternating orders. I don't know if it is "pure SQL", because it uses a variable in the TOP, but it works in SQL Server 2008. Here's an example I use against a table of dictionary words, if I want a random word.

SELECT TOP 1
  word
FROM (
  SELECT TOP(@idx)
    word 
  FROM
    dbo.DictionaryAbridged WITH(NOLOCK)
  ORDER BY
    word DESC
) AS D
ORDER BY
  word ASC

Of course, @idx is some randomly generated integer that ranges from 1 to COUNT(*) on the target table, inclusive. If your column is indexed, you'll benefit from it too. Another advantage is that you can use it in a function, since NEWID() is disallowed there.

Lastly, the above query runs in about 1/10 of the exec time of a NEWID()-type of query on the same table. YMMV.
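
A minimal sketch of the double-TOP idea using Python's sqlite3 module, with LIMIT in place of TOP. The dictionary table, its five words and the helper names word_at and random_word are made up for the demo; the random index is generated in Python, which plays the role of @idx.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dictionary (word TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO dictionary (word) VALUES (?)",
    [(w,) for w in ["apple", "banana", "cherry", "date", "elderberry"]],
)


def word_at(conn, idx):
    """Return the idx-th word counting back from the end of the alphabet."""
    row = conn.execute(
        "SELECT word FROM ("
        "  SELECT word FROM dictionary ORDER BY word DESC LIMIT ?"
        ") ORDER BY word ASC LIMIT 1",
        (idx,),
    ).fetchone()
    return row[0] if row else None


def random_word(conn, rng=random):
    """Pick a random position between 1 and COUNT(*) and fetch that word."""
    count = conn.execute("SELECT COUNT(*) FROM dictionary").fetchone()[0]
    if count == 0:
        return None
    return word_at(conn, rng.randint(1, count))


print(random_word(conn))
```

The inner query keeps the last idx words in descending order and the outer query takes the smallest of them, which is the word at position idx from the end.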

alphadogg
A: 

I'm using MS SQL server, SELECT TOP 1 * FROM some_table_name ORDER BY NEWID() worked great for me, thanks for the advice guys!

Will