I've got a SQL Server table with about 50,000 rows in it. I want to select about 5,000 of those rows at random. I've thought of a complicated way, creating a temp table with a "random number" column, copying my table into that, looping through the temp table and updating each row with RAND(), and then selecting from that table where the random number column < 0.1. I'm looking for a simpler way to do it, in a single statement if possible.
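For reference, that complicated version would look roughly like this (a sketch, assuming a source table MyTable with an int primary key Id; the row-by-row loop matters because SQL Server evaluates RAND() only once per statement, so a single set-based UPDATE would give every row the same value):

CREATE TABLE #Randomized (Id int PRIMARY KEY, RandomNumber float);

INSERT INTO #Randomized (Id, RandomNumber)
SELECT Id, 0 FROM MyTable;

-- update row by row so each row gets its own RAND() value
DECLARE @Id int;
DECLARE c CURSOR LOCAL STATIC FOR SELECT Id FROM #Randomized;
OPEN c;
FETCH NEXT FROM c INTO @Id;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE #Randomized SET RandomNumber = RAND() WHERE Id = @Id;
    FETCH NEXT FROM c INTO @Id;
END;
CLOSE c;
DEALLOCATE c;

SELECT t.*
FROM MyTable t
JOIN #Randomized r ON r.Id = t.Id
WHERE r.RandomNumber < 0.1;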

This article suggests using the NEWID() function. That looks promising, but I can't see how I could reliably select a certain percentage of rows.

Anybody ever do this before? Any ideas?

+18  A: 
select top 10 percent * from [yourtable] order by newid()

In response to the "pure trash" comment concerning large tables, you could do it like this as a performance alternative.

select  * from [yourtable] where [yourPk] in 
(select top 10 percent [yourPk] from [yourtable] order by newid())

The cost of this will be the key scan of values plus the join cost, which on a large table with a small percentage selection should be reasonable.

Ralph Shillington
I like this approach much better than using the article he referenced.
JoshBerke
I've used this approach many times. Works like a charm!
Chuck Conway
Beautiful! Thanks a lot.
John M Gant
Simple and elegant.
ichiban
Simple, elegant, but it's pure trash when you have millions of rows or more. You will need a different approach if you have 50 million rows or more. You can use the primary key index to look up some rows (randomly) until you get enough results (sketched below).
Andrei Rinea
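A sketch of that keyed-lookup idea (assumed table MyTable with an int primary key Id; note the result is only approximately uniform if the key has gaps):

DECLARE @min int, @max int, @target int;
SELECT @min = MIN(Id), @max = MAX(Id) FROM MyTable;

DECLARE @picked TABLE (Id int PRIMARY KEY);

WHILE (SELECT COUNT(*) FROM @picked) < 5000
BEGIN
    -- random key in [@min, @max]; CHECKSUM(NEWID()) yields a fresh value per call
    SET @target = @min + ABS(CHECKSUM(NEWID()) % (@max - @min + 1));

    -- take the nearest existing, not-yet-picked row at or above the target
    INSERT INTO @picked (Id)
    SELECT TOP (1) Id
    FROM MyTable
    WHERE Id >= @target
      AND Id NOT IN (SELECT Id FROM @picked)
    ORDER BY Id;
END;

SELECT t.*
FROM MyTable t
JOIN @picked p ON p.Id = t.Id;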
+1  A: 
SELECT `PRIMARY_KEY`, rand() FROM table ORDER BY rand() LIMIT 5000;
Autocracy
This will not work. Since the select statement is atomic, it only grabs one random number and duplicates it for each row. You would have to reseed it on each row to force it to change (illustrated below).
Tom H.
Mmm... love vendor differences. Select is atomic on MySQL, but I suppose in a different way. This will work in MySQL.
Autocracy
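A quick way to see the SQL Server behaviour Tom describes, using an assumed table name MyTable:

-- RAND() is evaluated once per query, so every row gets the same value...
SELECT TOP (5) RAND() AS r FROM MyTable;
-- ...while NEWID() is evaluated per row, which is why ORDER BY NEWID() shuffles
SELECT TOP (5) NEWID() AS id FROM MyTable;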
+1  A: 

Just order the table by a random number and obtain the first 5,000 rows using TOP.

SELECT TOP 5000 * FROM [Table] ORDER BY newid();

UPDATE

Just tried it and a newid() call is sufficient - no need for all the casts and all the math.

Daniel Brückner
+7  A: 

Depending on your needs, TABLESAMPLE will get you a nearly-as-random result with better performance. This is available on MS SQL Server 2005 and later.

TABLESAMPLE will return data from random pages instead of random rows, and therefore it does not even retrieve data that it will not return.

On a very large table I tested

select top 1 percent * from [tablename] order by newid()

which took more than 20 minutes, while

select * from roi tablesample(1 percent)

took 2 minutes.

Performance will also improve on smaller samples with TABLESAMPLE, whereas with newid() it will not.

Please keep in mind that this is not as random as the newid() method, but it will give you a decent sampling.

See the MSDN page here: http://msdn.microsoft.com/en-us/library/ms189108.aspx
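TABLESAMPLE returns a variable number of rows; if you need an exact count, one common trick (a sketch, table name assumed) is to oversample at the page level and trim with TOP:

SELECT TOP (5000) *
FROM [yourtable] TABLESAMPLE (15 PERCENT);

The syntax TABLESAMPLE (5000 ROWS) also exists, but the row count it returns is still approximate.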

Patrick Taylor
Excellent suggestion. I never heard of that function. Thanks.
John M Gant
+1  A: 

My first answer:)

I came across this query which shows how to do it category-wise too:

http://www.sqlservercurry.com/2009/05/delete-random-records-from-table-using.html
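A sketch of the same per-category idea applied to a select rather than a delete (assumed table MyTable with a Category column; ROW_NUMBER is available on SQL Server 2005 and later):

WITH Ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Category ORDER BY NEWID()) AS rn
    FROM MyTable
)
SELECT *
FROM Ranked
WHERE rn <= 100; -- 100 random rows from each category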

Musa
Another good idea. Thanks.
John M Gant
+6  A: 

newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.

TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).

For a better-performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Result Sets by Using TABLESAMPLE:

If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
              / CAST (0x7fffffff AS int)

The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.

When run against a table with 1,000,000 rows, here are my results:

SET STATISTICS TIME ON
SET STATISTICS IO ON

/* newid()
   rows returned: 10000
   logical reads: 3359
   CPU time: 3312 ms
   elapsed time = 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()

/* TABLESAMPLE
   rows returned: 9269 (varies)
   logical reads: 32
   CPU time: 0 ms
   elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)

/* Filter
   rows returned: 9994 (varies)
   logical reads: 3359
   CPU time: 641 ms
   elapsed time: 627 ms
*/    
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float) 
              / CAST (0x7fffffff AS int)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise, use the newid()/filter method. newid()/order by should be a last resort if you have a large result set.
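To hit the question's target of exactly 5,000 rows with the filter method, one option (a sketch, not from the Books Online article) is to oversample slightly and trim with TOP:

SELECT TOP (5000) *
FROM [yourtable]
WHERE 0.12 >= CAST(CHECKSUM(NEWID(), [yourPk]) & 0x7fffffff AS float)
              / CAST(0x7fffffff AS int);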

Rob Boek
A: 

ORDER BY NEWID() should be sufficient, but you must keep an eye on the performance of the query. See if this helps: Selecting Random Records from a Table

Salman A