views:

1448

answers:

4

I'm trying to select a random 10% sampling from a small table. I thought I'd just use the RAND() function and select those rows where the random number is less than 0.10:

SELECT * FROM SomeTable WHERE SomeColumn='SomeCondition' AND RAND() < 0.10

But I soon discovered that RAND() always returns the same number! Reminds me of this xkcd cartoon.

OK, no problem, the RAND function takes a seed value. I will be running this query periodically, and I want it to give different results if I run it on a different day, so I seed it with a combination of the date and a unique row ID:

SELECT * FROM SomeTable WHERE SomeColumn='SomeCondition' AND RAND(CAST(GETDATE) AS INTEGER) + RowID) < 0.10

I still don't get any results! When I show the random numbers returned by RAND, I discover that they're all within a narrow range. It appears that getting a random number from RAND requires you to use a random seed. If I had a random seed in the first place, I wouldn't need a random number!

I've seen the previous discussions related to this problem:
http://stackoverflow.com/questions/52964/sql-server-random-sort
http://stackoverflow.com/questions/19412/how-to-request-a-random-row-in-sql

They don't help me. TABLESAMPLE works at the page level, which is great for a big table but not for a small one, and it looks like it applies prior to the WHERE clause. TOP with NEWID doesn't work because I don't know ahead of time how many rows I want.

Anybody have a solution, or at least a hint?

Edit: Thanks to AlexCuse for a solution which works for my particular case. Now to the larger question, how to make RAND behave?

+2  A: 

If your table has a numeric (numeric in the general sense, like integer, floating point or SQL numeric!) column (perhaps even the rowid column), please try the following:

SELECT * FROM SomeTable WHERE SomeColumn='SomeCondition' AND 0*rowid+RAND() < 0.10

in order to evaluate RAND() once for every row, not once at the start of your query.

The query optimizer is to blame. Perhaps there is another way, but I believe this will work for you.

ΤΖΩΤΖΙΟΥ
That method isn't working for me. I only have varchar and int, and I don't understand why int would behave different than numeric.
Mark Ransom
If it's the optimizer then it's very much like the cartoon!
Ken
Perhaps it's my fault, but saying "numeric" I meant any numeric type (integer, shortint, float, "SQL" numeric etc). Check my answer again, please.
ΤΖΩΤΖΙΟΥ
+6  A: 

This type of approach (shown by ΤΖΩΤΖΙΟΥ) will not guarantee a 10% sampling. It will only give you all rows where Rand() is evaluated to < .10 which will not be consistent.

Something like

select top 10 percent * from MyTable order by NEWID()

will do the trick.

edit: there is not really a good way to make RAND behave. This is what I've used in the past (kludge alert - it kills you not being able to use Rand() in a UDF)

CREATE VIEW RandView AS 

SELECT RAND() AS Val

GO

CREATE FUNCTION RandomFloat()
RETURNS FLOAT
AS
BEGIN

RETURN (SELECT Val FROM RandView)

END

Then you just have select blah, dbo.RandomFloat() from table in your query.

AlexCuse
An approximation to 10% was good enough for me, but your answer solves my immediate problem nicely. I should have thought to check for a PERCENT clause on TOP.
Mark Ransom
A: 

Did you see this question?

http://stackoverflow.com/questions/94906/how-do-i-return-random-numbers-as-a-column-in-sql-server-2005

Adam posted a UDF you can use in place of Rand() that works much better.

Joshua Carmody
A: 

This seems to work:

select * from SomeTable
where rand(0*SomeTableID + cast(cast(newid() as binary(4)) as int)) <= 0.10
Jason DeFontes