views:

77

answers:

6

I'd like to go through a table and randomly wipe out some of the data. I'm doing some data randomizing, turning real first names into fake ones etc. Well one of the tables involved has a column that is null about 40% of the time. My name randomizer app could do a coin toss somewhere in it when it assigns new first names. But I'd prefer to just do it at the end: randomly prune some of the data.

I have this code, which doesn't work, but sure looks like it should to me:

Use MyDb
go 

CREATE VIEW vRandNumber 
AS 
SELECT RAND() as RandNumber 

go  

CREATE FUNCTION RandNumber() 
RETURNS float 
AS 
  BEGIN 
  RETURN (SELECT RandNumber FROM vRandNumber) 
  END 

go  

select dbo.RandNumber()

update names set nickname = null 
where ((select dbo.RandNumber()) > 0.5)

When I run the RandNumber function it's fine, plenty random. But when I do the update it's updating all the rows half the time and none of the rows the other half of the time.

I want it to update a random number of rows every time I run the script. I really thought that a function like RandNumber would be run once for every row in the table. Apparently not.

Is this possible without a loop and without a console app?

Edit: I also tried it with a couple of variations of RAND() in the where directly and got the same results.

A: 

RandNumber is a function. Functions in SQL must output the same result every time for the same input, unless underlying database data has changed. This is the mathematical definition of a function (as opposed to how a normal programming language treats a "function," which is more of a function-like construct).

Since the result of your function should never change during an update statement (which is an atomic operation) the query plan compiler for your query only calls RandNumber once and then caches the result.

You may be able to get away with just referencing RAND directly in your query, but if that still doesn't work, you'll have to do this iteratively in a stored procedure.

David Pfeffer
OK, but I've heard that when you do a getdate() as part of a where in a select that getdate() will run once for every row in the select. Is that not correct?
jcollum
In MS-SQL Functions can be deterministic (as you described) or non-deterministic see http://msdn.microsoft.com/en-us/library/aa214775%28SQL.80%29.aspx the RAND function is non deterministic.
David Waters
In this case, nearly every function will be run deterministically because `update` is atomic. Therefore, even if a normally non-deterministic function is called, the schema-bound nature of it is negated by the fact that the underlying data cannot change during the query execution. I think this **has** to be done iteratively unless you can force the query plan compiler to treat the function as non-deterministic despite the fact that it won't want to do so.
David Pfeffer
@David: excellent explanation, thanks.
jcollum
A: 

How about

update names set nickname = null
where  abs(checksum(nickname) % 2) = 0
Hogan
Nope, I ran the update with your where and it updated all the rows every time I ran it.
jcollum
@jcollumn, how does the odd/even one do?
Hogan
@jcollumn - ok I tested this one, it will work... warning null name or "constant" names will all do the same thing so it is not totally random.
Hogan
actually that works well because the nickname column is being randomized somewhere else, using C# code so it'll be more random
jcollum
yes I saw that in your question -- as @Quassnoi points out you can use newid() to get a random string and checksum that, but given your dataset this is faster.
Hogan
A: 

try something like this:

WHERE DATEPART(ms,CreateDate)>500

where "CreateDate" is a column already in the table that has an actual date and time in it. Tthe millisecond should be fairly random

EDIT here's another approach:

DECLARE @YourTable table (RowID int, RowValue varchar(5))
INSERT INTO @YourTable VALUES (1,'one')
INSERT INTO @YourTable VALUES (2,'two')
INSERT INTO @YourTable VALUES (3,'three')

SELECT 
    RAND(row_number() over(order by RowID)+DATEPART(ms,GETDATE())),* 
    FROM @YourTable

OUTPUT run 1:

                       RowID       RowValue
---------------------- ----------- --------
0.716200609189072      1           one
0.71621924216033       2           two
0.716237875131588      3           three

(3 row(s) affected)

OUTPUT run 2:

                       RowID       RowValue
---------------------- ----------- --------
0.727007732518828      1           one
0.727026365490086      2           two
0.727044998461344      3           three

(3 row(s) affected)
KM
+4  A: 

Assuming your Names table has a primary key field called Id, this will nullify the nickname in a random 50 percent of the rows:

update dbo.Names set Nickname = null where Id in
(
    select top 50 percent id from dbo.Names order by NEWID()
)
Ian Nelson
This works but is inefficient -- generating a new GUID for each row will take a while longer than generating a single integer for the row.
David Pfeffer
Pfft, c'mon! Is super performance really an issue for this kind of use case? Talk about premature optimization.I've just tried this approach on a table with 182,770 rows and it ran in 11 seconds.
Ian Nelson
Well, it's close. But it will always update the same number of rows. I think I need a loop to update a random number of rows.
jcollum
Yeah, the performance is fine, I got 655k rows in 25s. This isn't the kind of thing that will be in prod code.
jcollum
OK, so add a line or two beforehand to randomly determine how many rows to update, between some range, and use that absolute figure instead of the "50 percent".
Ian Nelson
I still think its the best solution and I still upvoted it. I'm just pointing out there's probably a better way.
David Pfeffer
I don't think it's entirely what I want, but it does the job. I tried doing this the right way (getting all the ids, then wiping data in each row based on a coin toss, one at a time) and the performance was awful.
jcollum
A: 

RAND() (and GetDate/CURRENT_TIMESTAMP) is evaluated once per statement. You need someway to get around that. One way is (if you have a convenient row valued integer, e.g. an ID column), is to call RAND(ID) instead.

Damien_The_Unbeliever
Still only called once.
Hogan
Um, no. RAND(ID), where ID varies on a per-row basis, is called once per row.
Damien_The_Unbeliever
A: 

RAND() persists within a query.

SELECT  RAND()
FROM    names

will give you a set of equal numbers.

You need to do something like this:

WITH    q AS
        (
        SELECT  *,
                ABS(CHECKSUM(NEWID())) % 2 AS r
        FROM    names
        )
UPDATE  q
SET     nickname = NULL
WHERE   r = 0
Quassnoi