I'm looking for a homegrown way to scramble production data for use in development and test. I've built a couple of scripts that generate random social security numbers, shift birth dates, scramble emails, etc. But I've hit a wall trying to scramble customer names. I want to keep real names so we can still use our searches, so random letter generation is out. What I have tried so far is building a temp table of all last names in the table, then updating the customer table with a random selection from the temp table. Like this:

DECLARE @Names TABLE (Id int IDENTITY(1,1),[Name] varchar(100))

/* Scramble the last names (randomly pick another last name) */
INSERT @Names SELECT LastName FROM Customer ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID] SET LastName=(SELECT [Name] FROM @Names WHERE ROWID=Id)

This worked well in test, but it completely bogs down on larger amounts of data (more than 20 minutes for 40K rows).

All of that to ask, how would you scramble customer names while keeping real names and the weight of the production data?

UPDATE: It never fails: you try to put all the information in the post, and you forget something important. This data will also be used in our sales and demo environments, which are publicly available. Some of the answers describe what I am attempting to do, which is to 'switch' the names, but my question is literally: how do I code this in T-SQL?

+1  A: 

A very simple solution would be to ROT13 the text.

A better question may be why you feel the need to scramble the data. If you have an encryption key, you could also consider running the text through DES or AES or similar. Those would have potential performance issues, however.
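As a rough illustration of what ROT13 does to a name, here is a minimal Python sketch (Python rather than T-SQL, purely so it can be run standalone). It relies on the standard library's `rot_13` codec; the `rot13` helper name and the sample names are made up for the example. Note that applying it twice returns the original text, and the output is no longer a real name.

```python
import codecs


def rot13(text: str) -> str:
    """Rotate each ASCII letter 13 places; other characters pass through unchanged."""
    return codecs.encode(text, "rot_13")


# Hypothetical sample value, not taken from any real table.
masked = rot13("Smith")
restored = rot13(masked)  # ROT13 is its own inverse
print(masked, restored)
```

Because the transform is its own inverse, anyone who recognizes it can undo it with the same call.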

warren
As I said, I need real names with similar/same weight as production so searches perform similarly.
Computer Chip
Additionally, ROT13 doesn't actually scramble the data, since it is an easily reversible algorithm...
Guvante
Yes, it's easily reversible, but it does meet the criteria of "mask" or "obfuscate": you at least need to recognize that it's been ROT13'd, and un-ROT it :)
warren
+3  A: 

I use generatedata. It is an open source PHP script that can generate all sorts of dummy data.

Peter Hoffmann
excellent tip - thank you. [It's one of those things that I've been meaning to write for years but never had time]...
Richard Harrison
A: 

Why not just use some sort of Random Name Generator?

Ryan
+2  A: 

Frankly, I'm not sure why this is needed. Your dev/test environments should be private, behind your firewall, and not accessible from the web.

Your developers should be trusted, and you have legal recourse against them if they fail to live up to your trust.

I think the real question should be "Should I scramble the data?", and the answer is (in my mind) 'no'.

If you're sending it offsite for some reason, or you have to have your environments web-accessible, or if you're paranoid, I would implement a random switch. Rather than build a temp table, run switches between each location and a random row in the table, swapping one piece of data at a time.

The end result will be a table with all the same data, but with it randomly reorganized. It should also be faster than your temp table, I believe.

It should be simple enough to implement the Fisher-Yates Shuffle in SQL...or at least in a console app that reads the db and writes to the target.

Edit (2): Off-the-cuff answer in T-SQL:

declare @name varchar(50)
set @name = (SELECT lastName from person where personID = (random id number))
Update person set lastname = @name WHERE personID = (person id of current row)

Wrap this in a loop, and follow the guidelines of Fisher-Yates for modifying the random value constraints, and you'll be set.
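For the console-app route, the shuffle itself might look something like the following Python sketch. It is only an illustration under assumptions: the `people` list stands in for rows read from the table, the `fisher_yates_shuffle` name is made up, and writing the result back to the database is left out. The key point is that the random index is drawn from a shrinking range and the two values are swapped, so every name in the input still appears exactly as often in the output.

```python
import random


def fisher_yates_shuffle(items, rng=None):
    """Return a shuffled copy of items using the Fisher-Yates algorithm."""
    if rng is None:
        rng = random.Random()
    shuffled = list(items)
    # Walk from the last index down to 1, swapping each slot with a
    # randomly chosen slot at or below it (0 <= j <= i).
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


# Hypothetical sample data: (personID, lastName) pairs as read from the table.
people = [(1, "Smith"), (2, "Jones"), (3, "Smith"), (4, "Nguyen"), (5, "Garcia")]
ids = [pid for pid, _ in people]
names = fisher_yates_shuffle([name for _, name in people], random.Random(42))

# Pair the untouched IDs with the shuffled names; these are the rows to write back.
scrambled = list(zip(ids, names))
print(scrambled)
```

Since the names are only permuted, frequent surnames stay frequent, which is what keeps search behaviour close to production.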

Jeff
Computer Chip
You could try the Fisher-Yates Shuffle (http://en.wikipedia.org/wiki/Fisher-Yates_shuffle). It should be simple enough to implement in SQL...or in a simple console app that reads in the db and writes to the target db.
Jeff
http://en.wikipedia.org/wiki/Fisher-Yates_shuffle That's the correct link, guess I have to learn more about the environment here ;) Updating my answer.
Jeff
+1  A: 

When doing something like that, I usually write a small program that first loads a lot of first names and surnames into two arrays, and then just updates the database using a random first name and surname from the arrays. It works really fast, even for very big datasets (200,000+ records).
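A minimal Python sketch of that idea follows. Everything concrete in it is an assumption made for illustration: the `FIRST_NAMES` and `LAST_NAMES` pools, the `Customer` columns, and the in-memory SQLite database standing in for the real server. Against SQL Server the same pattern would go through whatever driver you already use.

```python
import random
import sqlite3

# Hypothetical name pools; in practice these would be loaded from a file or lookup table.
FIRST_NAMES = ["Alice", "Bob", "Carol", "David"]
LAST_NAMES = ["Smith", "Jones", "Nguyen", "Garcia"]


def scramble_names(conn, rng):
    """Overwrite every customer's name with a random pick from the in-memory pools."""
    ids = [row[0] for row in conn.execute("SELECT Id FROM Customer")]
    updates = [(rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES), cid) for cid in ids]
    # Apply all updates through a single executemany call inside one transaction.
    conn.executemany(
        "UPDATE Customer SET FirstName = ?, LastName = ? WHERE Id = ?", updates
    )
    conn.commit()
    return len(updates)


# Stand-in database with 1000 identical placeholder rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customer (Id INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT)"
)
conn.executemany(
    "INSERT INTO Customer (FirstName, LastName) VALUES (?, ?)",
    [("Real", "Person")] * 1000,
)
updated = scramble_names(conn, random.Random(0))
print(updated)
```

Uniform picks from a small pool will not reproduce the name distribution of production data. If that matters, one option is to fill the pools from the production names themselves, duplicates included, so common names are drawn more often.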

Milan Babuškov
A: 

Use a temporary table instead of a table variable and the query is very fast. I just ran it on 60K rows in 4 seconds. I'll be using this one going forward.

CREATE TABLE #Names (Id int IDENTITY(1,1),[Name] varchar(100))

/* Scramble the last names (randomly pick another last name) */
INSERT #Names SELECT LastName FROM Customer ORDER BY NEWID();
WITH [Customer ORDERED BY ROWID] AS
(SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS ROWID, LastName FROM Customer)
UPDATE [Customer ORDERED BY ROWID] SET LastName=(SELECT [Name] FROM #Names WHERE ROWID=Id);

DROP TABLE #Names

You still could end up with a bad roll and have two...wait. NewID() makes UUIDs. I stand corrected.
Broam