views:

90

answers:

3

A search of SO yields many results describing how to select random rows of data from a database table. My requirement is a bit different, though, in that I'd like to select individual columns from across random rows in the most efficient/random/interesting way possible.

To better illustrate: I have a large Customers table, and from that I'd like to generate a bunch of fictitious demo Customer records that aren't real people. I'm thinking of just querying randomly from the Customers table, and then randomly pairing FirstNames with LastNames, Address, City, State, etc.

So if this is my real Customer data (simplified):

FirstName  LastName  State  
==========================
Sally      Simpson   SD
Will       Warren    WI    
Mike       Malone    MN
Kelly      Kline     KS

Then I'd generate several records that look like this:

FirstName  LastName  State  
==========================
Sally      Warren    MN
Kelly      Malone    SD

Etc.

My initial approach works, but it lacks the elegance that I'm hoping the final answer will provide. (I'm particularly unhappy with the repetitiveness of the subqueries, and the fact that this solution requires a known/fixed number of fields and therefore isn't reusable.)

SELECT 
FirstName = (SELECT TOP 1 FirstName FROM Customer ORDER BY newid()),
LastName= (SELECT TOP 1 LastNameFROM Customer ORDER BY newid()),
State = (SELECT TOP 1 State FROM Customer ORDER BY newid())

Thanks!

+1  A: 

You could select the top N random rows at once (where N=3 is the number of columns), and then take column 1 from row 1, column 2 from row 2, etc. I'm not sure exactly how to do that last step in SQL, but if you're willing to do the last step in some other language I'm sure it would be simple.

Also, by selecting N rows at once you would have the new property that you would never be selecting two columns from the same row (though this could cause trouble if there are more columns than rows).

mathmike
+1 I like the additional assurance that no real First/Last names would be paired with this technique.
LesterDove
+3  A: 

ORDER BY NEWID() works with ROW_NUMBER in SQL Server 2008. Not sure about SQL Server 2005,

This is needed to generate values to join the 3 separate queries: it's slightly counter intuitive because you'd think it would always take the first 100 rows in a different order but it doesn't...

;With F AS
(
  SELECT TOP 100
    FirstName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS Foo
  FROM Customer
), L AS
(
  SELECT TOP 100
    LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS Foo
  FROM Customer
), S AS
(
  SELECT TOP 100
    State, ROW_NUMBER() OVER (ORDER BY NEWID()) AS Foo
  FROM Customer
)
SELECT
   F.FirstName, L.LastName, S.State
FROM
   F
   JOIN L ON F.Foo = L.Foo
   JOIN S ON F.Foo = S.Foo
gbn
+1 Very nice use of NEWID and ROW_NUMBER
Meff
Something new; thanks. And yes it worked in 2005
LesterDove
A: 

It seems to me that you are actually trying to generate random data -- the fact that you already have a bunch that is non-random is really just a side note. If I were in your shoes, I would look at generating random customers by choosing random words from the dictionary to use as FName, LName, City, etc. That seems easier and more random anyway.

MJB
Thank you - that would indeed take care of the 'randonmess' issue. It's for a sample report, so I don't think a customer named Mr. Pogostick Zucchini would be appropriate. I suppose what I want is randomized customer data, to put it a better way. This is a deficiency in my original post, sorry.
LesterDove
Gotcha. Now I understand.
MJB