views: 196

answers: 4
A Windows Forms application of ours pulls records from a view on SQL Server through ADO.NET and a SOAP web service, displaying them in a data grid. We have had several cases with ~25,000 rows, which works relatively smoothly, but a potential customer needs many times that many rows in a single list.

To figure out how well we scale right now, and how (and how far) we can realistically improve, I'd like to implement a simulation: instead of displaying actual data, have the SQL Server send fictional, random data. The client and transport side would be mostly the same; the view (or at least the underlying table) would of course work differently. The user specifies the number of fictional rows (e.g. 100,000).

For the time being, I just want to know how long it takes for the client to retrieve and process the data, up to the point where it is just about ready to display it.

What I'm trying to figure out is this: how do I make the SQL Server send such data?

Do I:

  1. Create a stored procedure that has to be run beforehand to fill an actual table?
  2. Create a function that I point the view to, thus having the server generate the data 'live'?
  3. Somehow replicate and/or randomize existing data?

The first option sounds to me like it would yield the results closest to the real world. Because the data is actually 'physically there', the SELECT query would perform quite similarly to one on real data. However, it taxes the server with an otherwise meaningless operation. The fake data would also be backed up, as it would live in the same database as the real data, unless, of course, I delete the fake data after each benchmark run.

The second and third options tax the server while running the actual simulation, thus potentially giving unrealistically slow results.


In addition, I'm unsure how to create those rows, short of using a loop or cursor. I can use SELECT top <n> random1(), random2(), […] FROM foo if foo actually happens to have <n> entries, but otherwise I'll (obviously) only get as many rows as foo happens to have. A GROUP BY newid() or similar doesn't appear to do the trick.
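
One set-based possibility might be to cross join a large catalog view with itself, so that the FROM clause yields far more rows than needed, and then take TOP (n) with per-row random expressions; I haven't verified this, and the column names below are just placeholders:

DECLARE @rowCount int;
SET @rowCount = 100000;

SELECT TOP (@rowCount)
    NEWID()                          AS RandomGuid,  -- different value on every row
    ABS(CHECKSUM(NEWID())) % 100000  AS RandomInt    -- pseudo-random integer 0-99999
FROM sys.all_columns AS a
CROSS JOIN sys.all_columns AS b;                     -- far more than 100,000 combinations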

A: 

If you want results you can rely on, you need to make the testing scenario as realistic as possible, which makes option 1 by far your best bet. As you point out, if you get results that aren't good enough with the other options, you won't be sure whether that was due to the different database behaviour.

How you generate the data will depend to a large degree on the problem domain. Can you take data sets from multiple customers and merge them into a single mega-dataset? If the data is time series then maybe it can be duplicated over a different range.
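
As a rough sketch of that duplication idea (the table and column names here are placeholders, not taken from your schema), you could copy the existing rows with the date column shifted so the duplicates land in a different range:

-- Placeholder names: actualData is the real table, benchmark the copy target,
-- CreatedOn the time-series column; each duplicate is shifted back by one year.
INSERT INTO benchmark
SELECT DATEADD(year, -1, CreatedOn), OtherColumn1, OtherColumn2
FROM actualData;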

Rob Walker
A: 

The data is typically CRM-like, i.e. contacts, projects, etc. It would be fine to simply duplicate the data (e.g., if I only have 20,000 rows, I'll copy them five times to get my desired 100,000 rows). Merging, on the other hand, would only work if we never deploy the benchmarking tool publicly, for obvious privacy reasons (unless, of course, I apply a function to each column that renders the original data irreversibly unintelligible, similar to a hashing function, but without changing the value's size too much).
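
Something along these lines is roughly what I have in mind for the obfuscation (the column name is a placeholder, and the binary-to-hex CONVERT style needs SQL Server 2008 or later, so on 2005 an equivalent hex-conversion helper would be required; a sketch only):

-- Replace each value with an irreversible digest, trimmed to roughly the original length.
UPDATE benchmark
SET ContactName = LEFT(CONVERT(varchar(64), HASHBYTES('MD5', ContactName), 2),
                       LEN(ContactName));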

To populate the rows, perhaps something like this would do:

-- Keep copying rows from the real table until the benchmark table holds at least
-- 100,000 of them; the last pass may overshoot by up to one full copy of actualData.
WHILE (SELECT COUNT(*) FROM benchmark) < 100000
  INSERT INTO benchmark
  SELECT TOP 100000 * FROM actualData;
Sören Kuklau
+2  A: 

For test data for CRM-type tables, I highly recommend fakenamegenerator.com; you can get 40,000 fake names for free.

SqlACID
+1  A: 

You didn't mention whether you're using SQL Server 2008. If you use 2008 and you use Data Compression, be aware that random data will behave very differently from (and slower than) real data, because random data is much harder to compress.
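
For reference, on 2008 enabling compression on the benchmark table (using the table name from earlier in the thread) is a one-liner; consider it a sketch, since your exact setup may differ:

-- SQL Server 2008 only: page-compress the table before re-running the benchmark.
ALTER TABLE benchmark REBUILD WITH (DATA_COMPRESSION = PAGE);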

Quest Toad for SQL Server and Microsoft Visual Studio Data Dude both have test data generators that will put fake "real" data into records for you.

Brent Ozar
Good point. I've re-tagged my post; we're still using 2005.
Sören Kuklau