tags:

views: 26

answers: 2

We're inheriting a project at work from another office that has closed down. The production database is around 150 GB, and we're shying away from copying it to four dev machines to work from. Are there any scripts, utilities, or suggestions on how we can go about capturing a small subset of this data, say 5%, to work with in development, while maintaining the integrity of the relationships, key tables, etc.?

I guess what I mean by that last part is that if I had an orders table of 500 rows and took a random sampling of 25 rows, I would need to make sure that the 5% of products I took from the products table included any products needed to satisfy those orders, exceeding 5% if necessary.
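To make that concrete, here's a minimal sketch of the idea using SQLite and an invented two-table schema (the real database's tables and sizes will differ): sample 5% of the orders, then pull in *every* product those orders reference, however many that turns out to be.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders   (id INTEGER PRIMARY KEY,
                       product_id INTEGER REFERENCES products(id));
""")
cur.executemany("INSERT INTO products VALUES (?, ?)",
                [(i, f"product {i}") for i in range(1, 101)])
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(i, random.randint(1, 100)) for i in range(1, 501)])

# 1. Random 5% sample of orders (25 of 500 rows).
cur.execute("SELECT id, product_id FROM orders ORDER BY RANDOM() LIMIT 25")
sampled_orders = cur.fetchall()

# 2. Pull every product those orders reference -- this set is allowed
#    to exceed 5% of the products table if the orders demand it.
needed = {pid for _, pid in sampled_orders}
ph = ",".join("?" * len(needed))
cur.execute(f"SELECT id, name FROM products WHERE id IN ({ph})", sorted(needed))
sampled_products = cur.fetchall()
```

The key point is that the products table isn't sampled independently at all; its subset is entirely *derived* from the sampled orders, which is what keeps the foreign keys intact.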

I hope I explained that well enough. Anyone have any thoughts?

+1  A: 

At the risk of sounding like a pimp for third party products, have you thought about using a product like Hyperbac's? It allows you to restore the database onto your dev machine, but in a compressed - but performant - manner.

It's Hyperbac Online that is probably most relevant:

http://www.hyperbac.com/online/overview.asp

Peter Schofield
Thanks for the suggestion. Don't worry about pimping a product; I had anticipated getting suggestions for tools, which is perfectly fine. :)
WesleyJohnson
+1  A: 

I suppose the first step would be to map out what the dependencies / relationships between tables are, and how you find all the dependencies of a given row in a given table.

Once you've done that, you can just take a random sample of one of your high-level tables (e.g. "Customers") and recursively fetch any dependent rows from dependent tables.

Rinse and repeat for any tables that didn't appear in the "dependency hierarchy" of the first table you chose, until you have a sample from every table.
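The recursive fetch above could be sketched roughly like this. This assumes SQLite purely for illustration, where `PRAGMA foreign_key_list` exposes each table's foreign keys; on SQL Server you'd read `sys.foreign_keys` instead. The schema and IDs are invented, and the sketch identifies rows rather than copying them:

```python
import sqlite3

def fetch_with_dependencies(cur, table, ids, seen):
    """Record `ids` from `table`, then recursively record every row
    they reference through a foreign key."""
    new = set(ids) - seen.setdefault(table, set())
    if not new:
        return
    seen[table] |= new
    # Each row: (id, seq, parent_table, from_col, to_col, ...).
    cur.execute(f"PRAGMA foreign_key_list({table})")
    for _id, _seq, parent, from_col, to_col, *_ in cur.fetchall():
        ph = ",".join("?" * len(new))
        cur.execute(f"SELECT {from_col} FROM {table} WHERE rowid IN ({ph})",
                    sorted(new))
        refs = {v for (v,) in cur.fetchall() if v is not None}
        if refs:
            ph = ",".join("?" * len(refs))
            cur.execute(f"SELECT rowid FROM {parent} WHERE {to_col} IN ({ph})",
                        sorted(refs))
            fetch_with_dependencies(cur, parent,
                                    {r for (r,) in cur.fetchall()}, seen)

# Toy schema: orders depend on both customers and products.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders    (id INTEGER PRIMARY KEY,
                        customer_id INTEGER REFERENCES customers(id),
                        product_id  INTEGER REFERENCES products(id));
INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
INSERT INTO products  VALUES (10, 'widget'), (11, 'gadget'), (12, 'sprocket');
INSERT INTO orders    VALUES (100, 1, 10), (101, 2, 11), (102, 1, 11);
""")

seen = {}
fetch_with_dependencies(cur, "orders", {100, 102}, seen)
# seen now names exactly the rows a consistent subset must contain.
```

A real script would also have to handle self-referencing tables, composite keys, and cycles between tables, which is part of why a generic tool is hard to write.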

There certainly isn't going to be a generic script to do this, but I'd say that time spent mapping out the dependencies in the database in this way is time well spent understanding the structure of the database.


Tbh I'd probably do the reverse instead - empty the database and add records to the relevant tables as you find it necessary. There isn't really any need for developers to always run against a representative sample of the data, and you should make sure you regularly test against the full data set anyway, just in case the 95% of the database that's left behind contains the rows that cause problems.

Kragen
Thanks Kragen, I had imagined it would come down to something like this, and I had initially wanted to avoid it. But, as you've mentioned, the insight I'll gain into how the database works by doing this is definitely a benefit. We'll probably go this route.
WesleyJohnson