So I've been given a project and have gotten the db team sold on source control for the db (weird, right?). Anyway, the db already exists, it is massive, and the application is very dependent on the data. The developers need up to three different flavors of the data to work against when writing SPROCs and so on.

Obviously I could script out data inserts.

But my question is what tools or strategies do you use to build a db from source control and populate it with multiple large sets of data?

+2  A: 

Good to see you put your database under source control.

We have our database objects in source control but not data (except for some lookup values). To maintain the data on dev, we refresh it by restoring the latest prod backup, then rerunning the scripts for any database changes. If what we're doing requires special data (say, new lookup values that aren't on prod, or test logins), we have a script for that as well, which is part of source control and which is run at the same time. You wouldn't want to script out all the data, though, as it would be very time-consuming to recreate 10 million records through a script (and if you have 10 million records, you certainly don't want developers developing against a database with ten test records!). Restoring prod data is much faster.
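
A minimal sketch of that refresh cycle, assuming SQL Server; the database names, file paths, and script names below are purely illustrative:

    -- Hypothetical refresh of the dev database from the latest prod backup.
    -- Database names, paths, and logical file names are placeholders.
    USE master;
    GO

    ALTER DATABASE DevDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    GO

    RESTORE DATABASE DevDB
    FROM DISK = N'\\backupshare\ProdDB_latest.bak'
    WITH MOVE N'ProdDB_Data' TO N'D:\Data\DevDB.mdf',
         MOVE N'ProdDB_Log'  TO N'L:\Logs\DevDB.ldf',
         REPLACE;
    GO

    ALTER DATABASE DevDB SET MULTI_USER;
    GO

    -- Then re-run the source-controlled change scripts, e.g.:
    -- sqlcmd -S DevServer -d DevDB -i 001_add_new_lookup_values.sql
    -- sqlcmd -S DevServer -d DevDB -i 002_create_test_logins.sql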

Since all our deployments are done only through source-controlled scripts, we don't have issues getting people to script what they need. Hopefully you won't either. When we first started (back when devs could do their own deployments to prod), we had to actually go through a few times and delete any objects that weren't in source control. We learned very quickly to put all db objects in source control.
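
If you ever need to do that kind of audit, one way to enumerate the user objects so they can be compared against what is in source control (nothing here is specific to any particular schema):

    -- List user-created objects; compare the output against the scripts in source control.
    SELECT s.name AS schema_name,
           o.name AS object_name,
           o.type_desc,
           o.modify_date
    FROM sys.objects AS o
    JOIN sys.schemas AS s ON s.schema_id = o.schema_id
    WHERE o.is_ms_shipped = 0
      AND o.type IN ('U', 'V', 'P', 'FN', 'IF', 'TF')  -- tables, views, procs, functions
    ORDER BY s.name, o.name;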

HLGEM
Depending upon what our application does and what jurisdiction we operate under, using live data for development or testing purposes can put us in breach of data protection laws and/or cause various compliance issues.
APC
Well, in that case, yes, you may have to create scripts to fill in the tables. But you still want them to fill in roughly the same number of records as prod; otherwise you will have queries that run fine for the developer but time out on prod. I'd consider creating a test database that is filled with the number of records you need (but not real data) and then backing it up and restoring it (a restore is much faster than inserting records), rather than restoring prod. I think there are some programs which can take real data and disguise it for test and dev databases.
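
A rough sketch of padding one table out to production-like volume with a cross-joined numbers trick; the table and columns are invented:

    -- Hypothetical example: fill dbo.Customer with ~1 million fake rows so that
    -- query plans and timings resemble production.
    WITH n AS (
        SELECT TOP (1000000)
               CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS int) AS rn
        FROM sys.all_objects a
        CROSS JOIN sys.all_objects b
    )
    INSERT INTO dbo.Customer (CustomerName, Region, CreatedDate)
    SELECT 'Customer ' + CAST(rn AS varchar(10)),
           CHAR(65 + rn % 26),                    -- fake region code A-Z
           DATEADD(DAY, -(rn % 3650), GETDATE())  -- spread dates over ~10 years
    FROM n;
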
HLGEM
+2  A: 

Usually, we only put in source control the .sql files for (re-)building the schema.

Then we put in source control a script able to read a production or integration database, in order to extract and populate a relevant set of data into the database resulting from the previous .sql execution.

The idea is to get the most recent data with a script robust enough to read it from a database which is not always at the same version as the one being built. (In reality, though, the difference is never that big, and the data can easily be read.)
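
A rough sketch of what such an extraction script can look like, assuming a linked server named PRODSRV and invented table names (the real script is of course tailored to the schema):

    -- Hypothetical extraction: pull a coherent, recent slice of data from the
    -- source database into the schema that was just built by the .sql scripts.
    -- Explicit column lists keep the script working even if the source database
    -- has gained extra columns since this script was written.

    -- Parent rows first, so foreign keys are satisfied:
    INSERT INTO dbo.Customers (CustomerId, CustomerName, Region)
    SELECT DISTINCT c.CustomerId, c.CustomerName, c.Region
    FROM PRODSRV.ProdDB.dbo.Customers AS c
    JOIN PRODSRV.ProdDB.dbo.Orders AS o ON o.CustomerId = c.CustomerId
    WHERE o.OrderDate >= DATEADD(MONTH, -3, GETDATE());

    -- Then the child rows for the same window:
    INSERT INTO dbo.Orders (OrderId, CustomerId, OrderDate, Status)
    SELECT o.OrderId, o.CustomerId, o.OrderDate, o.Status
    FROM PRODSRV.ProdDB.dbo.Orders AS o
    WHERE o.OrderDate >= DATEADD(MONTH, -3, GETDATE());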

VonC
A: 

Have a look at Visual Studio Team System, Database Edition - either the 2008 version with the GDR 2 download, or 2010. It can handle schema versioning with full integration into source control, and it can handle test data generation (random names, etc.).

I personally like it - I do development using Management Studio, then fire up Visual Studio and sync the changes down to the project, from where they then get synced up to production.

I use that for my development. I do not script out production data - my main database is about 300 GB right now, and I have a table approaching half a billion rows. I have a development server onto which a copy of the data is sometimes loaded when needed. Developers work against small test data or the dev server (not many people here).

Initial data is maintained by stored procedures or special upload/validation scripts that run as part of the process and check elements like lookup tables.
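
A hedged sketch of the kind of lookup-table check such a script might perform; the table and values are invented, and MERGE assumes SQL Server 2008 or later:

    -- Hypothetical upload/validation step: make sure the OrderStatus lookup table
    -- contains exactly the expected rows, without duplicating them on re-run.
    MERGE dbo.OrderStatus AS target
    USING (VALUES
        (1, 'Pending'),
        (2, 'Shipped'),
        (3, 'Cancelled')
    ) AS source (StatusId, StatusName)
    ON target.StatusId = source.StatusId
    WHEN MATCHED AND target.StatusName <> source.StatusName THEN
        UPDATE SET StatusName = source.StatusName
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (StatusId, StatusName) VALUES (source.StatusId, source.StatusName);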

TomTom
A: 

I have used several strategies in the past, but here is one that works:

  • Create the DDL (database creation) script(s) and check them into source control
  • Create a separate data population script (or set of scripts) for each of your configurations
  • Set up the automated build to create separate deployment packages for each configuration, which will include the appropriate scripts (see the sketch just below)
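
A rough sketch of what one of those per-configuration packages might boil down to, written as a SQLCMD-mode wrapper script; the file and folder names are placeholders:

    -- Hypothetical deployment wrapper for one configuration, run in SQLCMD mode.
    :on error exit

    -- Shared schema (DDL) scripts, identical for every configuration:
    :r .\schema\001_create_tables.sql
    :r .\schema\002_create_procs.sql

    -- Data population scripts specific to this configuration:
    :r .\data\configA\lookup_values.sql
    :r .\data\configA\sample_transactions.sql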

If you are dealing with a system that is in production and has data that you can't wipe out:

  • Create a separate set of "upgrade/patch" scripts that does updates to the schema (alter tables, create or replace procs, etc. for objects that already exist on the deployment target)
  • Create "insert" scripts, if required, for any data that needs to be populated for each build; these should be 're-runnable' (not mess up the data if the patch is deployed twice)
Guy Starbuck
A: 

For tables that hold configuration-type data (not transactional) we use Excel. We insert a VBA script into the spreadsheet to handle the save event and have it spit out SQL insert statements upon save. Business analysts and customers love Excel, so this technique works great for them to give us predefined scenarios to test with. We usually source-control the output .sql file so we can use it to load data for automated builds, and the Excel file goes in the Team SharePoint site.
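
The generated file is just plain INSERT statements; something along these lines, with an invented table and invented values:

    -- Hypothetical output of the spreadsheet's save-event macro: one INSERT per
    -- row of configuration data, checked into source control as a .sql file.
    INSERT INTO dbo.TaxRate (Region, Rate, EffectiveDate) VALUES ('NY', 0.08875, '2009-01-01');
    INSERT INTO dbo.TaxRate (Region, Rate, EffectiveDate) VALUES ('NJ', 0.07000, '2009-01-01');
    INSERT INTO dbo.TaxRate (Region, Rate, EffectiveDate) VALUES ('CA', 0.08250, '2009-01-01');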

DancesWithBamboo
A: 

This is best handled as two separate subjects. On the one hand, you want a solid and consistent database schema (tables, indexes, views, procedures, functions, and also lookup values and any non-changing "static" data required by your system), and you want version control over that so you can track what changes over time (and by whom) and can also control when the changes get applied to which database instances. Prior posts to this question have covered this subject well.

On the other hand, you will need the database populated with data against which you can test and develop new code. Defining and loading such data is not the same as defining and loading the structures that will hold it. While managing database definitions via source control can be a readily solved problem, over the past many years I have never heard of an equally simple (well, relatively simple) solution for addressing the data problem. Aspects of the problem include:

  • Make sure there's enough data. Adding 10-20 rows per table is easy, but you can't possibly predict performance if your live databases will contain millions of rows or more.

  • A quick and easy solution is to get a copy of the latest Production database, update it with the recent changes, and off you go. This can be tricky if the development environment doesn't have a SAN on which to host a copy of the multi-TB of Production data you're supporting.

  • Similarly, the SOX and/or HIPAA auditors might not want extra copies of potentially confidential data sitting on not-so-secure development servers (in front of not-so-secure developers; we are a shifty bunch, after all). You might need to scramble or randomize sensitive data before making it available to developers... which implies an interim "scrambler" process to sanitize the data (perhaps another SAN for all those TB?). A rough sketch of such a scrambling pass follows this list.

  • In some situations, it'd be ideal for some department or other to provide you with a correct, coherent, and coordinated set of data to do development against: something they make up to cover all the situations likely to arise, and that they could use for testing on their side (knowing what goes in, they know what should come out, and can check for it). Of course the effort to create such a set of data is substantial, and convincing non-IT groups to provide such data sets may be politically impossible. But it's a nice dream.

  • And of course the data changes. After you've worked the copy over in development for a week, a month, a quarter, eventually and inevitably you will discover that the Production data doesn't "look" like that any more -- usage patterns will have changed, averages of significant values will drift, all your dates will be old and irrelevant... whatever, you'll need to get fresh data all over again.
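
Going back to the scrambling point above, a rough sketch of the sort of sanitizing pass that could be run on the restored copy before developers ever see it; the table and columns are invented:

    -- Hypothetical "scrambler": overwrite anything sensitive with plausible but
    -- meaningless values while keeping row counts and data shapes intact.
    UPDATE dbo.Customer
    SET    CustomerName = 'Customer ' + CAST(CustomerId AS varchar(10)),
           Email        = 'customer' + CAST(CustomerId AS varchar(10)) + '@example.com',
           Phone        = '555-01' + RIGHT('00' + CAST(CustomerId % 100 AS varchar(2)), 2),
           SSN          = NULL;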

It's an ugly problem with no simple solution that I've ever heard of. One possibility that could help: I recall reading articles in the past about products that can be used to "stuff" a database with made-up yet statistically relevant data. You specify things like "10,000 rows in this table, this col is an identity primary key, this tinyint ranges from 1-10 with equal distribution, this varchar ranges from 6 to 30 characters with maybe 2% duplicates", and so forth. Something like this might be invaluable, but it all depends upon the circumstances in which you find yourself.
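
For a taste of what such a tool does, here is a hand-rolled sketch that stuffs an invented table to roughly that kind of specification:

    -- Hypothetical data "stuffer": 10,000 rows, a sequential key, a tinyint
    -- evenly distributed over 1-10, and a varchar of 6 to 30 characters.
    WITH n AS (
        SELECT TOP (10000)
               CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS int) AS rn
        FROM sys.all_objects a
        CROSS JOIN sys.all_objects b
    )
    INSERT INTO dbo.StuffedTable (Id, Bucket, Label)
    SELECT rn,
           CAST(1 + ABS(CHECKSUM(NEWID())) % 10 AS tinyint),          -- uniform 1-10
           LEFT(REPLICATE('x', 30), 6 + ABS(CHECKSUM(NEWID())) % 25)  -- 6-30 characters
    FROM n;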

Philip Kelley