views:

160

answers:

3

The application's code and configuration files are maintained in a code repository. But sometimes, as part of the project, I also have some data (which in some cases can be >100MB, or even >1GB), which is stored in a database. Git does a nice job of handling the code and its changes, but how can the development team easily share the data?

It doesn't really fit in the code version control system, as it is mostly large binary files and would make pulling updates a nightmare. But it does have to be synchronised with the repository, because some code revisions change the schema (i.e. migrations).

How do you guys handle such situations?

+2  A: 

We usually use a database sync or replication scheme.

Each developer has two copies of the database: one working copy and another kept only as the synced version.

When the code is synchronized, the script syncs the database too (the central DB against the developer's "dead" copy). After that, each developer updates their own working copy. Sometimes a developer needs to keep some of their data, so this second update is not always driven by the standard script.
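
A minimal sketch of what such a sync script might look like, assuming MySQL and placeholder host/database names (adapt it to your own DB and replication setup):

    #!/usr/bin/env python
    # Hypothetical sync script: refresh the local "dead" copy from the central DB,
    # then rebuild the working copy from the "dead" copy. Assumes MySQL and that
    # the mysqldump/mysql clients are on the PATH; all names are placeholders.
    import subprocess

    CENTRAL_HOST = "db.example.com"   # central database server (assumption)
    DEAD_COPY = "project_sync"        # local pristine copy, never edited by hand
    WORKING_COPY = "project_dev"      # the developer's working database

    def dump_and_load(src_host, src_db, dest_db):
        """Stream a dump of src_db on src_host into the local dest_db."""
        dump = subprocess.Popen(
            ["mysqldump", "-h", src_host, "--single-transaction", src_db],
            stdout=subprocess.PIPE)
        load = subprocess.Popen(["mysql", dest_db], stdin=dump.stdout)
        dump.stdout.close()
        load.communicate()
        if dump.wait() != 0 or load.returncode != 0:
            raise RuntimeError("sync of %s failed" % dest_db)

    # Step 1: central DB -> local "dead" copy (runs whenever the code is synced).
    dump_and_load(CENTRAL_HOST, "project", DEAD_COPY)

    # Step 2: "dead" copy -> working copy. Developers who need to preserve local
    # data replace this step with their own, more selective, update.
    dump_and_load("localhost", DEAD_COPY, WORKING_COPY)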

It is only as robust as the replication scheme itself, and sometimes (depending on the DB) that is not good news.

belisarius
+3  A: 

We have the data and schema stored in XML and use Liquibase to handle updates to both the schema and the data. The advantage here is that you can diff the files to see what's going on, it plays nicely with any VCS, and you can automate it.

Due to the size of your database, this would mean a sizable "version 0" file. But with the migration strategy, the updates after that should be manageable, as they would only be deltas. You might also be able to convert your existing migrations one-to-one to Liquibase, which might be nicer than a big-bang approach.
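
As a rough illustration of the automation (not the actual setup described above), a small wrapper can run the Liquibase update after every pull; the changelog path and connection settings below are placeholders:

    #!/usr/bin/env python
    # Sketch of automating a Liquibase update after a VCS pull. Assumes the
    # Liquibase CLI is installed; the changelog path and JDBC settings are
    # placeholders for whatever the project actually uses.
    import subprocess

    subprocess.check_call([
        "liquibase",
        "--changeLogFile=db/changelog-master.xml",   # "version 0" plus deltas
        "--url=jdbc:mysql://localhost/project_dev",  # assumed local dev database
        "--username=dev",
        "--password=dev",
        "update",
    ])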

You can also leverage @belisarius' strategy if your deltas are very large, so that each developer doesn't have to apply them individually.

StevenWilkins
+3  A: 

It seems to me that your database has a lot of parallels with a binary library dependency: it's large (well, much larger than a reasonable code library!), binary, and has its own versions which must correspond to various versions of your codebase.

With this in mind, why not integrate a dependency manager (e.g. Apache Ivy) with your build process and let it manage your database? This seems like just the sort of task that a dependency manager was built for.
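
To illustrate the idea only (this is not Ivy itself, just a stand-in for what a dependency-managed fetch would do), a build step might download a database snapshot whose version is pinned alongside the code; the artifact URL and naming scheme here are purely hypothetical:

    #!/usr/bin/env python
    # Illustration of treating the database as a versioned binary dependency:
    # fetch the snapshot whose version is pinned in the repo, much as a
    # dependency manager such as Ivy would. The artifact repository URL,
    # file layout and version file are all hypothetical.
    import urllib.request

    # Version pinned in the codebase, e.g. committed as db/VERSION.
    with open("db/VERSION") as f:
        db_version = f.read().strip()

    artifact = "project-db-%s.sql.gz" % db_version
    url = "https://artifacts.example.com/db/%s" % artifact  # hypothetical repo

    print("Fetching database snapshot %s ..." % artifact)
    urllib.request.urlretrieve(url, artifact)
    # A separate restore step would then load the snapshot locally.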

Regarding the sheer size of the data/download, I don't think there's any magic bullet (short of some serious document pre-loading infrastructure) unless you can serialize the data into a delta-able format (the XML/JSON/SQL you mentioned).

A second approach (maybe not so compatible with dependency management): if the specifics of your code allow it, you could keep a second file that is a manual diff, one that can take a base (version 0) database and bring it up to version X. Every developer will need to keep a clean version 0. A pull (of a version with a changed DB) would then consist of: pull the diff file, copy version 0 to the working database, apply the diff file. Note that applying the diff file might take a while for a sizable DB, so you may not be saving as much time over the straight download as it first seems.
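
A sketch of that pull sequence, again assuming MySQL and hypothetical file and database names:

    #!/usr/bin/env python
    # Sketch of the "version 0 + diff" pull: rebuild the working DB from the
    # pristine version-0 dump, then apply the committed diff file to bring it
    # up to the revision's version X. MySQL and all names are assumptions.
    import subprocess

    WORKING_DB = "project_dev"
    BASE_DUMP = "db/version0.sql"    # clean version 0, kept by every developer
    DIFF_FILE = "db/diff-to-vX.sql"  # pulled along with the code revision

    def run_sql(db, sql_file):
        """Feed a SQL file into the given local database."""
        with open(sql_file) as f:
            subprocess.check_call(["mysql", db], stdin=f)

    # Reset the working database to version 0...
    run_sql(WORKING_DB, BASE_DUMP)
    # ...then apply the diff; for a sizable DB this step can take a while.
    run_sql(WORKING_DB, DIFF_FILE)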

Greg Harman
Thanks Greg, your first solution sounds cool. I'll check it out. Your second solution - db data migrations - is very nice in theory, but I couldn't get it to work in real life (I'm currently using it, and that's the reason for my question). It'll take longer than a comment to explain why. I should probably write a blog post about it :)
Ofri Raviv
@Ofri, sure that makes sense. Still might be useful for the general case - it's what I find myself doing most often in similar situations. Unfortunately, I think that there are going to be some big downloads and no way around it... good thing we've moved past dial-up. :-)
Greg Harman
I think the key part is "which must correspond to various versions of your codebase". We found that not all developers agree to this, as the data for unit testing (for example) evolves too quickly ... much faster than their expected code commits. Just my 2c
belisarius
@belisarius If developers are committing code changes that require changes to the testing framework (i.e. the data) then I'd say those commits constitute a new revision that could (should?) be tagged as such. Now one concern could be that a single monolithic database covers testing ground for many unrelated pieces of the code. In this case, I'd say the database needs to be redesigned and broken up into smaller segments that align with a compact area of the code.
Greg Harman
@Greg Harman: Agreed. I just tried to point out that the sync is not a trivial problem.
belisarius