I'm working on a web app, and I need to create a branch for some major changes. The thing is, these changes require changes to the database schema, so I'd like to put the entire database under git as well.

How do I do that? Is there a specific folder that I can keep under a git repository? How do I know which one? How can I be sure that I'm putting the right folder under version control?

I need to be sure, because these changes are not backward compatible; I can't afford to screw up.

The database in my case is PostgreSQL.

Edit:

Someone suggested taking backups and putting the backup file under version control instead of the database. To be honest, I find that really hard to swallow.

There has to be a better way.

Update:

OK, so there's no better way, but I'm still not quite convinced, so I will change the question a bit:

I'd like to put the entire database under version control. What database engine can I use so that I can put the actual database under version control instead of its dump?

Would SQLite be git-friendly?

Since this is only the development environment, I can choose whatever database I want.

Edit2:

What I really want is not to track my development history, but to be able to switch from my "new radical changes" branch to the "current stable branch" and, for instance, fix some bugs/issues with the current stable branch, such that when I switch branches the database auto-magically becomes compatible with the branch I'm currently on. I don't really care much about the actual data.

+11  A: 

Take a database dump, and version control that instead. That way it is a flat text file.

Personally I suggest that you keep both a data dump and a schema dump. That way, using diff, it becomes fairly easy to see what changed in the schema from revision to revision.

If you are making big changes, you should have a secondary database that you make the new schema changes to, and not touch the old one, since as you said you are making a branch.
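
For reference, a minimal sketch of those two dumps (the database name my_db and the file names are just examples):

    # schema only, as plain SQL that diffs cleanly
    $ pg_dump --schema-only my_db > schema.sql

    # data only, also as plain SQL
    $ pg_dump --data-only my_db > data.sql

Both outputs are flat text, so git can diff and merge them like any other source file.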

X-Istence
What? There's gotta be a better way.
hasen j
PostgreSQL database files are binary files. Feel free to put them in your git repository; you just won't be able to do any diffs on them, and any change will most likely touch the whole database, so you would have to send the full database over the wire to your git repo and store it. This is inefficient, slow, and makes it extremely hard to work with. Also, I am not sure that the database files stored on disk are "stable" (as in, all of the data is always correct) unless you run VACUUM and shut PostgreSQL down before making the copy, so you could be left with corrupt data.
X-Istence
Hmm, I see! Well, are there db systems that are more git-friendly?
hasen j
This type of solution is pretty standard and the schema *is* actually source code.
Dana the Sane
A: 

You can't do it without atomicity, and you can't get atomicity without either using pg_dump or a snapshotting filesystem.

My postgres instance is on ZFS, which I snapshot occasionally. It's approximately instant and consistent.
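
For anyone unfamiliar with ZFS, a snapshot is a one-liner; the dataset name tank/pgdata below is made up for illustration:

    # atomic, point-in-time snapshot of the dataset holding the data directory
    $ zfs snapshot tank/pgdata@before-schema-change

    # roll back to it later if the migration goes wrong (stop PostgreSQL first)
    $ zfs rollback tank/pgdata@before-schema-change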

Dustin
+5  A: 

Check out Refactoring Databases (http://databaserefactoring.com/) for a bunch of good techniques for maintaining your database in tandem with code changes.

Suffice it to say that you're asking the wrong questions. Instead of putting your database into git, you should be decomposing your changes into small, verifiable steps so that you can migrate/roll back schema changes with ease.

If you want full recoverability, you should consider archiving your postgres WAL logs and using PITR (point-in-time recovery) to play transactions back/forward to specific known-good states.
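
As a rough sketch, WAL archiving is switched on in postgresql.conf; the archive directory here is just an example path:

    # postgresql.conf
    archive_mode = on
    archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'

The server expands %p and %f to the path and file name of each WAL segment; the recovery side is covered in the PostgreSQL continuous-archiving documentation.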

Paul Lindner
A: 

I think X-Istence is on the right track, but there are a few more improvements you can make to this strategy. First, use:

    $ pg_dump --schema-only ...

to dump the tables, sequences, etc. and place this file under version control. You'll use this to track the compatibility changes between your branches.

Next, perform a data dump for the set of tables that contain configuration required for your application to operate (you should probably skip user data), like form defaults and other non-user-modifiable data. You can do this selectively using:

    $ pg_dump --table=... or --exclude-table=...

This is a good idea because the repo can get really clunky once a full data dump of your database reaches 100 MB or more. A better idea is to back up only the minimal set of data that you need to test your app. If your default data is very large, though, this may still cause problems.
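
Putting those two dumps together, a hedged example; the database name my_db and the table names form_defaults and app_config are invented for illustration:

    # the schema, for every branch to diff against
    $ pg_dump --schema-only my_db > db/schema.sql

    # only the configuration tables the app needs to run
    $ pg_dump --data-only --table=form_defaults --table=app_config my_db > db/seed_data.sql

Both files stay small and text-based, so they diff cleanly between branches.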

If you absolutely need to place full backups in the repo, consider doing it in a branch outside of your source tree. An external backup system with some reference to the matching revision is probably best for this, though.

Also, I suggest using text-format dumps rather than binary ones for revision purposes (for the schema at least), since they are easier to diff. You can always compress them to save space before checking them in.

Finally, have a look at the postgres backup documentation if you haven't already. The way you talk about backing up 'the database' rather than a dump makes me wonder whether you're thinking of file-system-based backups (see section 23.2 for the caveats).

Dana the Sane
Isn't the dump just a backup?
hasen j
Yes, but you can restore it to an alternate database and make your modifications there.
Dana the Sane
A: 

I'm starting to think of a really simple solution; I don't know why I didn't think of it before!

  • Duplicate the database, (both the schema and the data).
  • In the branch for the new-major-changes, simply change the project configuration to use the new duplicate database.

This way I can switch branches without worrying about database schema changes.

EDIT:

By duplicate, I mean create another database with a different name (like my_db_2), not making a dump or anything like that.
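
In PostgreSQL that duplication is a one-liner, using the existing database as a template (my_db and my_db_2 as above; note the server will refuse while other sessions are connected to my_db):

    CREATE DATABASE my_db_2 TEMPLATE my_db;

or equivalently from the shell:

    $ createdb --template=my_db my_db_2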

hasen j
This is what I was trying to get at. Sorry if I wasn't clear enough.
Dana the Sane
+1  A: 

Use something like LiquiBase (http://www.liquibase.org/home). This lets you keep your LiquiBase files under revision control. You can tag changes for production only, and have lb keep your DB up to date for either production or development (or whatever scheme you want).
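
For context, a LiquiBase changelog is a plain XML file, so it versions like any other source file. A minimal sketch, with invented table/column names:

    <databaseChangeLog xmlns="http://www.liquibase.org/xml/ns/dbchangelog">
      <changeSet id="20090211-01" author="craig">
        <addColumn tableName="users">
          <column name="email" type="varchar(255)"/>
        </addColumn>
      </changeSet>
    </databaseChangeLog>

Running liquibase update then applies any changeSets the target database hasn't seen yet, which is what keeps a branch's database in step with its code.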

Craig
awesome find man!
hasen j
LiquiBase's best practices recommend keeping schema creation scripts as a set of sequential scripts to be run in order. While this is a good best practice, I don't see how it would work without a central repository, which is un-GIT.
Frank Schwieterman
Well, it would work across git just fine if you are careful about your id= and author= tags. In theory each user would have their own author entry (good), and if you do something reasonable with id=, say YYYYMMDD_REV, then you are pretty much good to go. Even with git, almost everyone has a 'central repo' for a given project; 99% of people don't work without something 'central'. Again, LiquiBase files are just plain-text, XML-ish files with a stack of commands to execute against a given DB (or set of DBs). Chances are 99% of all projects would have no issues following this in practice, even with DVCSs.
Craig
A: 

This question is pretty much answered, but I would like to complement X-Istence's and Dana the Sane's answers with a small suggestion.

If you need revision control with some degree of granularity, say daily, you could couple the text dumps of both the tables and the schema with a tool like rdiff-backup, which does incremental backups. The advantage is that instead of storing snapshots of daily backups, you simply store the differences from the previous day.
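
A hedged sketch of that setup; the paths and the database name my_db are made up:

    # write today's dumps over yesterday's
    $ pg_dump --schema-only my_db > /var/backups/db_dumps/schema.sql
    $ pg_dump --data-only my_db > /var/backups/db_dumps/data.sql

    # store only the delta against the previous run
    $ rdiff-backup /var/backups/db_dumps /var/backups/db_history

rdiff-backup keeps the latest files as-is plus reverse diffs, so any earlier day stays recoverable without storing full copies.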

With this you get the advantage of revision control without wasting too much space.

In any case, using git directly on big flat files that change very frequently is not a good solution. If your database becomes too big, git will start to have problems managing the files.

Unode
A: 

Use a tool like iBatis Migrations (manual, short tutorial video), which lets you version control the changes you make to a database throughout the lifecycle of a project, rather than the database itself.

This allows you to selectively apply individual changes to different environments, keep a changelog of which changes exist in which environments, create scripts to apply changes A through N, roll back changes, etc.
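
A sketch of the usual command-line flow, assuming the migrate launcher is on your PATH (the description string is just an example):

    # set up the migrations directory structure once
    $ migrate init

    # create a new, timestamped migration script
    $ migrate new "add email column to users"

    # apply pending changes, or see which are applied where
    $ migrate up
    $ migrate status

Each generated script contains a forward section and an @UNDO section, which is what makes the selective apply/rollback described above possible.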

matt b