Sometimes our project tree contains binary files, such as jpg, png, doc, xls, or pdf. Can Git, Mercurial, SVN, or other tools do a good job when only part of a binary file is changed?

For example, suppose the spec is a 4MB .doc file in the repository. If it is edited 100 times during the year, changing just 1 or 2 lines each time, and checked in after every edit, then the history is 400MB if every revision is stored in full.

With 100 different .doc and .xls files like that, it is 40GB, which is not a size that is easy to manage.
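The arithmetic behind those numbers can be sketched as follows. This is a worst-case estimate that assumes every check-in stores a complete copy of the file, with no delta storage or compression; the function name and the decimal units (1GB = 1000MB) are choices made for this illustration only.

```python
def worst_case_history_mb(file_size_mb: float, commits: int, files: int = 1) -> float:
    """Worst-case history size in MB, assuming each commit stores a full copy."""
    return file_size_mb * commits * files

# One 4MB document checked in 100 times.
one_file_mb = worst_case_history_mb(4, 100)

# 100 such documents, converted to GB (assuming 1GB = 1000MB).
all_files_gb = worst_case_history_mb(4, 100, files=100) / 1000

print(f"one file:  {one_file_mb:.0f}MB")
print(f"100 files: {all_files_gb:.0f}GB")
```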

I have tried Git and Mercurial, and both seem to add a large amount of data even when only 1 line is changed in a .doc or .pdf. Is there another way in Git, Mercurial, or SVN to handle this?

+2  A: 

See the Mercurial wiki page on binary files at http://mercurial.selenic.com/wiki/BinaryFiles. Your main problem is that even minor changes in files such as doc will trigger large changes in the file structure (partly because it's zipped).

Therefore, I don't believe you'll find any nice way to handle these files in a version control system.

CFP
This is a valid point: it might be better to configure Word, Excel and OpenOffice to save by default in their "bloated" XML-based formats, as there is a better chance that SCMs can detect the differences.
Peter Tillemans
@Peter Tillemans: It's possible, at least with `git`, to set up a hook to run `tidy` on the XML data before committing it; this might make the resulting diffs smaller. It might be necessary to install `cygwin` in order to get `tidy` under Windows, though. This also assumes that the MS formats are consistent enough that the applications can still read the files after they've been `tidy`ed.
intuited
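As a rough illustration of why normalizing XML before committing can help, here is a minimal sketch. It uses Python's standard `xml.dom.minidom` as a stand-in for `tidy`, and the sample XML, `normalize_xml` and `changed_lines` are made up for this example. It does not show the hook itself, and it does not check whether an office application can still open the reformatted file.

```python
from xml.dom import minidom

def normalize_xml(xml_text: str) -> str:
    """Re-indent XML so each element sits on its own line (stand-in for tidy)."""
    return minidom.parseString(xml_text).toprettyxml(indent="  ")

def changed_lines(a: str, b: str) -> int:
    """Count lines that differ between two texts, position by position."""
    lines_a, lines_b = a.splitlines(), b.splitlines()
    differing = sum(1 for x, y in zip(lines_a, lines_b) if x != y)
    return differing + abs(len(lines_a) - len(lines_b))

# Hypothetical single-line XML, as an application might write it.
old_xml = "<doc><p>Hello</p><p>World</p></doc>"
new_xml = "<doc><p>Hello</p><p>There</p></doc>"

# Without normalization the whole (single) line counts as changed.
print("raw lines changed:       ", changed_lines(old_xml, new_xml))

# After normalization only the line holding the edited paragraph differs.
print("normalized lines changed:", changed_lines(normalize_xml(old_xml), normalize_xml(new_xml)))
```

A line-based diff of the normalized files then points at the edited element instead of flagging the entire file.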
+4  A: 

There exist binary diff tools, however they don't help much, since the change in one pixel of an image, or a change of one character in a Word document, does not correspond to change of one byte in the file, due to compression. Thus "nice" handling of such binary data is impossible.
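The effect of compression on a small edit can be seen with a short experiment. This is only a sketch: it uses Python's `zlib` on synthetic text rather than a real image or Word file, and `count_differing_bytes` is a helper written for this example.

```python
import zlib

def count_differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where a and b differ, plus any difference in length."""
    shared = min(len(a), len(b))
    differing = sum(1 for i in range(shared) if a[i] != b[i])
    return differing + abs(len(a) - len(b))

# Synthetic "document": repetitive text standing in for real content.
original = "\n".join(
    f"Line {i}: the quick brown fox jumps over the lazy dog" for i in range(2000)
).encode("ascii")

# Change exactly one byte in the middle of the document.
edited_buffer = bytearray(original)
edited_buffer[len(edited_buffer) // 2] = ord("X")
edited = bytes(edited_buffer)

raw_diff = count_differing_bytes(original, edited)
compressed_diff = count_differing_bytes(zlib.compress(original), zlib.compress(edited))

print(f"uncompressed bytes that differ: {raw_diff}")
print(f"compressed bytes that differ:   {compressed_diff}")
```

The uncompressed versions differ in a single byte, while the compressed versions differ in more than one place, which is what makes byte-level diffs of compressed formats so unhelpful.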

If you want to commit such documents, consider committing uncompressed variants - RTF instead of DOC, TeX instead of PDF, etc. If the version control system employs compression to compress its internal repository, then this method should work rather well. For instance, in Git,

Newly added objects are stored in their entirety using zlib compression.

EDIT: I just wanted to note that even RTF is horrible, but not as horrible as DOC. If you can switch to TXT or TeX for your documents, that would be best.

Amadan
PostScript is another alternative to TeX. As noted in another answer, Word can also save files in an XML format, which would be possible to diff.
Matthew Talbert
+2  A: 

I have been using Git to synchronize my documents between Mac, Linux and Windows machines. I had to do one redesign to work around a 2GB file limitation on Windows. In total it is around 7GB in 3 repositories which are regularly synced. At one point I even had a remote copy on a hosted server somewhere on the internet.

Now, I almost never need to clone these repos, so the large size is not much of a hindrance. I also see that the .git directory does not grow significantly; it remains at around 40-60% of the size of the checked-out docs, PDFs and Excel sheets.

Changing a line in a doc or pdf file changes a lot in the file, as the formatting effects ripple through. Similarly, changing a cell in an XLS file can change a lot of other cells.

However, compared with the alternative of not having the documents under version control, I am happy to live with less-than-stellar compression ratios.

Peter Tillemans
+7  A: 

In general, version control systems work better with text files. The whole merge/conflict concept is really based around source code. However, SVN works pretty well for binary files. (We use it to version CAD drawings.)

I will point out that file locking (svn:needs-lock) is pretty much a must-have when multiple people work on a common binary file. Without file locking, it is possible for 2 people to work on a binary file at once. Someone commits their changes first. Guess what happens to the person who didn't commit: all of the binary, unmergeable work they did is effectively lost. File locking serializes work on the file. You do lose the "concurrent" access capabilities of a version control system, but you still have the benefits of a commit log, rolling back to a previous version, etc.

The TortoiseSVN client is smart enough to use MS Word's built-in merge tool to diff a doc/docx file. It also has configuration options to let you specify alternate diff tools based on file extension, which is pretty cool. (It's a shame no one has made a diff tool for our CAD package.)

Current-generation DVCSes like Git or Hg tend to suck with binary files though. They don't have any sort of mechanism for file locking.

msemack
+1 for svn:needs-lock on binary files
JeremyP
A: 

IMHO, you should stop using an SCM to manage documents like these. You should use a dedicated tool like Alfresco (I'm sure there are many other tools for document management).

Alexandre Hamez