I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after making a change git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16.

Can git be taught to recognize that this file is text and handle it appropriately?

I'm using git under Cygwin, with core.autocrlf set to false. I could use mSysGit or git under UNIX, if necessary.

+4  A: 

By default, git does not seem to work well with UTF-16. For such a file you have to make sure that no CRLF processing is done on it, while still letting diff and merge treat it as a normal text file (this ignores whether or not your terminal/editor can handle UTF-16).

Looking at the .gitattributes manpage, here is how the built-in binary macro attribute is defined:
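For background on why git flags such a file as binary: as far as I know, git's heuristic looks for NUL bytes near the start of the file, and ASCII text encoded as UTF-16 carries one NUL byte per character. A minimal sketch with iconv, assuming it supports the UTF-16LE encoding name (the sample string is made up):

```shell
# Encode a short ASCII line as UTF-16LE, then count total bytes and NUL bytes.
sample='hello'
utf16_bytes=$(( $(printf '%s\n' "$sample" | iconv -f UTF-8 -t UTF-16LE | wc -c) ))
# tr -cd '\000' deletes everything except NUL bytes, so wc counts only those.
nul_bytes=$(( $(printf '%s\n' "$sample" | iconv -f UTF-8 -t UTF-16LE | tr -cd '\000' | wc -c) ))
echo "total bytes: $utf16_bytes, NUL bytes: $nul_bytes"
```

Half of the bytes are NUL here, which is why a UTF-16 file of mostly ASCII content looks like binary data to a byte-oriented check.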

[attr]binary -diff -crlf

So it seems to me that you could define a custom attribute in your top level .gitattributes for utf16 (note that I add merge here to be sure it is treated as text):

[attr]utf16 diff merge -crlf

From there you would be able to specify in any .gitattributes file something like:

*.vmc utf16

Also note that you should still be able to diff a file, even if git thinks it's binary with:

git diff --text

Edit

This answer basically says that GNU diff with UTF-16 or even UTF-8 doesn't work very well. If you want to have git use a different tool to see differences (via --ext-diff), that answer suggests Guiffy.

But what you likely need is just to diff a UTF-16 file that contains only ASCII characters. A way to get that to work is to use --ext-diff and the following shell script:

#!/bin/bash

# Convert both UTF-16 inputs to UTF-8 temp files, then diff those.
TMPFILE1=$(mktemp "/tmp/$(basename "$1").XXXXXX") || exit 1
TMPFILE2=$(mktemp "/tmp/$(basename "$2").XXXXXX") || exit 1

iconv -f utf-16 -t utf-8 "$1" > "$TMPFILE1"
iconv -f utf-16 -t utf-8 "$2" > "$TMPFILE2"

diff "$TMPFILE1" "$TMPFILE2"

rm -f "$TMPFILE1" "$TMPFILE2"
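One caveat when hooking a script like this into git: the script above takes the two files as its first and second arguments, whereas git (through GIT_EXTERNAL_DIFF or diff.external) calls an external diff program with seven arguments: path, old-file, old-hex, old-mode, new-file, new-hex, new-mode. A hedged sketch of a driver that picks out the right two follows. The function name utf16_ext_diff is made up for illustration, and returning 0 assumes git treats a non-zero exit from the driver as a failure.

```shell
#!/bin/sh
# Sketch of an external diff driver for UTF-16 files (hypothetical name).
# Assumption: git passes seven arguments:
#   path old-file old-hex old-mode new-file new-hex new-mode
utf16_ext_diff() {
    old_file=$2
    new_file=$5
    tmp_old=$(mktemp) || return 1
    tmp_new=$(mktemp) || return 1
    # Convert both sides to UTF-8 so a byte-oriented diff can compare them.
    iconv -f UTF-16 -t UTF-8 "$old_file" > "$tmp_old"
    iconv -f UTF-16 -t UTF-8 "$new_file" > "$tmp_new"
    diff -u "$tmp_old" "$tmp_new"
    rm -f "$tmp_old" "$tmp_new"
    # diff exits 1 when the files differ; report success to the caller anyway.
    return 0
}

# When saved as a standalone script, forward git's arguments to the function.
if [ "$#" -eq 7 ]; then
    utf16_ext_diff "$@"
fi
```

Saved as an executable script, it could be tried with something like GIT_EXTERNAL_DIFF=/path/to/script git diff --ext-diff, where the script path is a placeholder.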

Note that converting to UTF-8 might work for merging as well; you just have to make sure the conversion is done in both directions.
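To illustrate what "both directions" means, here is a small round-trip sketch. The helper names and the sample content are made up, and it assumes the file is UTF-16LE without a byte order mark; a real .vmc file may start with a BOM, and a plain "utf-16" target can pick a platform-dependent byte order, so the restored bytes should always be checked against the original.

```shell
# Hypothetical helpers; both read stdin and write stdout.
utf16_to_utf8() { iconv -f UTF-16LE -t UTF-8; }
utf8_to_utf16() { iconv -f UTF-8 -t UTF-16LE; }

work=$(mktemp -d) || exit 1
# Build a tiny UTF-16LE sample, convert it to UTF-8, then convert it back.
printf 'name=vm1\n' | utf8_to_utf16 > "$work/original.vmc"
utf16_to_utf8 < "$work/original.vmc" > "$work/as_utf8.txt"
utf8_to_utf16 < "$work/as_utf8.txt" > "$work/restored.vmc"

# The round trip is only safe if the restored bytes match the original.
if cmp -s "$work/original.vmc" "$work/restored.vmc"; then
    roundtrip=ok
else
    roundtrip=mismatch
fi
echo "round trip: $roundtrip"
rm -rf "$work"
```

A pair of commands like these is the kind of thing git's clean and smudge filter settings (filter.<name>.clean and filter.<name>.smudge) are meant to hold, though whether that suits a file Virtual PC rewrites on its own is untested here.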

As for the output to the terminal when looking at a diff of a UTF-16 file:

Trying to diff like that results in binary garbage spewed to the screen. If git is using GNU diff, it would seem that GNU diff is not unicode-aware.

GNU diff doesn't really care about unicode, so when you use diff --text it just diffs and outputs the text. The problem is that the terminal you're using can't handle the UTF-16 that's emitted (combined with the diff marks that are ASCII characters).

Jared Oberhaus
Trying to diff like that results in binary garbage spewed to the screen. If git is using GNU diff, it would seem that GNU diff is not unicode-aware.
skiphoppy
GNU diff doesn't really care about unicode, so when you use diff --text it just diffs and outputs the text. The problem is that the terminal you're using can't handle the UTF-16 that's emitted (combined with the diff marks that are ASCII characters).
Jared Oberhaus
+2  A: 

Have you tried setting your .gitattributes to treat it as a text file?

e.g.

*.vmc diff

More details: http://www.kernel.org/pub/software/scm/git/docs/gitattributes.html

Chealion
+2  A: 

One solution is to filter the file through cmd.exe /c "type %1". cmd's type builtin does the conversion, so you can combine it with the textconv ability of git diff to enable text diffing of UTF-16 files (it should work with UTF-8 as well, although that is untested).

Quoting from gitattributes man page:

==============================

Performing text diffs of binary files

Sometimes it is desirable to see the diff of a text-converted version of some binary files. For example, a word processor document can be converted to an ASCII text representation, and the diff of the text shown. Even though this conversion loses some information, the resulting diff is useful for human viewing (but cannot be applied directly).

The textconv config option is used to define a program for performing such a conversion. The program should take a single argument, the name of a file to convert, and produce the resulting text on stdout.

For example, to show the diff of the exif information of a file instead of the binary information (assuming you have the exif tool installed), add the following section to your $GIT_DIR/config file (or $HOME/.gitconfig file):

[diff "jpg"]
textconv = exif

==============================

This is a solution for mingw32; cygwin fans may have to alter the approach. The issue is passing the name of the file to convert to cmd.exe: git supplies it with forward slashes, while cmd assumes backslash directory separators.

Step 1:

Create a single-argument script that does the conversion and writes the result to stdout. c:\path\to\some\script.sh:

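#!/bin/bash
# Replace each / in the path with \\ (cmd.exe expects backslash separators).
SED='s/\//\\\\/g'
FILE=$(echo "$1" | sed -e "$SED")
# cmd's type builtin prints the file as text on stdout.
cmd.exe /c "type $FILE"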

Step 2:

Set up git to be able to use the script file. Inside your git config (~/.gitconfig or .git/config or see man git-config), put this:

[diff "cmdtype"]
textconv = c:/path/to/some/script.sh

Step 3:

Point out the files to apply this workaround to by using .gitattributes files (see man 5 gitattributes):

*.vmc diff=cmdtype

then use git diff on your files.
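For Cygwin or UNIX, where cmd.exe is awkward or unavailable, the same textconv mechanism could probably be driven by iconv instead. This is a sketch, not something taken from the answer above: the helper name utf16_textconv, the script name utf16-textconv, and the driver name utf16 are all made up, and it assumes the files carry a byte order mark so that iconv's generic UTF-16 decoder picks the right byte order.

```shell
# Hypothetical textconv helper: takes one file name, writes UTF-8 to stdout.
utf16_textconv() {
    iconv -f UTF-16 -t UTF-8 "$1"
}

# Possible wiring, assuming the helper is saved as an executable script
# named utf16-textconv somewhere on PATH:
#   git config diff.utf16.textconv utf16-textconv
#   echo '*.vmc diff=utf16' >> .gitattributes
```

As with the cmd.exe approach, the resulting diff is for reading only; it shows the converted text and cannot be applied as a patch.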

+7  A: 

I've been struggling with this problem for a while, and just discovered (for me) a perfect solution:

$ git config --global diff.tool vimdiff      # or merge.tool to get merging too!
$ git difftool commit1 commit2

git difftool takes the same arguments as git diff would, but runs a diff program of your choice instead of git's built-in diff. So pick a multibyte-aware diff (in my case, vim in diff mode) and just use git difftool instead of git diff.

Find "difftool" too long to type? No problem:

$ git config --global alias.dt difftool
$ git dt commit1 commit2

Git rocks.

Sam Stokes