How do I diff utf-16 files with GNU diff? | ansaurus

+4 Q:

How do I diff utf-16 files with GNU diff?

GNU diff doesn't seem to be smart enough to detect and handle UTF-16 files, which surprises me. Am I missing an obvious command-line option? Is there a good alternative?

+2 A:

From the GNU diff documentation:

Handling Multibyte and Varying-Width Characters

diff, diff3 and sdiff treat each line of input as a string of unibyte characters. This can mishandle multibyte characters in some cases. For example, when asked to ignore spaces, diff does not properly ignore a multibyte space character.

Also, diff currently assumes that each byte is one column wide, and this assumption is incorrect in some locales, e.g., locales that use UTF-8 encoding. This causes problems with the -y or --side-by-side option of diff.

These problems need to be fixed without unduly affecting the performance of the utilities in unibyte environments.

The IBM GNU/Linux Technology Center Internationalization Team has proposed some patches to support internationalized diff http://oss.software.ibm.com/developer/opensource/linux/patches/i18n/diffutils-2.7.2-i18n-0.1.patch.gz. Unfortunately, these patches are incomplete and are to an older version of diff, so more work needs to be done in this area.

I never realized that myself.

It looks like Guiffy could do the job if a non-free, non-command-line tool is acceptable. I'm still looking for a free command-line tool:

http://www.guiffy.com/Diff-Tool.html

danieltalsky 2009-04-22 17:24:32

This reflects the long tradition of UNIX tools treating characters and bytes as the same thing, which has only recently begun to break down a little. Subversion is another widely used tool that can't treat UTF-16 as text.

Joey 2009-04-22 17:27:14

A:

You could maybe build something in Python with the excellent chardet: detect each file's encoding, convert the files to UTF-8, and pass the converted copies to GNU diff.

http://chardet.feedparser.org/
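A rough sketch of that idea, for what it's worth. To stay inside the standard library it swaps chardet for a simple byte-order-mark check, so it assumes the UTF-16 files start with a BOM; files without one would need chardet or an explicitly chosen encoding. The helper names (to_utf8_bytes, diff_utf16) are made up for illustration, and it assumes a diff executable is on the PATH.

```python
import codecs
import subprocess
import tempfile
from pathlib import Path


def to_utf8_bytes(raw: bytes) -> bytes:
    """Re-encode BOM-prefixed UTF-16 data as UTF-8; return other data unchanged."""
    # Assumption: a leading UTF-16 BOM means the whole file is UTF-16.
    # (A UTF-32 LE file starts with the same two bytes and is not handled here.)
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM, picks the byte order, and drops the BOM.
        return raw.decode("utf-16").encode("utf-8")
    return raw


def diff_utf16(path_a: str, path_b: str) -> str:
    """Run `diff -u` on temporary UTF-8 copies of two files and return its output."""
    with tempfile.TemporaryDirectory() as tmp:
        copies = []
        for index, path in enumerate((path_a, path_b)):
            copy = Path(tmp) / f"{index}.utf8"
            copy.write_bytes(to_utf8_bytes(Path(path).read_bytes()))
            copies.append(str(copy))
        # diff exits with status 1 when the files differ, so check=True is not used.
        result = subprocess.run(
            ["diff", "-u", *copies],
            capture_output=True,
            text=True,
            encoding="utf-8",
        )
        return result.stdout
```

Calling diff_utf16("old.txt", "new.txt") (hypothetical file names) would return the unified diff as a string. The headers will show the temporary file names rather than the originals, which is a limitation of this sketch.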

bsergean 2009-04-30 07:07:15

I think if I were going to go to that much trouble, I'd use Perl, since I know it. :)

skiphoppy 2009-05-02 01:01:41

A:

vimdiff works quite nicely for this purpose.

I found it while reading this StackOverflow answer.

Jean Regisser 2009-11-13 11:32:04
