In a test suite, I had tests that deal with Unicode scattered across various modules. I have now consolidated them into a single test class.

The .cs source modules that no longer contain any Unicode remain Unicode-encoded and, as a result, are twice their required size. I'd like to convert them back to ASCII to save space and improve load times for these files in editors and tools.

Q1. Will this break my diffs? I currently use KDiff3 on my workstation, but I'm more interested in the historical diff record for the source modules as generated by TFS.

Q2. Is there anything else I need to be aware of w.r.t. source management when converting a module from Unicode to ASCII?

My particular situation is .NET and TFS, but I think the question might be applicable to just about any source-code-control system and programming language.

+1  A: 

Why not convert everything to UTF-8? It can represent everything UTF-16 can (UTF-16 is obviously what you mean by "Unicode"), but ASCII characters take up only one byte each, just as they do in ASCII. You also won't have to worry about some of your files being in a different encoding than others. If your diff tool decodes the files to a common encoding first, it shouldn't break your old diffs.
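As a rough illustration of that conversion, here is a minimal Python sketch that re-encodes one UTF-16 file as UTF-8 in place. The helper name `utf16_to_utf8` and the `keep_bom` parameter are made up for this example, and it assumes the file starts with a UTF-16 BOM. It works on raw bytes, so line endings are left exactly as they were.

```python
from pathlib import Path

def utf16_to_utf8(path, keep_bom=True):
    """Re-encode one UTF-16 file as UTF-8 in place (hypothetical helper).

    Returns (old_size, new_size) in bytes.
    """
    raw = Path(path).read_bytes()
    # The "utf-16" codec uses the BOM to pick the byte order and strips it.
    # Assumption: the file really does begin with a UTF-16 BOM.
    text = raw.decode("utf-16")
    # "utf-8-sig" writes a UTF-8 BOM; plain "utf-8" writes none.
    encoded = text.encode("utf-8-sig" if keep_bom else "utf-8")
    Path(path).write_bytes(encoded)
    return len(raw), len(encoded)
```

For a file that contains only ASCII characters, the new size comes out at roughly half the old one, which is the same saving the ASCII conversion would give.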

Converting UTF-16 to ASCII is a very bad idea. You say there's nothing but ASCII in those files, but if you're wrong, the non-ASCII characters will be lost. That is, unless you use something like Java's native2ascii utility, which converts non-ASCII characters to Unicode escapes (for example, Ã -> \u00C3), but that would definitely break your diffs.
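If you do go to ASCII anyway, it is worth checking the "nothing but ASCII" assumption before writing anything. The sketch below is only one way to do that; all three function names are invented for this example. The first lists every non-ASCII character with its position, the second converts strictly so that a stray character raises an error instead of being lost, and the third is a rough imitation of the native2ascii-style escaping mentioned above, not the tool itself.

```python
def non_ascii_chars(text):
    """Return (line, column, char) for every character outside ASCII."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits

def to_ascii_strict(text):
    # The default "strict" error handler raises UnicodeEncodeError
    # rather than silently dropping or replacing characters.
    return text.encode("ascii")

def escape_non_ascii(text):
    """Replace non-ASCII characters with \\uXXXX escapes (one per UTF-16 unit)."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                out.append("\\u%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)
```

Running `non_ascii_chars` over each file first gives you a list to review by hand, so nothing is lost by surprise.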

Alan Moore
+1  A: 

Odd that it got converted to UTF-16. But it is easy enough to fix from Visual Studio 2008. Use File + Save As, keep the same name, click on the arrow on the Save button and choose Save with Encoding. Click on the "Encoding" combobox and select UTF-8. That's the default encoding used by VS2008.

The resulting file has a BOM, just like your UTF-16 version had. That should be good enough for any reasonably modern diff tool, including KDiff3. They'll decode the text in the source code file back to Unicode. Test this on a couple of files to make sure.
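To show why the BOM is enough for a tool to compare the old and new versions, here is a small Python sketch of BOM sniffing. It is an illustration of the general idea, not how KDiff3 or TFS actually implement it, and the names `BOMS`, `sniff_bom` and `decode_by_bom` are made up for this example.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(raw):
    """Return a codec name based on the leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None

def decode_by_bom(raw, fallback="ascii"):
    # Each codec above consumes the BOM, so the returned text has no BOM in it.
    return raw.decode(sniff_bom(raw) or fallback)
```

Decoded this way, the UTF-16 original and the UTF-8 copy of the same file produce identical text, so a diff made on the decoded text shows no changes.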

Hans Passant