
I'm having a stupid problem. I'm reading some .cs files from disk, doing lots of regex and other operations on them with a .NET program I've made, and then writing them back to disk.

The resulting files somehow end up with the wrong encoding. What encoding are C# source files supposed to be in? And then there is the byte-order mark (BOM) at the start of the file: is that needed? Does it get written when I use File.WriteAllText()?

The program changing the files is a simple .NET application, and the code is simply

        string text = System.IO.File.ReadAllText(fn);
        string newText = Regex.Replace(text, regexStr, replaceStr);
        System.IO.File.WriteAllText(fn, newText);

The C# files contain comments and strings with characters that don't seem to be part of the standard codepage.

One of the problematic characters is "ä"
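
For anyone wondering why "ä" in particular gets mangled: in Windows-1252 it is the single byte 0xE4, which on its own is not a valid UTF-8 sequence, so a reader that assumes UTF-8 turns it into the replacement character. A minimal sketch (the class name is just for illustration; on .NET Core / .NET 5+ you would first have to register CodePagesEncodingProvider to make codepage 1252 available):

        using System;
        using System.Text;

        class WhyAeBreaks
        {
            static void Main()
            {
                // "ä" (U+00E4) is the single byte 0xE4 in Windows-1252.
                byte[] cp1252 = Encoding.GetEncoding(1252).GetBytes("\u00E4");
                Console.WriteLine(BitConverter.ToString(cp1252)); // E4

                // A lone 0xE4 is an incomplete multi-byte sequence in UTF-8,
                // so decoding it as UTF-8 yields the replacement character.
                Console.WriteLine(Encoding.UTF8.GetString(cp1252)); // �
            }
        }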

Solution:

This seems to work correctly:

        // Encoding.GetEncoding requires "using System.Text;"
        string text = System.IO.File.ReadAllText(fn, Encoding.GetEncoding(1252));
        string newText = Regex.Replace(text, regexStr, replaceStr);
        System.IO.File.WriteAllText(fn, newText, Encoding.GetEncoding(1252));
A: 

I've written a few code generators in my time and always used ASCII encoding (plain Windows text). What language are you using to do the regex operations on the .cs files?

Mauro
+1  A: 

By default the files should be encoded with the same code page that is set in the regional settings of the machine; by default this will be 'Unicode (UTF-8 with signature) - Codepage 65001'. You can use any code page you wish; for example, you could also use 'Western European (Windows) - Codepage 1252'.
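
This also bears on the byte-order mark question from the post: the two-argument File.WriteAllText overload writes no BOM, while the encoding you pass to the three-argument overload decides. A hedged sketch (the file names and sample string are made up):

        using System.IO;
        using System.Text;

        class BomDemo
        {
            static void Main()
            {
                string source = "// a comment with an \u00E4 in it";

                // No encoding argument: UTF-8 *without* a byte-order mark.
                File.WriteAllText("NoBom.cs", source);

                // 'UTF-8 with signature' (codepage 65001): Encoding.UTF8
                // writes its EF BB BF preamble at the start of the file.
                File.WriteAllText("WithBom.cs", source, Encoding.UTF8);

                // 'Western European (Windows)' (codepage 1252).
                File.WriteAllText("Cp1252.cs", source, Encoding.GetEncoding(1252));
            }
        }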

Fraser
+1  A: 

System.IO.File.ReadAllText(fn) tries to guess the encoding of the input file: it looks for a byte-order mark and otherwise assumes UTF-8. This can go horribly wrong.

Visual Studio 2008 creates files in UTF-8 by default. Similarly, you should try to use UTF-8 wherever possible, by specifying Encoding.UTF8 when writing the files to disk.
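
If the input files don't all share one encoding, a safer variant of this advice is to detect the encoding while reading and reuse it when writing. A minimal sketch, assuming the files are either BOM-marked Unicode or plain codepage-1252 text (the 1252 fallback is an assumption, not something stated in this answer):

        using System.IO;
        using System.Text;

        class RoundTrip
        {
            static void Main(string[] args)
            {
                string fn = args[0];

                string text;
                Encoding detected;
                // Detect a BOM if there is one; otherwise fall back to 1252.
                using (var reader = new StreamReader(fn,
                           Encoding.GetEncoding(1252), true))
                {
                    text = reader.ReadToEnd();
                    detected = reader.CurrentEncoding; // valid after the first read
                }

                // ... do the regex work on 'text' here ...

                // Write back with the encoding that was actually read,
                // so the bytes on disk keep their original meaning.
                File.WriteAllText(fn, text, detected);
            }
        }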

David Schmitt