The following hunk of code (snipped for brevity) generates an XML document and writes it to a file. If I open the file in Visual Studio, it appears to be in Chinese characters. If I open it in Notepad, it looks as expected. If I Console.WriteLine it, it looks correct.

I know it's related to encoding, but I thought I had all the encoding ducks in a row. What's missing?

StringBuilder stringBuilder = new StringBuilder();
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.Unicode;
settings.Indent = true; 
settings.IndentChars = "\t";
using (XmlWriter textWriter = XmlWriter.Create(new StringWriter(stringBuilder), settings))
{
    textWriter.WriteStartElement("Submission");
    textWriter.WriteAttributeString("xmlns", "xsi", null, "http://www.w3.org/2001/XMLSchema-instance");
    textWriter.WriteEndElement();
}

using (StreamWriter sw = new StreamWriter(new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None)))
{
    sw.Write(stringBuilder.ToString());
}
+2  A: 

The problem is that you're writing it to disk using UTF-8, but it will claim to be UTF-16 because that's what a StringWriter uses by default - and because you're explicitly setting it to use Encoding.Unicode as well.

The simplest way to fix this is to use a StringWriter which advertises itself as UTF-8:

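using System.IO;
using System.Text;

// Reports UTF-8 so that XmlWriter emits encoding="utf-8" in the XML declaration.
public class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
         get { return Encoding.UTF8; }
    }
}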

... and then remove the settings.Encoding = Encoding.Unicode line. That way you'll use UTF-8 throughout. (In fact, the Encoding property of XmlWriterSettings is ignored when you create the XmlWriter with a TextWriter anyway.)
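Putting those pieces together, a minimal sketch of the UTF-8 route might look like the following. The Utf8Example class and its BuildXml and Save methods are names invented for illustration; the writer subclass is the one described above, and the element written is taken from the question.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// StringWriter subclass that reports UTF-8, so XmlWriter emits encoding="utf-8".
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}

// Hypothetical helper class, named only for this example.
public static class Utf8Example
{
    // Builds the XML in memory; the declaration follows the writer's Encoding property.
    public static string BuildXml()
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;
        settings.IndentChars = "\t";

        using (StringWriter stringWriter = new Utf8StringWriter())
        {
            using (XmlWriter xmlWriter = XmlWriter.Create(stringWriter, settings))
            {
                xmlWriter.WriteStartElement("Submission");
                xmlWriter.WriteAttributeString("xmlns", "xsi", null, "http://www.w3.org/2001/XMLSchema-instance");
                xmlWriter.WriteEndElement();
            }
            return stringWriter.ToString();
        }
    }

    // Writes the XML to disk. StreamWriter defaults to UTF-8, matching the declaration.
    public static void Save(string fileName)
    {
        using (StreamWriter sw = new StreamWriter(new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None)))
        {
            sw.Write(BuildXml());
        }
    }
}
```

Calling Utf8Example.Save(fileName) should then produce a file whose declaration and bytes agree on UTF-8.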

If you really want UTF-16, then when you create the StreamWriter, specify Encoding.Unicode there too.
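For the UTF-16 route, a rough sketch could look like this, assuming the same element as in the question. Utf16Example and Save are hypothetical names; the only real change from the original code is the second argument to the StreamWriter constructor.

```csharp
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper class, named only for this example.
public static class Utf16Example
{
    public static void Save(string fileName)
    {
        StringBuilder stringBuilder = new StringBuilder();

        // A plain StringWriter reports UTF-16, so the declaration says encoding="utf-16".
        using (XmlWriter xmlWriter = XmlWriter.Create(new StringWriter(stringBuilder)))
        {
            xmlWriter.WriteStartElement("Submission");
            xmlWriter.WriteEndElement();
        }

        // Encoding.Unicode is UTF-16 little-endian, so the bytes now match the declaration.
        using (StreamWriter sw = new StreamWriter(
            new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None),
            Encoding.Unicode))
        {
            sw.Write(stringBuilder.ToString());
        }
    }
}
```

With Encoding.Unicode the StreamWriter also writes a byte order mark at the start of the file, which helps editors detect the encoding.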

Jon Skeet
So the result of the StringWriter is a 16-bit Unicode string, which is then encoded as UTF-8 and written to disk?
Aaron Digulla
Well, the result of StringWriter will be a Unicode string whatever happens, because that's what .NET uses as its string format. The problem is that the XML declaration at the start of the file will claim that it's using UTF-16, even though it's really using UTF-8.
Jon Skeet
I ended up adding Encoding.Unicode as the second parameter on the StreamWriter constructor. That seems to have done the trick. How does that differ from your approach of deriving from StringWriter?
Ralph Shillington
Basically you're now using UTF-16 everywhere, whereas I was suggesting using UTF-8 everywhere. Both will work, but your file will probably be twice as big and if you use any editors which assume ASCII, they'll get confused.
Jon Skeet
A: 

Encoding.Unicode in .NET is UTF-16 (little-endian), which writes two bytes per character for most text. For plain ASCII text, one byte of each pair is always 0.

Try UTF-8 instead. This should look the same in any editor unless you use special characters (with a code point >= 128).
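To make the byte-level difference concrete, here is a small sketch that prints the bytes each encoding produces for the start of an XML declaration. It also decodes UTF-8 bytes as if they were UTF-16, which is probably what an editor does when it trusts a utf-16 declaration on a UTF-8 file. EncodingBytesDemo and its methods are names made up for this example.

```csharp
using System;
using System.Text;

// Hypothetical demo class, named only for this example.
public static class EncodingBytesDemo
{
    // Formats bytes as hyphen-separated hex, e.g. "3C-3F".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    // Decodes UTF-8 bytes as if they were UTF-16 little-endian.
    public static string MisreadUtf8AsUtf16(string text)
    {
        return Encoding.Unicode.GetString(Encoding.UTF8.GetBytes(text));
    }

    public static void Main()
    {
        // ASCII text in UTF-8: one byte per character.
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes("<?xml")));    // 3C-3F-78-6D-6C

        // The same text in UTF-16 LE: every second byte is 0.
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes("<?xml"))); // 3C-00-3F-00-78-00-6D-00-6C-00

        // Pairs of ASCII bytes read as single UTF-16 code units land in CJK ranges.
        string misread = MisreadUtf8AsUtf16("<?xm");
        foreach (char c in misread)
        {
            Console.WriteLine(((int)c).ToString("X4"));             // 3F3C, then 6D78
        }
    }
}
```

Both U+3F3C and U+6D78 are CJK ideographs, which is consistent with the "Chinese characters" described in the question.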

Aaron Digulla