My source XML contains the copyright character as the numeric character reference &#x00A9;. When I write the XML out with this code:

var stringWriter = new StringWriter();
segmentDoc.Save(stringWriter);
Console.WriteLine(stringWriter.ToString());

it renders that copyright character as a little "c" with a circle around it. I'd like to preserve the original reference so that it gets written back out as &#x00A9;. How can I do this?

Update: I also noticed that the source declaration looks like <?xml version="1.0" encoding="utf-8"?> but my saved output looks like <?xml version="1.0" encoding="utf-16"?>. Can I indicate that I want the output to still be utf-8? Would that fix it?

Update2: Also, &#x00A0; is getting output as ÿ. I definitely don't want that happening!

Update3: &#x00A7; is becoming a little box and that is wrong, too. It should be §

+2  A: 

I strongly suspect you won't be able to do this. Fundamentally, the copyright sign is &#x00A9; - they're different representations of the same thing, and I expect that the in-memory representation normalizes this.

What are you doing with the XML afterwards? Any sane application processing the resulting XML should be fine with it.

You may be able to persuade it to use the entity reference if you explicitly encode it with ASCII... but I'm not sure.

EDIT: You can definitely make it use a different encoding. You just need a StringWriter which reports that its "native" encoding is UTF-8. Here's a simple class you can use for that:

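using System.IO;
using System.Text;

// A StringWriter that reports UTF-8, so the XML declaration says encoding="utf-8".
public class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}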
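
For completeness, here is a minimal sketch of how that class might be dropped into the code from the question. It assumes `segmentDoc` is an `XDocument` (the question doesn't say; `XmlDocument.Save` also accepts a `TextWriter`). `Utf8Demo` and `SaveAsUtf8` are names made up for this illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Same idea as the class above: a StringWriter that claims to be UTF-8.
public class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}

public static class Utf8Demo
{
    // Hypothetical helper: serializes the document through the UTF-8 writer.
    public static string SaveAsUtf8(XDocument segmentDoc)
    {
        var stringWriter = new Utf8StringWriter();
        segmentDoc.Save(stringWriter);
        return stringWriter.ToString();
    }

    public static void Main()
    {
        // Sample input, standing in for the real source file.
        var segmentDoc = XDocument.Parse("<note>&#x00A9; 2010</note>");
        Console.WriteLine(SaveAsUtf8(segmentDoc));
    }
}
```

This only changes what the declaration reports. The string is still UTF-16 in memory, and the copyright sign is still written as a literal character rather than as a reference.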

You could try changing it to use Encoding.ASCII as well and see what that does to the copyright sign...
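
If the `StringWriter` route doesn't change the characters, one thing worth trying is to let `XmlWriter` do the encoding itself by writing to a `Stream` with `XmlWriterSettings.Encoding` set to ASCII. As far as I know, characters the target encoding can't represent are then written as numeric character references. The sketch below assumes an `XDocument`; `AsciiXmlDemo` and `SaveAsAscii` are illustrative names only.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class AsciiXmlDemo
{
    // Hypothetical helper: writes the document to an in-memory stream as ASCII
    // and returns the resulting text.
    public static string SaveAsAscii(XDocument doc)
    {
        var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
        using (var stream = new MemoryStream())
        {
            // The writer must be disposed (flushed) before the bytes are read.
            using (var writer = XmlWriter.Create(stream, settings))
            {
                doc.Save(writer);
            }
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        // Sample input, standing in for the real source file.
        var doc = XDocument.Parse("<note>&#x00A9; 2010 &#x00A0;&#x00A7;</note>");
        Console.WriteLine(SaveAsAscii(doc));
    }
}
```

Two caveats: the declaration will then report an ASCII encoding rather than utf-8, and the writer chooses its own spelling for the references, so the output may not match the original `&#x00A9;` form digit for digit.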

Jon Skeet
I'm writing a tool that runs through an xml file and adds attributes that, according to a set of business rules that I have, are missing or invalid. Then, I want to spit the new xml back out. I do not have control over the system that ultimately reads the xml, so I wanted my own footprint to be the absolute minimum. I do not know if this other system is sane. I want to assume that it isn't.
Chris
Added more information beside "Update"
Chris
@Chris: Edited due to the update. I wouldn't immediately assume that the reading system is badly written unless you have reason to. Can you not try the existing output first? It really should be equivalent.
Jon Skeet
Your Utf8StringWriter fixed the declaration output. Thank you! I am still trying to find a way to preserve the special characters, though. I did 2 more updates showing some others that seem really wrong, whereas the copyright symbol seems like it could be right.
Chris
I tried ASCII and every other encoding I could think of and still had no luck.
Chris
@Chris: I suspect the problem you're seeing in terms of other characters is that you're opening the file with a text editor which is assuming a different character encoding. What happens if you open it in a genuine XML editor?
Jon Skeet
Good idea. If I open the output xml in XMLSpy, it pops up a dialog that says "Your file contains 1 character(s) that should not be present in a file using the Unicode UTF-8 encoding... The offending characters are `ÿ (0xFF)`". It is referring to what gets swapped in for that `&#x00A0;`.
Chris
@Chris: Hmm. What does that bit of the file look like in binary, out of interest?
Jon Skeet
I used the class you specified to get the declaration to be accurate. Then I did a crazy hack to preserve all special characters, using regular expressions.
Chris
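
The regular-expression code itself wasn't posted, so the following is only a guess at what such a hack could look like, not the approach actually used. It post-processes the serialized string and rewrites every non-ASCII character as a four-digit hexadecimal reference. `CharRefEscaper` and `EscapeNonAscii` are invented names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharRefEscaper
{
    // Matches a surrogate pair first (so it is escaped as one code point),
    // otherwise any single character outside the ASCII range.
    private static readonly Regex NonAscii =
        new Regex(@"[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\u0000-\u007F]");

    // Replaces each match with a reference such as &#x00A9;.
    public static string EscapeNonAscii(string xml)
    {
        return NonAscii.Replace(xml, m =>
        {
            int codePoint = m.Value.Length == 2
                ? char.ConvertToUtf32(m.Value[0], m.Value[1])
                : (int)m.Value[0];
            return string.Format("&#x{0:X4};", codePoint);
        });
    }

    public static void Main()
    {
        Console.WriteLine(EscapeNonAscii("<note>\u00A9 2010\u00A0\u00A7</note>"));
        // prints: <note>&#x00A9; 2010&#x00A0;&#x00A7;</note>
    }
}
```

Because this works on the text rather than the XML structure, it would also rewrite characters inside comments and CDATA sections, where references are not interpreted, so it is only safe if the documents don't rely on those.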
A: 

Maybe you can try a different document encoding. Check out: http://www.sagehill.net/docbookxsl/CharEncoding.html

Ivo
A: 
kbrimington
There are bunches of other ones, `¶` and ` ` to name 2 more.
Chris