Proper object disposal has been removed for brevity, but I'd be shocked if this is the simplest way to encode an object as UTF-8 in memory. There has to be an easier way, doesn't there?

var serializer = new XmlSerializer(typeof(SomeSerializableObject));

var memoryStream = new MemoryStream();
var streamWriter = new StreamWriter(memoryStream, System.Text.Encoding.UTF8);

serializer.Serialize(streamWriter, entry);

memoryStream.Seek(0, SeekOrigin.Begin);
var streamReader = new StreamReader(memoryStream, System.Text.Encoding.UTF8);
var utf8EncodedXml = streamReader.ReadToEnd();
+2  A: 

No, it isn't the simplest way: you can use a StringWriter to get rid of the intermediate MemoryStream. However, to make the XML declare UTF-8 you need to use a StringWriter which overrides the Encoding property:

public class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
         get { return Encoding.UTF8; }
    }
}

Then:

var serializer = new XmlSerializer(typeof(SomeSerializableObject));
string utf8;
using (StringWriter writer = new Utf8StringWriter())
{
    serializer.Serialize(writer, entry);
    utf8 = writer.ToString();
}

Obviously you can make Utf8StringWriter into a more general class which accepts any encoding in its constructor - but in my experience UTF-8 is by far the most commonly required "custom" encoding for a StringWriter :)
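A minimal sketch of that more general class, in case it helps. The name `EncodingStringWriter` is made up for illustration; the only thing it changes is what the `Encoding` property reports.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical generalisation of Utf8StringWriter: the caller chooses the
// encoding that the Encoding property reports. The characters are still
// collected in an ordinary .NET string, exactly as with StringWriter.
public sealed class EncodingStringWriter : StringWriter
{
    private readonly Encoding _encoding;

    public EncodingStringWriter(Encoding encoding)
    {
        if (encoding == null) throw new ArgumentNullException(nameof(encoding));
        _encoding = encoding;
    }

    public override Encoding Encoding
    {
        get { return _encoding; }
    }
}
```

Usage would be the same as before, with `new EncodingStringWriter(Encoding.UTF8)` in place of `new Utf8StringWriter()`.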

Now as Jon Hanna says, this will still be UTF-16 internally, but presumably you're going to pass it to something else at some point, to convert it into binary data... at that point you can use the above string, convert it into UTF-8 bytes, and all will be well - because the XML declaration will specify "utf-8" as the encoding.
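For that final conversion step, something along these lines should do. `XmlBytes` and `ToUtf8Bytes` are names invented for this sketch; the real work is the standard `Encoding.UTF8.GetBytes` call.

```csharp
using System.Text;

public static class XmlBytes
{
    // Converts the (internally UTF-16) string into UTF-8 octets.
    // GetBytes encodes only the characters it is given; it does not
    // prepend a byte order mark.
    public static byte[] ToUtf8Bytes(string xml)
    {
        return Encoding.UTF8.GetBytes(xml);
    }
}
```

The resulting bytes then agree with the "utf-8" named in the XML declaration.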

EDIT: A short but complete example to show this working:

using System;
using System.Text;
using System.IO;
using System.Xml.Serialization;

public class Test
{    
    public int X { get; set; }

    static void Main()
    {
        Test t = new Test();
        var serializer = new XmlSerializer(typeof(Test));
        string utf8;
        using (StringWriter writer = new Utf8StringWriter())
        {
            serializer.Serialize(writer, t);
            utf8 = writer.ToString();
        }
        Console.WriteLine(utf8);
    }


    public class Utf8StringWriter : StringWriter
    {
        public override Encoding Encoding
        {
            get { return Encoding.UTF8; }
        }
    }
}

Result:

<?xml version="1.0" encoding="utf-8"?>
<Test xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <X>0</X>
</Test>

Note the declared encoding of "utf-8" which is what we wanted, I believe.

Jon Skeet
Even when you override the Encoding property on StringWriter it still sends the written data to a StringBuilder, so it's still UTF-16. And the string can only ever be UTF-16.
Jon Hanna
@Jon: Have you tried it? I have, and it works. It's the *declared* encoding which is important here; obviously internally the string is still UTF-16, but that doesn't make any difference until it's converted to binary (which could use any encoding, including UTF-8). The `TextWriter.Encoding` property is used by the XML serializer to determine which encoding name to specify within the document itself.
Jon Skeet
I tried it and I got a string in UTF-16. Maybe that's what the querent wants.
Jon Hanna
@Jon: And what was the declared encoding? In my experience, that's what questions like this are *really* trying to do - create an XML document which declares itself to be in UTF-8. As you say, it's best not to consider the text to be in *any* encoding until you need to... but as the XML document *declares* an encoding, that's something you need to consider.
Jon Skeet
Yep, I've asked the querent to clarify. I read the question literally, but since the code he gives as an example produces a string, maybe your read on it is correct (though in that case I'd suggest not having a declaration at all, since it would then stay valid across UTF-8/UTF-16 re-encodings).
Jon Hanna
@Jon Hanna, is there a way to serialize to XML without having a declaration at all?
Garry Shutler
@Garry, simplest I can think of right now is to take the second example in my answer, but when you create the `XmlWriter` do so with the factory method that takes an `XmlWriterSettings` object, and have the `OmitXmlDeclaration` property set to `true`.
Jon Hanna
@Jon Hanna Excellent, thanks very much.
Garry Shutler
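A rough sketch of what the comment about `OmitXmlDeclaration` describes, assuming the `XmlWriter`-over-`MemoryStream` approach. The class and method names are invented for illustration, and passing `new UTF8Encoding(false)` is an extra assumption here (that no byte order mark is wanted either), not something the comment asks for.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

public static class NoDeclarationXml
{
    // Serializes value to UTF-8 bytes without the <?xml ...?> declaration.
    public static byte[] Serialize<T>(T value)
    {
        var serializer = new XmlSerializer(typeof(T));
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            // Assumption for this sketch: UTF-8 without a byte order mark.
            Encoding = new UTF8Encoding(false)
        };

        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                serializer.Serialize(writer, value);
            }
            // The writer has been disposed, so everything is flushed.
            return stream.ToArray();
        }
    }
}
```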
+1  A: 

Your code doesn't get the UTF-8 into memory, as you read it back into a string again, so it's no longer in UTF-8 but back in UTF-16 (though ideally it's best to consider strings at a higher level than any encoding, except when forced to do so).

To get the actual UTF-8 octets you could use:

var serializer = new XmlSerializer(typeof(SomeSerializableObject));

var memoryStream = new MemoryStream();
var streamWriter = new StreamWriter(memoryStream, System.Text.Encoding.UTF8);

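serializer.Serialize(streamWriter, entry);

// Flush defensively so nothing is left in the StreamWriter's buffer.
streamWriter.Flush();

// Note: Encoding.UTF8 makes the StreamWriter emit a byte order mark first.
byte[] utf8EncodedXml = memoryStream.ToArray();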

I've left out the same disposal you've left out. I slightly favour the following (with normal disposal left in):

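var serializer = new XmlSerializer(typeof(SomeSerializableObject));
byte[] utf8; // declared outside the using block so it can be used afterwards
using (var memStm = new MemoryStream())
using (var xw = XmlWriter.Create(memStm))
{
  serializer.Serialize(xw, entry);
  xw.Flush(); // make sure everything has reached the MemoryStream
  utf8 = memStm.ToArray();
}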

This is much the same amount of complexity, but it does show that at every stage there is a reasonable choice to do something else, the most pressing of which is to serialise to somewhere other than memory, such as a file, TCP/IP stream, database, etc. All in all, it's not really that verbose.
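To illustrate the "somewhere other than memory" point, here is a sketch that takes any `Stream` as the destination. `XmlOutput` and `WriteTo` are hypothetical names, and the default `XmlWriter.Create` settings are assumed to be acceptable.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Serialization;

public static class XmlOutput
{
    // Serializes value as XML straight into the given stream, which could
    // be a FileStream, a NetworkStream, a MemoryStream, and so on.
    public static void WriteTo<T>(Stream destination, T value)
    {
        var serializer = new XmlSerializer(typeof(T));
        // Disposing the XmlWriter flushes it; with default settings the
        // destination stream itself is left open for the caller to close.
        using (var writer = XmlWriter.Create(destination))
        {
            serializer.Serialize(writer, value);
        }
    }
}
```

For a file, that might be used as `using (var file = File.Create("entry.xml")) XmlOutput.WriteTo(file, entry);`, where the file name is just a placeholder.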

Jon Hanna