We're using DataContractSerializer to serialize our data to XML. Recently we found a bug with how the string "\r\n" gets saved and read back - it was turned into just "\n". Apparently, what causes this is using an XmlWriter with Indent = true set:

// public class Test { public string Line; }

var serializer = new DataContractSerializer(typeof(Test));

using (var fs = File.Open("C:/test.xml", FileMode.Create))
using (var wr = XmlWriter.Create(fs, new XmlWriterSettings() { Indent = true }))
    serializer.WriteObject(wr, new Test() { Line = "\r\n" });

Test test;
using (var fs = File.Open("C:/test.xml", FileMode.Open))
    test = (Test) serializer.ReadObject(fs);

The obvious fix is to stop indenting XML, and indeed removing the "XmlWriter.Create" line makes the Line value roundtrip correctly, whether it's "\n", "\r\n" or anything else.

However, the way DataContractSerializer writes it still doesn't seem to be entirely safe or perhaps even correct - for example, just reading the resulting file with XML Notepad and saving it again destroys both "\n" and "\r\n" values completely.

What is the correct approach here? Is using XML as a format for serializing binary data a flawed concept? Are we wrong to expect that tools like XML Notepad won't break our data? Do we need to augment each and every string field that could contain such text with some special attribute, perhaps something to force CDATA?

+3  A: 

Potentially you could use a CDATA, but I do agree with your summary that using XML for serialising binary data is just plain wrong. Can you communicate the data another way?

Noon Silk
So would you say that using DataContractSerializer and expecting to get the exact same data you saved is a bug?
romkyns
I suspect it can't be called a bug until you check to see if a CDATA section handles it. Line-breaks are an edge case, because obviously a line break on your system isn't necessarily the same as on mine, so I can forgive an implementation like this. I'd try forcing the CDATA approach.
Noon Silk
Can't find any way to tell DataContractSerializer to use CDATA...
romkyns
Alright, for the record, in the end we stopped indenting the XML. Pretty ugly; what one gets for using XML as a DATA storage format where it's clearly a DOCUMENT MARKUP format.
romkyns
+1  A: 

Why is it important to distinguish between a string containing '\r\n' and an empty string? In general, when using data contract serialization you don't care about the XML format/structure or how it stores the data as long as it "round-trips" correctly.

This is how we use it:

DataContractSerializer serializer = CreateSerializer(this.GetType());
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
using (XmlWriter writer = XmlWriter.Create(sb, settings)) // sb: presumably a StringBuilder declared elsewhere
{
   serializer.WriteObject(writer, this);
   writer.Flush();
}


internal static T Deserialize<T>(Stream stream)
{
   DataContractSerializer serializer = CreateSerializer(typeof(T));
   return (T)serializer.ReadObject(stream);
}

public static DataContractSerializer CreateSerializer(Type type)
{
   // DataContractSerializer has no parameterless constructor; pass the type through
   DataContractSerializer serializer = new DataContractSerializer(type);
   return serializer;
}

If I'm not mistaken, characters like linefeeds are not allowable characters within an XML value and would need to be either encoded or contained in a CDATA section. The data contract serializer does neither of these. Tools like XML Notepad are changing the data because they realize these aren't legal characters and remove them to create conformant XML.
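For the "encoded" option, XmlWriterSettings exposes a NewLineHandling property, and NewLineHandling.Entitize asks the writer to emit carriage returns as character references (&#xD;) instead of raw characters. The following is only a sketch under some assumptions: the Test class mirrors the one in the question but is declared with [DataContract] attributes, RoundTrip is a hypothetical helper name, and a MemoryStream stands in for C:/test.xml.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Xml;

// Same shape as the Test class in the question, with explicit contract attributes.
[DataContract]
public class Test
{
    [DataMember]
    public string Line;
}

public static class NewLineRoundTrip
{
    // Hypothetical helper: serializes a Test through an indenting XmlWriter
    // with the given newline handling, then reads it back from the same buffer.
    public static string RoundTrip(string value, NewLineHandling handling)
    {
        var serializer = new DataContractSerializer(typeof(Test));
        var settings = new XmlWriterSettings
        {
            Indent = true,
            NewLineHandling = handling
        };

        using (var ms = new MemoryStream())
        {
            // Disposing the writer flushes it; the stream stays open
            // because CloseOutput defaults to false.
            using (var writer = XmlWriter.Create(ms, settings))
            {
                serializer.WriteObject(writer, new Test { Line = value });
            }

            ms.Position = 0;
            return ((Test)serializer.ReadObject(ms)).Line;
        }
    }
}
```

Calling NewLineRoundTrip.RoundTrip("first\r\nsecond", NewLineHandling.Entitize) should hand back the string with its "\r\n" intact while the file stays indented. This only protects the data on the writing side; a tool that rewrites the file with its own writer can still change it.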

It actually shouldn't be surprising that string data can be returned differently between a binary serializer and an XML serializer. The binary serializer will serialize the exact binary representation of the data bit for bit and has no "rules" on what are legal characters, etc.

Scott Dorman
>>> "Why is it important to distinguish" - sometimes it isn't, sometimes it is. Migrating from BinaryFormatter, it was a surprise to realise that strings can now come back different to how they were saved. >>> "you don't care about the XML format/structure" - indeed; however seeing XML Notepad change our data is worrying and makes me wonder what we're doing wrong.
romkyns
@romkyns: Updated my answer to address your concerns. Overall I don't think you are doing anything "wrong" as long as your objects deserialize correctly. I still don't see why you need to distinguish between an empty line ('\r\n') and an empty string.
Scott Dorman
I understand the issue as well. If you XML-serialize a string that has CRLFs in it and send it over the wire, the recipient will get just LFs. It does not round-trip!
Cory R. King
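
The CRLF-to-LF behaviour described in the last comment matches the end-of-line handling in the XML 1.0 specification: a conforming parser turns a literal "\r\n" (or a lone "\r") in the document into "\n" before handing text to the application, while a character reference such as &#xD; is left alone. A small sketch to observe this with XmlReader; LineEndingDemo and ReadText are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class LineEndingDemo
{
    // Hypothetical helper: parses a single-element document
    // and returns that element's text content.
    public static string ReadText(string xml)
    {
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }
}
```

With this helper, LineEndingDemo.ReadText("<Line>a\r\nb</Line>") should come back as "a\nb", whereas LineEndingDemo.ReadText("<Line>a&#xD;\nb</Line>") should keep the carriage return. That difference is why the raw "\r\n" written by the indenting writer does not survive a read, but an entitized one does.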