I've got some very large XML files which I read using a System.Xml.Serialization.XmlSerializer. It's pretty fast (well, fast enough), but I want it to pool strings, as some long strings occur very many times.

The XML looks somewhat like this:

<Report>
    <Row>
        <Column name="A long column name!">hey</Column>
        <Column name="Another long column name!!">7</Column>
        <Column name="A third freaking long column name!!!">hax</Column>
        <Column name="Holy cow, can column names really be this long!?">true</Column>
    </Row>
    <Row>
        <Column name="A long column name!">yo</Column>
        <Column name="Another long column name!!">53</Column>
        <Column name="A third freaking long column name!!!">omg</Column>
        <Column name="Holy cow, can column names really be this long!?">true</Column>
    </Row>
    <!-- ... ~200k more rows go here... -->
</Report>

And the classes the XML is deserialized into look somewhat like this:

class Report 
{
    public Row[] Rows { get; set; }
}
class Row 
{
    public Column[] Columns { get; set; }
}
class Column 
{
    public string Name { get; set; }
    public string Value { get; set; }
}

When the data is imported, a new string is allocated for every column name. I can see why that is so, but according to my calculations, that means a few duplicated strings make up some ~50% of the memory used by the imported data. I'd consider it a very good trade-off to spend some extra CPU cycles to cut memory consumption in half. Is there some way to have the XmlSerializer pool strings, so that duplicates are discarded and can be reclaimed the next time a gen0 GC occurs?


Also, some final notes:

  • I can't change the XML schema. It's an exported file from a third-party vendor.

  • I know I could (theoretically) write a faster parser using an XmlReader instead, which would not only let me do my own string pooling, but also let me process data mid-import so that not all 200k rows have to be kept in RAM until I've read the entire file. Still, I'd rather not spend the time writing and debugging a custom parser. The real XML is a bit more complicated than the example, so it's quite a non-trivial task. And as mentioned above - the XmlSerializer really does perform well enough for my purposes, I'm just wondering if there is an easy way to tweak it a little.

  • I could write a string pool of my own and use it in the Column.Name setter, but I'd rather not, as (1) that means fiddling with auto-generated code, and (2) it opens the door to a slew of problems related to concurrency and memory leaks.

  • And no, by "pooling", I don't mean "interning" as that can cause memory leaks.

+1  A: 

Personally, I wouldn't hesitate to hand-crank the entities - either by assuming ownership of the generated code, or doing it manually (and getting rid of the arrays ;-p).

Re concurrency - you could perhaps have a thread-static pool? AFAIK, XmlSerializer just uses the one thread, so this should be fine. It would also allow you to throw the pool away when you're done. So then you could have something like a static pool, but per thread. Then perhaps tweak the setters:

class Column
{
    private string name, value;
    public string Name
    {
        get { return this.name; }
        set { this.name = MyPool.Get(value); }
    }
    public string Value
    {
        get { return this.value; }
        set { this.value = MyPool.Get(value); }
    }
}

where the static MyPool.Get method talks to a static field (HashSet<string>, presumably) decorated with [ThreadStatic].
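A minimal sketch of what such a pool might look like. The name MyPool.Get comes from the answer above; the body and the Clear method are assumptions made for illustration. It uses a Dictionary<string, string> rather than a HashSet<string>, because Contains on a set only reports whether an equal string exists and does not hand back the stored instance (newer framework versions add HashSet<T>.TryGetValue, which would also work).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: only the name MyPool.Get is taken from the answer.
static class MyPool
{
    // [ThreadStatic] gives each thread its own field, so no locking is needed.
    // Thread-static fields must be initialized lazily, not with an initializer.
    [ThreadStatic]
    private static Dictionary<string, string> pool;

    // Returns the pooled instance equal to s, adding s to the pool if it is new.
    public static string Get(string s)
    {
        if (s == null) return null;
        if (pool == null) pool = new Dictionary<string, string>(StringComparer.Ordinal);

        string existing;
        if (pool.TryGetValue(s, out existing)) return existing;

        pool.Add(s, s);
        return s;
    }

    // Drops the current thread's pool so the pool itself no longer keeps strings alive.
    public static void Clear()
    {
        pool = null;
    }
}
```

Calling MyPool.Clear() on the same thread once Deserialize has returned throws the pool away; the deserialized objects keep their shared string instances, while the duplicates that were never stored become garbage.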

Marc Gravell
I've considered doing it this way. It's probably an OK approach, but I was hoping for a more elegant solution though, which doesn't require me to remember that there is a string pool hidden somewhere :)
gustafc
A: 

You can use OnDeserializedAttribute to mark a method that is called after an instance has been deserialised, if you use the DataContractSerializer (as WCF does) rather than the XmlSerializer.

Alternatively, if the XML is not significantly more complex than the example, why not implement your own deserialisation via XmlReader?
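For XML shaped like the example in the question, a hand-written reader could be sketched roughly as follows. ReportReader, ReadRows and the Pool helper are hypothetical names, and the code assumes every Column element contains only text; the asker's real files are said to be more complicated, so treat this as an outline rather than a drop-in parser. Rows are yielded one at a time, and column names are pooled through a local dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical streaming reader for the Report/Row/Column layout in the question.
static class ReportReader
{
    // Yields one list of (name, value) pairs per <Row>, pooling the column names.
    public static IEnumerable<List<KeyValuePair<string, string>>> ReadRows(TextReader input)
    {
        var pool = new Dictionary<string, string>(StringComparer.Ordinal);
        using (XmlReader reader = XmlReader.Create(input))
        {
            List<KeyValuePair<string, string>> row = null;
            reader.Read();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Column" && row != null)
                {
                    string name = Pool(pool, reader.GetAttribute("name"));
                    // ReadElementContentAsString also moves past </Column>,
                    // so skip the Read() at the bottom of the loop.
                    string value = reader.ReadElementContentAsString();
                    row.Add(new KeyValuePair<string, string>(name, value));
                    continue;
                }

                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Row")
                {
                    row = new List<KeyValuePair<string, string>>();
                    if (reader.IsEmptyElement)
                    {
                        yield return row;
                        row = null;
                    }
                }
                else if (reader.NodeType == XmlNodeType.EndElement && reader.Name == "Row" && row != null)
                {
                    yield return row;
                    row = null;
                }
                reader.Read();
            }
        }
    }

    // Returns the previously seen instance equal to s, or remembers s if it is new.
    private static string Pool(Dictionary<string, string> pool, string s)
    {
        if (s == null) return null;
        string existing;
        if (pool.TryGetValue(s, out existing)) return existing;
        pool[s] = s;
        return s;
    }
}
```

Because the pool is a local variable, it goes away with the enumerator, which sidesteps the concurrency and leak concerns raised in the question.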

Richard
XmlSerializer doesn't support serialization callbacks...
Marc Gravell
See http://stackoverflow.com/questions/644964, OP checked, and they do get called.
Richard
@Marc: correct... so not sure what happened with the other question. So two different options now in place.
Richard
@Richard; no, they **really** don't. Only for DataContractSerializer.
Marc Gravell
+1  A: 

I suggest that you not pre-optimize this. Wait until it works, profile the result, then optimize based on the results of profiling. You may find there is some other optimization to make first.

John Saunders