I've got some very large XML files which I read using a System.Xml.Serialization.XmlSerializer. It's pretty fast (well, fast enough), but I want it to pool strings, as some long strings occur very many times.
The XML looks somewhat like this:
<Report>
    <Row>
        <Column name="A long column name!">hey</Column>
        <Column name="Another long column name!!">7</Column>
        <Column name="A third freaking long column name!!!">hax</Column>
        <Column name="Holy cow, can column names really be this long!?">true</Column>
    </Row>
    <Row>
        <Column name="A long column name!">yo</Column>
        <Column name="Another long column name!!">53</Column>
        <Column name="A third freaking long column name!!!">omg</Column>
        <Column name="Holy cow, can column names really be this long!?">true</Column>
    </Row>
    <!-- ... ~200k more rows go here... -->
</Report>
And the classes the XML is deserialized into look somewhat like this:
class Report
{
    public Row[] Rows { get; set; }
}

class Row
{
    public Column[] Columns { get; set; }
}

class Column
{
    public string Name { get; set; }
    public string Value { get; set; }
}
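For reference, here is a minimal sketch of how classes like these could be mapped onto the sample XML and deserialized. The attribute choices ([XmlElement], [XmlAttribute], [XmlText]) and the ReportLoader helper are assumptions made for illustration; the real auto-generated classes may well look different.

```csharp
using System.IO;
using System.Xml.Serialization;

[XmlRoot("Report")]
public class Report
{
    // Each <Row> child element becomes one array entry.
    [XmlElement("Row")]
    public Row[] Rows { get; set; }
}

public class Row
{
    // Each <Column> child element becomes one array entry.
    [XmlElement("Column")]
    public Column[] Columns { get; set; }
}

public class Column
{
    // Maps the name="..." attribute.
    [XmlAttribute("name")]
    public string Name { get; set; }

    // Maps the element's text content.
    [XmlText]
    public string Value { get; set; }
}

// Hypothetical helper, not part of the original code.
public static class ReportLoader
{
    public static Report Load(TextReader reader)
    {
        var serializer = new XmlSerializer(typeof(Report));
        return (Report)serializer.Deserialize(reader);
    }
}
```

For a file on disk, the helper could be called as ReportLoader.Load(new StreamReader(path)) inside a using block.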
When the data is imported, a new string is allocated for every column name. I can see why that is so, but according to my calculations, that means a few duplicated strings make up roughly 50% of the memory used by the imported data. I'd consider it a very good trade-off to spend some extra CPU cycles to cut memory consumption in half. Is there some way to have the XmlSerializer pool strings, so that duplicates are discarded and can be reclaimed the next time a gen0 GC occurs?
Also, some final notes:
I can't change the XML schema. It's an exported file from a third-party vendor.
I know I could (theoretically) write a faster parser using an XmlReader instead. That would not only let me do my own string pooling, but also let me process data mid-import, so that not all 200k rows have to be kept in RAM until I've read the entire file. Still, I'd rather not spend the time writing and debugging a custom parser. The real XML is a bit more complicated than the example, so it's quite a non-trivial task. And as mentioned above, the XmlSerializer really does perform well enough for my purposes; I'm just wondering if there is an easy way to tweak it a little.
I could write a string pool of my own and use it in the Column.Name setter, but I'd rather not, as (1) that means fiddling with auto-generated code, and (2) it opens the door to a slew of problems related to concurrency and memory leaks.
And no, by "pooling" I don't mean "interning", as that can cause memory leaks.
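To make the setter-based option above concrete, this is roughly what I have in mind. It is only a sketch: StringPool and CurrentPool are hypothetical names, and it assumes one pool per import, held in a [ThreadStatic] field so that parallel imports on different threads don't share state and the pool can be dropped once the import finishes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: a pool meant to live only as long as one import.
public sealed class StringPool
{
    private readonly Dictionary<string, string> _pool =
        new Dictionary<string, string>(StringComparer.Ordinal);

    // Returns the first instance seen with the same contents, so that
    // later duplicates become garbage right away.
    public string Pool(string value)
    {
        if (value == null) return null;

        string existing;
        if (_pool.TryGetValue(value, out existing)) return existing;

        _pool.Add(value, value);
        return value;
    }
}

public class Column
{
    // Assumption: set before deserializing and cleared afterwards.
    // [ThreadStatic] gives every thread its own slot.
    [ThreadStatic]
    public static StringPool CurrentPool;

    private string _name;

    public string Name
    {
        get { return _name; }
        set { _name = CurrentPool != null ? CurrentPool.Pool(value) : value; }
    }

    public string Value { get; set; }
}
```

The import would then set Column.CurrentPool = new StringPool() before calling Deserialize and reset it to null in a finally block, so the pool itself becomes collectible once the import is done. It still means editing the generated Column class, which is exactly what I'd like to avoid.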