We have an application that requires loading A LOT of configuration data at startup. The data is stored in an XML file which is currently 40 MB but will grow to 100 MB and more. This data will change during development but not between releases.

We are looking for a way to speed up the loading process for a "fixed" set of data, and one idea leads to this question:

What would be the easiest/most efficient way to convert the XML file into something that can be delivered as a binary?

For example, we could generate a static class with a lot of `new objectFromXML(param1, param2, ..., paramN)` lines in its initialization method, or we could use one object with a gigantic array containing the data. All of this can be done without too much trouble, but I suspect that there are more elegant solutions to our problem. Any comments would be highly appreciated.

A: 

Ever thought of using a Resource file for this instead of your own home-rolled XML file? This is pretty much what they're made to do.
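For illustration, a minimal sketch of reading data embedded in the assembly; the resource name "MyApp.Config.xml" is a placeholder (a strongly-typed .resx property would work similarly):

```csharp
using System.IO;
using System.Reflection;

// Sketch: read configuration data embedded as an assembly resource.
// "MyApp.Config.xml" is a placeholder; it must match the resource's
// manifest name in your project.
public static class ResourceConfig
{
    public static Stream Open()
    {
        return Assembly.GetExecutingAssembly()
                       .GetManifestResourceStream("MyApp.Config.xml");
    }
}
```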

Dave Markle
A: 

I ended up using zlib to create a compressed copy of an XML and XSD file in binary format.
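A minimal sketch of the same idea, using .NET's built-in DeflateStream (the deflate algorithm zlib implements) in place of the zlib library itself; paths are placeholders:

```csharp
using System.IO;
using System.IO.Compression;

public static class XmlCompressor
{
    // Build step: write a compressed binary copy of the XML file.
    public static void Compress(string xmlPath, string binPath)
    {
        using (var input = File.OpenRead(xmlPath))
        using (var output = File.Create(binPath))
        using (var deflate = new DeflateStream(output, CompressionMode.Compress))
        {
            input.CopyTo(deflate);
        }
    }

    // Runtime: hand the decompressing stream to the XML parser.
    public static Stream OpenCompressed(string binPath)
    {
        return new DeflateStream(File.OpenRead(binPath), CompressionMode.Decompress);
    }
}
```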

ridale
We would still have to parse the XML. If we are "going binary", I expect that creating objects at design time and "loading" them at runtime would be much faster.
Ciske
A: 

If you are looking to turn the XML into some sort of object structure, you can approach it from one of two sides. First, you could create an XSD for the XML (assuming it is mostly element-based) and then use the XSD.exe tool to generate the code to serialize/deserialize it. The second option would be to set up simple POCO objects that match the format of the XML and just use the XmlSerializer to turn the XML into the objects, as sketched below.
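A minimal sketch of the POCO route; the class and element names are placeholders, not the OP's actual schema:

```csharp
using System.IO;
using System.Xml.Serialization;

// Hypothetical POCOs mirroring the XML layout.
public class Configuration
{
    [XmlElement("Object")]
    public ConfigObject[] Objects { get; set; }
}

public class ConfigObject
{
    [XmlAttribute("name")]
    public string Name { get; set; }

    [XmlElement("Param")]
    public string[] Params { get; set; }
}

public static class XmlLoader
{
    public static Configuration Load(string path)
    {
        var serializer = new XmlSerializer(typeof(Configuration));
        using (var stream = File.OpenRead(path))
        {
            return (Configuration)serializer.Deserialize(stream);
        }
    }
}
```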

Adam Gritt
Correct me if I am wrong: your answer offers a way to easily load and process the XML at runtime. We are already doing that. I am looking for a way to distribute the result of that process in binary form.
Ciske
I guess I misunderstood what you were asking. In that case you could just serialize it back out using the binary serializer rather than the XmlSerializer, and then have the code read it in using the binary serializer.
Adam Gritt
+3  A: 

protobuf-net can be compatible with both binary (Google's efficient "protocol buffers" format) and XML at the same time, on the same class definitions*.

It can even work without any changes if your XML is element-based and includes attributes like [XmlElement(Order = 1)] (to work, it needs to be able to find a unique number per property, you see). Note that if you use inheritance ([XmlInclude]) you'll need to add additional attributes (again, to nominate a number, via the similar [ProtoInclude]).

Otherwise, you can add additional attributes, and job done; just call Serializer.Serialize.

Result: smaller, faster serialization.
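For illustration, a minimal sketch under the assumptions above: a placeholder class whose [XmlElement(Order = ...)] attributes drive both XmlSerializer and protobuf-net's Serializer API.

```csharp
using System.IO;
using System.Xml.Serialization;
using ProtoBuf;

// Placeholder type: serializes as XML via XmlSerializer and as
// binary via protobuf-net, which infers field numbers from Order.
[XmlType]
public class ConfigEntry
{
    [XmlElement(Order = 1)]
    public string Name { get; set; }

    [XmlElement(Order = 2)]
    public int Value { get; set; }
}

public static class DualFormat
{
    public static void WriteBinary(ConfigEntry entry, string path)
    {
        using (var file = File.Create(path))
        {
            Serializer.Serialize(file, entry);
        }
    }

    public static ConfigEntry ReadBinary(string path)
    {
        using (var file = File.OpenRead(path))
        {
            return Serializer.Deserialize<ConfigEntry>(file);
        }
    }
}
```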

*=and as proof, this is actually how the codegen works: compile the ".proto" DSL to binary ("protoc"), load the binary into the object model ("protobuf-net"), write it as XML (XmlSerializer), and run it through XSLT to get C#.


The alternative might be to run the XML through an XSLT into C# and compile it, but... ugly. I've done this myself when absolutely needed; it was horrible enough to break Reflector! (No, really.)

Marc Gravell
I was gonna post this but I figured you'd show up eventually :)
Jason Punyon
Did I get you correctly: protobuf-net offers a faster way of processing XML-ish data, but I would still have to do the parsing, right?
Ciske
No to both; I'm *assuming* you already have an object-model that you are mapping to the xml via `XmlSerializer` or similar. protobuf-net can use the same object model to read/write binary. So at publish, you load your xml into the object model and write as binary via protobuf-net. At runtime you load the binary into the object model via protobuf-net. I've used this exact trick very successfully on a very recent project.
Marc Gravell
+1  A: 

My first response is: WHY??? An XML file of 40 MB is already huge. Why store even more data inside it? A good way to handle this much data would be to use a database. SQL Server Express is free to install and will work much faster. If you don't want a full server, the Compact edition of SQL Server might be an option, since it basically allows XCopy deployment.

The only advantage of XML is that it's readable for both machines and humans. With a binary format you will need some additional tool to make it human-readable.

Since you're using C#, I'd just go for the SQL Server Compact edition, with a SQL script that adds plenty of logical relations and constraints to the database. An additional Entity Framework class will make the data even more accessible, and the only thing you'd need in some XML configuration file would be the connection string...
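A minimal sketch of reading from such a Compact database, assuming the System.Data.SqlServerCe provider; the file name and table are placeholders:

```csharp
using System.Data.SqlServerCe;

public static class CompactConfig
{
    public static void ReadSettings()
    {
        // "config.sdf" and the Settings table are hypothetical.
        using (var conn = new SqlCeConnection("Data Source=config.sdf"))
        {
            conn.Open();
            using (var cmd = new SqlCeCommand("SELECT Name, Value FROM Settings", conn))
            using (SqlCeDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    string name = reader.GetString(0);
                    string value = reader.GetString(1);
                    // hand the values off to the application's config model
                }
            }
        }
    }
}
```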


But if you're stuck with this XML file, using ZLIB to compress the whole file has already been suggested.
And since you're dealing with lots of small configuration files inside a bigger structure, you could, as suggested, use ZLIB to create a ZIP file that contains all those small XML structures as separate files. The filename in the ZIP file would identify the class it's for, and by reading only the specific XML file from the ZIP file, you will improve performance, since the XML parser only needs to parse a little bit. Even if you needed to read 90% of all those XML files, performance would still be good, since you're working with lots of small XML documents, where the indices are smaller and searching takes less time.
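A minimal sketch of that ZIP-of-small-chunks idea, using the ZipArchive API from System.IO.Compression (available in later .NET versions; the answer itself suggested ZLIB). Entry names are placeholders:

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml.Serialization;

public static class ChunkStore
{
    // Read one small XML chunk from the archive and deserialize it;
    // only that one document gets parsed.
    public static T LoadChunk<T>(string zipPath, string entryName)
    {
        using (ZipArchive archive = ZipFile.OpenRead(zipPath))
        {
            ZipArchiveEntry entry = archive.GetEntry(entryName);
            using (Stream stream = entry.Open())
            {
                var serializer = new XmlSerializer(typeof(T));
                return (T)serializer.Deserialize(stream);
            }
        }
    }
}
```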

Workshop Alex
We will eventually break the file down into smaller pieces, but that doesn't help with the parsing problem. The data comes from a database :) which will not be available at the customer site.
Ciske
If the data comes from a database, why not replicate it inside a smaller database type? For example, send the data from the big database to a small SQL Compact database and send the compact file to the client. (SQL Compact is file-based and can be used with XCopy deployment.)
Workshop Alex
Actually, this is my fallback strategy if everything else fails, but the configuration data consists of 'chunks' (each describing an object with services) and we need to be able to deploy each chunk individually. So we will need one (ideally binary) file for each object, which can be deployed individually. (Should I have said that in the beginning? :))
Ciske
So, basically you're not dealing with one XML file, but with an XML library of lots of smaller XML files. :-) You could have mentioned that sooner, because it means you could store those little XML files separately inside some binary (zipped) file. That way, you use the binary layer to find and read the proper XML, then parse the little XML, which goes a lot faster now... :-)
Workshop Alex
+1  A: 

The idea is to write the data in XML but transform that XML into a byte stream as a build step. You can do this by loading the XML into an in-memory object and then binary-serializing that object to a file, for example. In production, just do a binary deserialization and skip the XML altogether.
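A minimal sketch of that build step using BinaryFormatter; ConfigData is a placeholder for the real object model and must be marked [Serializable]:

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using System.Xml.Serialization;

[Serializable]
public class ConfigData
{
    public string Name;
    public string[] Values;
}

public static class BuildStep
{
    // Build time: XML in, binary blob out.
    public static void XmlToBinary(string xmlPath, string binPath)
    {
        ConfigData data;
        using (var xml = File.OpenRead(xmlPath))
        {
            data = (ConfigData)new XmlSerializer(typeof(ConfigData)).Deserialize(xml);
        }
        using (var bin = File.Create(binPath))
        {
            new BinaryFormatter().Serialize(bin, data);
        }
    }

    // Production startup: no XML parsing at all.
    public static ConfigData LoadBinary(string binPath)
    {
        using (var bin = File.OpenRead(binPath))
        {
            return (ConfigData)new BinaryFormatter().Deserialize(bin);
        }
    }
}
```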

AZ
+1  A: 

If you want to speed up the loading process, compressing the XML is not going to help you. In fact, it will hurt: instead of simply parsing the XML, your program will have to decompress it and then parse it.

You really haven't provided very much information about what you're currently doing. Are you currently loading the XML into an XmlDocument or XDocument and then processing it? If so, the simplest way to speed up the load without changing anything else is to implement a load method that uses an XmlReader, which lets you parse and deserialize the data at the same time.
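For illustration, a minimal XmlReader sketch; the element and type names are placeholders:

```csharp
using System.Collections.Generic;
using System.Xml;

public class Setting
{
    public string Name { get; set; }
    public string Value { get; set; }
}

public static class StreamingLoader
{
    // Forward-only parse: objects are built as elements stream past,
    // with no intermediate XmlDocument/XDocument held in memory.
    public static List<Setting> Load(string path)
    {
        var settings = new List<Setting>();
        using (XmlReader reader = XmlReader.Create(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Setting")
                {
                    settings.Add(new Setting
                    {
                        Name = reader.GetAttribute("name"),
                        Value = reader.GetAttribute("value")
                    });
                }
            }
        }
        return settings;
    }
}
```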

Are you using XML serialization to produce the XML? If so, you can use protocol buffers, as Marc Gravell suggested, or you can implement binary serialization. This assumes that you don't need the XML for any other purpose.

Do you actually need to deserialize all of the configuration data before your program can function? Or is it possible to use some kind of lazy loading? If you can load lazily, choosing a serialization format that lets you break the loading process into chunks that are performed only when the program needs them can speed up the apparent performance of your program (if not the actual performance).
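A minimal sketch of that lazy-loading idea using Lazy&lt;T&gt; (.NET 4+); LoadChunkFromDisk is a hypothetical stand-in for whatever deserializer is chosen:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public class LazyConfig
{
    private readonly Dictionary<string, Lazy<byte[]>> _chunks =
        new Dictionary<string, Lazy<byte[]>>();

    // Register a chunk without touching the disk yet.
    public void Register(string name, string path)
    {
        _chunks[name] = new Lazy<byte[]>(() => LoadChunkFromDisk(path));
    }

    // First access triggers the load; later accesses reuse the result.
    public byte[] Get(string name)
    {
        return _chunks[name].Value;
    }

    private static byte[] LoadChunkFromDisk(string path)
    {
        // Stand-in: real code would deserialize the chunk here.
        return File.ReadAllBytes(path);
    }
}
```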

I guess the bottom line is: there are dozens of possible approaches to a problem that's defined as "I need to load a lot of data out of an XML document at startup." Define the problem more precisely, and you'll get more useful suggestions.

Robert Rossney
Thanks for the answer despite my rather concise description of our problem. There are some very useful hints in there which we will look into.
Ciske
A: 

VTD-XML has a built-in indexing feature called VTD+XML. The basic idea is that you parse the XML into VTD records, then persist the VTD along with the XML in an index file... the next time you load the indexed XML document, you don't have to parse it, which speeds up loading significantly... see the article below.

http://www.codeproject.com/KB/XML/VTD-XML-indexing.aspx

vtd-xml-author
How does this even answer the question?
John Saunders
how stupid is your question? are you in fifth grade? it is totally relevant... it is about pre-processing for XML documents...
vtd-xml-author