ansaurus

Question

Optimizing XML in C#

Answer 1

A:

The first part of your question sounds like a schema validation would work best. If you have access to the XSD's or can create them you could use an algorithm similar to this:

    public void ValidateXmlToXsd(string xsdFilePath, string xmlFilePath)
    {
        XmlSchema schema = ValidateXsd(xsdFilePath);
        XmlDocument xmlData = new XmlDocument();
        XmlReaderSettings validationSettings = new XmlReaderSettings();

        validationSettings.Schemas.Add(schema);
        validationSettings.Schemas.Compile();
        validationSettings.ValidationFlags = XmlSchemaValidationFlags.ProcessInlineSchema;
        validationSettings.ValidationType = ValidationType.Schema;
        validationSettings.ValidationEventHandler += new ValidationEventHandler(ValidationHandler);
        XmlReader xmlFile = XmlReader.Create(xmlFilePath, validationSettings);

        xmlData.Load(xmlFile);
        xmlFile.Close();
    }

    private XmlSchema ValidateXsd(string xsdFilePath)
    {
        StreamReader schemaFile = new StreamReader(xsdFilePath);
        XmlSchema schema = XmlSchema.Read(schemaFile, new ValidationEventHandler(ValidationHandler));
        schema.Compile(new ValidationEventHandler(ValidationHandler));
        schemaFile.Close();
        schemaFile.Dispose();

        return schema;
    }

    private void ValidationHandler(object sender, ValidationEventArgs e)
    {
        throw new XmlSchemaException(e.Message);
    }

If the xml fails to validate the XmlSchemaException is thrown.

As for LINQ, I personally prefer to use XDocument whenever I can over XmlDocument. Your goal is somewhat subjective and without seeing exactly what you're doing I can't say go this way or go that way with any certainty that it would help you. You can use XPath with XDocument. I would have to say that you should use whichever suits your needs best. There's no issue with using XPath sometimes and LINQ other times. It really depends on your comfort level along with scalability and readability. What will benefit the team, so to speak.

Alexander Kahoun 2009-05-21 17:02:57

We already use XSD validation - our main bottleneck is disk access and I would like to find a way, I guess, of randomly accessing the XML so as not to load the entire file before we use it. I suspect it won't be possible without some major rework.

Jeff Yates 2009-05-21 17:04:14

Answer 2

A:

An XmlReader will use less memory than an XmlDocument because it doesn't need to load the entire XML into memory at one time.

David 2009-05-21 17:11:00

Yup, it is as it is non-cached. However, this may not be the fastest way to do it - I think a compromise is needed, if there is such a thing.

Jeff Yates 2009-05-21 17:21:27

Answer 3

+1 A:

This might sound stupid.
But, if you have simple things to query, you can use regex over xml files. (the way they do grep in unix/linux).

I apologize if it doesn't make any sense.

shahkalpesh 2009-05-21 19:15:42

Answer 4

+1 A:

With XML I only know of two ways

XMLReader -> stream the large XML data in or use the XML DOM object model and read the entire XML in at once into memory.

If the XML is big, we have XML files in 80 MB range and up, reading the XML into memory is a performance hit. There is no real way to "merge" the two ways of dealing with XML documents. Sorry.

mcauthorn 2009-05-21 19:29:44

If you're using 80MB of XML, it may be time to consider using a database as your primary store and relegating XML exclusively to data transfer. This has the added benefit of letting you "query" using "structured query language".

Greg D 2009-06-17 17:24:47

We actually are in fact doing that. The XML is the result of a query. Due to it's size we are always concerned with speed and overhead to consume that query.

mcauthorn 2009-06-17 17:37:13

I am using XmlReader to read the XML into XDocument and then Linq-to-XML to query the contents. By moving the loading into a streaming extension method, I can load multiple large documents sequentially and leverage LINQ (making sure not to repeat the load multiple times). So far, it's working quite well (29 MB of files in under 30 seconds). I skip XSD validation as we can safely assume at this point that the files are valid, which obviously gives us a speed enhancement.

Jeff Yates 2009-08-23 20:35:37

In addition, we're not using a DB because we don't yet know what structure that DB should take, nor do we have the luxury of time to get the data into the DB before querying it - hence the LINQ-to-XML approach.

Jeff Yates 2009-08-23 20:36:22

Answer 5

A:

Just a thought on the comments of JMarsch. Even if the XML generation your process is not up for discussion, have you considered a DB (or a subset of XML files acting as indexes) as an intermediary? This would obviously only be of benefit if the XML files aren't updated more that once or twice a day. I guess this would need to be weighed up against your existing caching mechanism.

I can't speak to speed, butt I prefer XDocument/LINQ because of the syntax.

Rich

kim3er 2009-05-22 11:30:53

Answer 6

+1 A:

I ran across this white paper a while ago when I was trying to stream XML: API-based XML streaming with FLWOR power and functional updates The paper tries to work with in memory XML but leverage LINQ accessing.

Maybe someone will find it interesting.

Richard Morgan 2009-05-23 16:36:55

I tried this also, there are also some people who have published working classes that fit the API described in the paper. More or less they sort of seek() to a location in the underlying stream and instanciate XLinq objects from there for ease of use. It's unfortunate that an input XStreamingElement analouge is not available.

RandomNickName42 2009-09-30 08:52:51

ansaurus

tags:

views:

answers:

Optimizing XML in C#

Background

Question

Example

Goal

related questions