I have large batches of XHTML files that are manually updated. During the review phase of the updates I would like to programmatically check the well-formedness of the files. I am currently using an XmlReader, but the time required on an average CPU is much longer than I expected.

The XHTML files range in size from 4KB to 40KB, and verifying takes several seconds per file. Checking is essential, but I would like to keep the time as short as possible, as the check is performed while files are being read into the next process step.

Is there a faster way of doing a simple XML well-formedness check? Maybe using external XML libraries?


I can confirm that validating "regular" XML content is lightning fast using the XmlReader, and, as suggested, the problem seems to be that the XHTML DTD is downloaded each time a file is validated:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Note that in addition to the DTD, corresponding .ent files (xhtml-lat1.ent, xhtml-symbol.ent, xhtml-special.ent) are also downloaded.

Ignoring the DTD completely is not really an option for XHTML, because well-formedness is closely tied to the allowed HTML entities (e.g., a &nbsp; will promptly introduce validation errors when the DTD is ignored).
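
To illustrate (a contrived snippet, not my actual check): without the DOCTYPE, the parser rejects the entity reference.

    using System;
    using System.IO;
    using System.Xml;

    // No DOCTYPE, so 'nbsp' is undeclared:
    string xhtml = "<html><body><p>&nbsp;</p></body></html>";
    try
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml)))
        {
            while (reader.Read()) { }
        }
    }
    catch (XmlException ex)
    {
        // Prints something like: Reference to undeclared entity 'nbsp'. Line 1, position ...
        Console.WriteLine(ex.Message);
    }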


The problem was solved by using a custom XmlResolver as suggested, in combination with local (embedded) copies of both the DTD and entity files.

I will post the solution here once I have cleaned up the code.
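
In the meantime, here is a minimal sketch of the idea (the class name and folder path are illustrative; my real code serves embedded resources instead of loose files). It intercepts requests for the w3.org DTD and .ent files and returns local copies:

    using System;
    using System.IO;
    using System.Xml;

    public class LocalXhtmlResolver : XmlUrlResolver
    {
        // Folder holding xhtml1-transitional.dtd plus the three .ent files.
        private readonly string _dtdFolder;

        public LocalXhtmlResolver(string dtdFolder)
        {
            _dtdFolder = dtdFolder;
        }

        public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
        {
            // Relative .ent references inside the DTD also resolve to w3.org,
            // so this one check catches both the DTD and the entity files.
            if (absoluteUri.Host == "www.w3.org")
            {
                string localPath = Path.Combine(_dtdFolder, Path.GetFileName(absoluteUri.AbsolutePath));
                return File.OpenRead(localPath);
            }
            return base.GetEntity(absoluteUri, role, ofObjectToReturn);
        }
    }

It is hooked up like this:

    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ProhibitDtd = false; // the DOCTYPE must be parsed
    settings.XmlResolver = new LocalXhtmlResolver(@"C:\dtd-cache");
    using (XmlReader reader = XmlReader.Create(path, settings))
    {
        while (reader.Read()) { } // throws XmlException if not well-formed
    }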

+4  A: 

I would expect that XmlReader with while(reader.Read()) {} would be the fastest managed approach. It certainly shouldn't take seconds to read 40KB... what input approach are you using?

Do you perhaps have some external (schema etc) entities to resolve? If so, you might be able to write a custom XmlResolver (set via XmlReaderSettings) that uses locally cached schemas rather than a remote fetch...

The following does ~300KB virtually instantly:

    // Requires the System, System.IO, System.Xml and System.Diagnostics namespaces.
    using (MemoryStream ms = new MemoryStream())
    {
        // Generate a ~300KB test document in memory.
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.CloseOutput = false;
        using (XmlWriter writer = XmlWriter.Create(ms, settings))
        {
            writer.WriteStartElement("xml");
            for (int i = 0; i < 15000; i++)
            {
                writer.WriteElementString("value", i.ToString());
            }
            writer.WriteEndElement();
        }
        Console.WriteLine(ms.Length + " bytes");
        ms.Position = 0;
        int nodes = 0;
        Stopwatch watch = Stopwatch.StartNew();
        // Read every node; XmlReader throws XmlException on malformed input.
        using (XmlReader reader = XmlReader.Create(ms))
        {
            while (reader.Read()) { nodes++; }
        }
        watch.Stop();
        Console.WriteLine("{0} nodes in {1}ms", nodes,
            watch.ElapsedMilliseconds);
    }
Marc Gravell
This is basically what I am using now. I am reading the files directly (I also tried reading them into a FileStream first, but that does not change much). I set XmlReaderSettings.ProhibitDtd to false.
barry
I will have a look and check if the problem is in the referenced DTD and namespaces.
barry
As you indicated, the problem seems to be the XHTML DTD: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
barry
+2  A: 

Create an XmlReader by passing in an XmlReaderSettings object with its ConformanceLevel set to ConformanceLevel.Document.

This will validate well-formedness.

This MSDN article should explain the details.
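
For example (a minimal sketch; the file name is a placeholder):

    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ConformanceLevel = ConformanceLevel.Document; // require a single well-formed document
    using (XmlReader reader = XmlReader.Create("page.xhtml", settings))
    {
        while (reader.Read()) { } // throws XmlException on a well-formedness error
    }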

Cerebrus
I tried adding ConformanceLevel.Document to the settings, but there was no noticeable speed increase.
barry
This is one of the fastest methods. As Marc implied, your speed problem is likely due to some other reason. You should edit your post to include the code you are using.
Cerebrus
+1  A: 

On my fairly ordinary laptop, reading a 250K XML document from start to finish with an XmlReader takes 6 milliseconds. Something else besides just parsing XML is the culprit.

Robert Rossney
Thanks, the issue is likely the suggested DTD that is being used for every check.
barry
A: 

As others mentioned, the bottleneck is most likely not the XmlReader.

Check whether you happen to be doing a lot of string concatenation without a StringBuilder.

That can really nuke your performance.
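
For example (illustrative loop counts):

    using System.Text;

    // Each += copies the whole string so far: quadratic time overall.
    string s = "";
    for (int i = 0; i < 100000; i++)
        s += i;

    // StringBuilder appends into a growable buffer: roughly linear time.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 100000; i++)
        sb.Append(i);
    string result = sb.ToString();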

Sylverdrag
The problem seems to be in the DTD being re-read every check (when using regular XML files instead of XHTML files the checks run fast as expected).
barry
A: 

Personally, I'm pretty lazy ... so I look for .NET libraries that already solve the problem. Try using the DataSet.ReadXml() method and catch the exceptions. It does a pretty amazing job of explaining XML format errors.
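
A minimal sketch of that approach (the file name is a placeholder); note that ReadXml loads the entire document rather than streaming it:

    using System;
    using System.Data;
    using System.Xml;

    DataSet ds = new DataSet();
    try
    {
        ds.ReadXml("page.xhtml"); // throws XmlException on malformed XML
        Console.WriteLine("Well-formed.");
    }
    catch (XmlException ex)
    {
        Console.WriteLine("Not well-formed: " + ex.Message);
    }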

Ron

Ron Savage