I have large batches of XHTML files that are manually updated. During the review phase of the updates I would like to programmatically check the well-formedness of the files. I am currently using an XmlReader, but the time required on an average CPU is much longer than I expected.

The XHTML files range in size from 4KB to 40KB, and verifying takes several seconds per file. Checking is essential, but I would like to keep the time as short as possible, as the check is performed while files are being read into the next process step.

Is there a faster way of doing a simple XML well-formedness check? Maybe using external XML libraries?


I can confirm that validating "regular" XML content is lightning fast using the XmlReader, and, as suggested, the problem seems to be that the XHTML DTD is downloaded each time a file is validated:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Note that in addition to the DTD, corresponding .ent files (xhtml-lat1.ent, xhtml-symbol.ent, xhtml-special.ent) are also downloaded.

Ignoring the DTD completely is not really an option for XHTML, because well-formedness is closely tied to the allowed HTML entities (e.g., a &nbsp; will promptly introduce validation errors when the DTD is ignored).
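
To illustrate (a contrived snippet, not my actual check): without the DOCTYPE, the parser rejects the entity reference.

    using System;
    using System.IO;
    using System.Xml;

    // No DOCTYPE, so 'nbsp' is undeclared:
    string xhtml = "<html><body><p>&nbsp;</p></body></html>";
    try
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml)))
        {
            while (reader.Read()) { }
        }
    }
    catch (XmlException ex)
    {
        // Prints something like: Reference to undeclared entity 'nbsp'. Line 1, position ...
        Console.WriteLine(ex.Message);
    }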


The problem was solved by using a custom XmlResolver as suggested, in combination with local (embedded) copies of both the DTD and entity files.

I will post the solution here once I have cleaned up the code.
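
In the meantime, here is a minimal sketch of the idea (the class name and folder path are illustrative; my real code serves embedded resources instead of loose files). It intercepts requests for the w3.org DTD and .ent files and returns local copies:

    using System;
    using System.IO;
    using System.Xml;

    public class LocalXhtmlResolver : XmlUrlResolver
    {
        // Folder holding xhtml1-transitional.dtd plus the three .ent files.
        private readonly string _dtdFolder;

        public LocalXhtmlResolver(string dtdFolder)
        {
            _dtdFolder = dtdFolder;
        }

        public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
        {
            // Relative .ent references inside the DTD also resolve to w3.org,
            // so this one check catches both the DTD and the entity files.
            if (absoluteUri.Host == "www.w3.org")
            {
                string localPath = Path.Combine(_dtdFolder, Path.GetFileName(absoluteUri.AbsolutePath));
                return File.OpenRead(localPath);
            }
            return base.GetEntity(absoluteUri, role, ofObjectToReturn);
        }
    }

It is hooked up like this:

    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ProhibitDtd = false; // the DOCTYPE must be parsed
    settings.XmlResolver = new LocalXhtmlResolver(@"C:\dtd-cache");
    using (XmlReader reader = XmlReader.Create(path, settings))
    {
        while (reader.Read()) { } // throws XmlException if not well-formed
    }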

+4  A: 

I would expect that XmlReader with while(reader.Read()) {} would be the fastest managed approach. It certainly shouldn't take seconds to read 40KB... what input approach are you using?

Do you perhaps have some external (schema etc) entities to resolve? If so, you might be able to write a custom XmlResolver (set via XmlReaderSettings) that uses locally cached schemas rather than a remote fetch...

The following does ~300KB virtually instantly:

    // Requires the System, System.IO, System.Xml and System.Diagnostics namespaces.
    using (MemoryStream ms = new MemoryStream())
    {
        // Generate a ~300KB test document in memory.
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.CloseOutput = false;
        using (XmlWriter writer = XmlWriter.Create(ms, settings))
        {
            writer.WriteStartElement("xml");
            for (int i = 0; i < 15000; i++)
            {
                writer.WriteElementString("value", i.ToString());
            }
            writer.WriteEndElement();
        }
        Console.WriteLine(ms.Length + " bytes");
        ms.Position = 0;
        int nodes = 0;
        Stopwatch watch = Stopwatch.StartNew();
        // Read every node; XmlReader throws XmlException on malformed input.
        using (XmlReader reader = XmlReader.Create(ms))
        {
            while (reader.Read()) { nodes++; }
        }
        watch.Stop();
        Console.WriteLine("{0} nodes in {1}ms", nodes,
            watch.ElapsedMilliseconds);
    }
Marc Gravell
This is basically what I am using now. I am reading the files directly (I also tried reading them into a FileStream first, but that does not change much). I set XmlReaderSettings.ProhibitDtd to false.
barry
I will have a look and check if the problem is in the referenced DTD and namespaces.
barry
As you indicated, the problem seems to be the XHTML DTD: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
barry
+2  A: 

Create an XmlReader by passing in an XmlReaderSettings object with its ConformanceLevel set to ConformanceLevel.Document.

This will validate well-formedness.

This MSDN article should explain the details.
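
For example (a minimal sketch; the file name is a placeholder):

    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ConformanceLevel = ConformanceLevel.Document; // require a single well-formed document
    using (XmlReader reader = XmlReader.Create("page.xhtml", settings))
    {
        while (reader.Read()) { } // throws XmlException on a well-formedness error
    }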

Cerebrus
I tried adding ConformanceLevel.Document to the settings, but there was no noticeable speed increase.
barry
This is one of the fastest methods. As Marc implied, your speed problem is likely due to some other reason. You should edit your post to include the code you are using.
Cerebrus
+1  A: 

On my fairly ordinary laptop, reading a 250K XML document from start to finish with an XmlReader takes 6 milliseconds. Something else besides just parsing XML is the culprit.

Robert Rossney
Thanks, the issue is likely the suggested DTD that is being used for every check.
barry
A: 

As others mentioned, the bottleneck is most likely not the XmlReader.

Check whether you happen to be doing a lot of string concatenation without a StringBuilder.

That can really nuke your performance.
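
For example (illustrative loop counts):

    using System.Text;

    // Each += copies the whole string so far: quadratic time overall.
    string s = "";
    for (int i = 0; i < 100000; i++)
        s += i;

    // StringBuilder appends into a growable buffer: roughly linear time.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 100000; i++)
        sb.Append(i);
    string result = sb.ToString();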

Sylverdrag
The problem seems to be in the DTD being re-read every check (when using regular XML files instead of XHTML files the checks run fast as expected).
barry
A: 

Personally, I'm pretty lazy ... so I look for .NET libraries that already solve the problem. Try using the DataSet.ReadXml() method and catch the exceptions. It does a pretty amazing job of explaining XML format errors.
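
A minimal sketch of that approach (the file name is a placeholder); note that ReadXml loads the entire document rather than streaming it:

    using System;
    using System.Data;
    using System.Xml;

    DataSet ds = new DataSet();
    try
    {
        ds.ReadXml("page.xhtml"); // throws XmlException on malformed XML
        Console.WriteLine("Well-formed.");
    }
    catch (XmlException ex)
    {
        Console.WriteLine("Not well-formed: " + ex.Message);
    }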

Ron

Ron Savage