I have large batches of XHTML files that are manually updated. During the review phase of the updates I would like to programmatically check the well-formedness of the files. I am currently using an XmlReader, but the time required on an average CPU is much longer than I expected.
The XHTML files range in size from 4 KB to 40 KB, and verifying takes several seconds per file. Checking is essential, but I would like to keep the time as short as possible because the check is performed while files are being read into the next process step.
Is there a faster way of doing a simple XML well-formedness check? Maybe using external XML libraries?
I can confirm that validating "regular" XML-based content is lightning fast using the XmlReader, and, as suggested, the problem seems to be caused by the XHTML DTD being read each time a file is validated. The files declare this DOCTYPE:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Note that in addition to the DTD, corresponding .ent files (xhtml-lat1.ent, xhtml-symbol.ent, xhtml-special.ent) are also downloaded.
Ignoring the DTD completely is not really an option for XHTML, because well-formedness is closely linked to the allowed HTML entities (e.g., an entity reference such as &nbsp; will promptly introduce validation errors when we ignore the DTD).
The problem was solved by using a custom XmlResolver as suggested, in combination with local (embedded) copies of both the DTD and entity files.
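For readers who want a starting point, here is a minimal sketch of that approach. It is not the cleaned-up code mentioned in this post: the class names LocalXhtmlResolver and XhtmlChecker, and the idea of handing the resolver a dictionary of file name to bytes, are assumptions made for illustration. It also assumes .NET 4 or later, where XmlReaderSettings.DtdProcessing is available (older versions use ProhibitDtd = false instead).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical resolver: serves the XHTML DTD and .ent files from memory
// instead of downloading them from www.w3.org for every document.
public sealed class LocalXhtmlResolver : XmlUrlResolver
{
    private readonly Dictionary<string, byte[]> _files;

    // 'files' maps a file name such as "xhtml1-transitional.dtd" to its bytes,
    // loaded once (for example from embedded resources or from disk).
    public LocalXhtmlResolver(IDictionary<string, byte[]> files)
    {
        _files = new Dictionary<string, byte[]>(files, StringComparer.OrdinalIgnoreCase);
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        // Match on the last path segment only, so the DTD URL and the
        // relative .ent references both map to the local copies.
        string name = Path.GetFileName(absoluteUri.AbsolutePath);
        byte[] data;
        if (_files.TryGetValue(name, out data))
            return new MemoryStream(data, false);
        return null; // unknown resource: do not fall back to the network
    }
}

public static class XhtmlChecker
{
    // Returns true if the input parses completely; 'error' holds the parser
    // message otherwise.
    public static bool IsWellFormed(TextReader input, XmlResolver resolver, out string error)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,   // read the DTD so entities are declared
            ValidationType = ValidationType.None,  // well-formedness only, no DTD validation
            XmlResolver = resolver
        };
        try
        {
            using (XmlReader reader = XmlReader.Create(input, settings))
            {
                while (reader.Read()) { } // reading to the end surfaces any error
            }
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

The resolver instance can be created once and reused for the whole batch, so the DTD and the three .ent files are read from memory rather than fetched per file.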
I will post the solution here once I have cleaned up the code.