Which XML validation tools can you recommend for both performance and accuracy, each of which is a critical issue on our system? We have the following requirements:

  • It is not xmllint (see below)
  • Supports RelaxNG
  • Can easily integrate with Perl (this is optional, but it would be nice)

Why not xmllint? (This is background and you can skip it if you like)

We have a large Perl system which uses RelaxNG to validate our XML. We use the compact RelaxNG format and trang to convert it to the standard RelaxNG format. Then we do the actual validation via xmllint.

That's when the problems kick in. xmllint routinely misreports validation errors. It doesn't give false positives or negatives, but if the document fails to validate, xmllint will often report the wrong element or attribute for a given error. Sometimes the error is correct ("did not expect to see element 'bar'"), but only because a previous error was not reported ('bar' was supposed to follow the required but missing element 'foo', but xmllint doesn't tell us that bit). Note that this is a long-standing problem with xmllint, and even the latest version has it. We often have huge XML documents, and misreporting the errors causes much grief for both clients and developers.

+3  A: 

I suspect xmllint uses the same underlying libraries (libxml2, etc) as anything else. It is counterintuitive to think that another front-end to the same library would give different results.

JDrago
+10  A: 

I think that JDrago has the right idea, that you need to avoid libxml2-based tools for RNG validation, at least for now. I'm discovering this as well in my project. I recently logged two bugs against libxml2 concerning RNG validation.

I recommend jing. It was written by James Clark, the creator of Relax NG and one of the leading lights in the XML world. He is also the author of trang, which you are already using. Development of this code (and of trang) recently resumed at the Google Code site I link to above.

Jing has proved consistently correct with our content and schema, and it gives much better error messages than libxml2, though there is still a lot of room for improvement in that regard.

The one shortcoming of jing vis a vis libxml2/xmllint is that it doesn't at present use OASIS XML catalogs to resolve public and system identifiers and URIs pointing to schemas. This would be an issue in case you have included schemas that are referred to by 'http' URI--those would always be fetched over the network.
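Since the question asks about calling the validator from a script, here is a minimal sketch of wrapping jing's command line, written in Python for illustration (the same approach would work from Perl with backticks or system). The function names, the `jing.jar` location, and the assumed diagnostic layout `<file>:<line>:<column>: error: <message>` are my assumptions, so check them against the jing build you actually install. The `-c` switch tells jing that the schema is in compact syntax.

```python
import re
import subprocess
from typing import List, NamedTuple


class ValidationError(NamedTuple):
    path: str
    line: int
    column: int
    message: str


# Assumed diagnostic layout: "<file>:<line>:<column>: error: <message>"
# ("fatal" is accepted too, for documents that are not well-formed).
_DIAGNOSTIC = re.compile(
    r"^(?P<path>.+?):(?P<line>\d+):(?P<column>\d+): (?:error|fatal): (?P<message>.*)$"
)


def jing_command(schema: str, document: str, compact: bool = False,
                 jar: str = "jing.jar") -> List[str]:
    """Build the argument list; `jar` is a hypothetical install location."""
    cmd = ["java", "-jar", jar]
    if compact:
        cmd.append("-c")  # schema is in RELAX NG compact syntax
    return cmd + [schema, document]


def parse_jing_output(text: str) -> List[ValidationError]:
    """Turn diagnostic lines into structured records, skipping other lines."""
    errors = []
    for raw in text.splitlines():
        match = _DIAGNOSTIC.match(raw.strip())
        if match:
            errors.append(ValidationError(
                match["path"], int(match["line"]),
                int(match["column"]), match["message"]))
    return errors


def validate(schema: str, document: str, compact: bool = False) -> List[ValidationError]:
    """Run jing and return its diagnostics; an empty list means no errors found."""
    result = subprocess.run(jing_command(schema, document, compact),
                            capture_output=True, text=True)
    # Parse both streams rather than assume which one carries the diagnostics.
    return parse_jing_output(result.stdout + result.stderr)
```

Calling `validate("schema.rnc", "doc.xml", compact=True)` would validate directly against the compact schema, which skips the separate trang conversion step.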

ChuckB
+1  A: 

rnv is very fast, free (as in free speech) and runs on the command line (so Perl can invoke it easily). Most of the time, the messages are OK. Unfortunately, it no longer seems to be maintained.

bortzmeyer
+1  A: 

I am the author of RNV. It is maintained on sourceforge.net, and there is a maintainer who takes care of both the SourceForge and Debian package builds. The code has not changed because it is stable. There are no bugs reported.

That's clearly not true. For instance, the xsd:anyURI bug I reported in February 2006 was completely ignored. The bug is that URIs containing @ or , such as http://www.lemonde.fr/web/article/0,1-0@2-651865,36-735912@51-722775,0.html are wrongly rejected. See http://www.w3.org/TR/xmlschema-2/#anyURI
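To make the point concrete, here is a small sketch that checks the path of that URI against the RFC 3986 path grammar, where both "@" and "," are permitted characters. This is only a stand-in: it is neither the XML Schema definition of anyURI nor RNV's code, and the helper name `path_is_valid` is made up for the example.

```python
import re
from urllib.parse import urlsplit

# RFC 3986: pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
# sub-delims include "," so both characters in question are legal in a path.
_PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"

# path-abempty = *( "/" segment ), where a segment is zero or more pchar
_PATH = re.compile(rf"(?:/{_PCHAR}*)*")


def path_is_valid(uri: str) -> bool:
    """Return True if the path component uses only characters RFC 3986 allows."""
    return _PATH.fullmatch(urlsplit(uri).path) is not None


URI = "http://www.lemonde.fr/web/article/0,1-0@2-651865,36-735912@51-722775,0.html"

if __name__ == "__main__":
    print(path_is_valid(URI))
```

A validator that rejects this URI is therefore stricter than the generic URI syntax requires.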
bortzmeyer
+1  A: 

Hamcrest Schema allows you to validate XML documents against RelaxNG using Hamcrest Matchers.

Wilfred Springer