According to the official lxml documentation, to validate an XML document against an XML Schema document, one has to:

  1. construct the XMLSchema object (basically, parse the schema document)
  2. construct the XMLParser, passing the XMLSchema object as its schema argument
  3. parse the actual xml document (instance document) using the constructed parser
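
As a minimal sketch of the three steps above: the one-element schema and the instance documents here are made up for illustration and inlined so the snippet is self-contained; in practice both would more likely come from files.

```python
from lxml import etree

# Hypothetical one-element schema, inlined only to keep the sketch self-contained.
XSD = b"""\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="a" type="xs:integer"/>
</xs:schema>"""

# 1. Construct the XMLSchema object from the parsed schema document.
schema = etree.XMLSchema(etree.fromstring(XSD))

# 2. Construct the parser, passing the schema as its `schema` argument.
parser = etree.XMLParser(schema=schema)

# 3. Parse the instance document; validation happens as part of parsing.
root = etree.fromstring(b"<a>5</a>", parser)

# An instance that violates the schema is rejected with XMLSyntaxError.
try:
    etree.fromstring(b"<a>not a number</a>", parser)
    invalid_rejected = False
except etree.XMLSyntaxError:
    invalid_rejected = True
```

Note that nothing in the instance document itself decides which schema is applied; the choice is made entirely by the code that builds the parser.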

There can be variations, but the essence is pretty much the same no matter how you do it: the schema is specified 'externally' (as opposed to specifying it inside the actual XML document).

If you follow this procedure, validation does occur, but if I understand it correctly, it completely ignores the schemaLocation and noNamespaceSchemaLocation attributes from the xsi namespace.

This introduces a whole bunch of limitations. You have to manage the instance-to-schema relation all by yourself (either store it externally or write some hack to retrieve the schema location from the root element of the instance document), you cannot validate the document against multiple schemata (say, when each schema governs its own namespace), and so on.
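
The "hack" for retrieving the schema location from the root element could look roughly like the sketch below. The helper name `schema_hints` and the sample document are hypothetical; the code only reads the xsi attributes as plain attributes and does not make lxml act on them.

```python
from lxml import etree

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

def schema_hints(root):
    """Return ({namespace: location}, no_namespace_location) declared on `root`.

    Hypothetical helper: it only reads the attributes, it validates nothing.
    """
    # xsi:schemaLocation is a whitespace-separated list of namespace/location pairs.
    tokens = root.get(f"{{{XSI_NS}}}schemaLocation", "").split()
    if len(tokens) % 2:
        raise ValueError("xsi:schemaLocation must contain namespace/location pairs")
    pairs = dict(zip(tokens[0::2], tokens[1::2]))
    # xsi:noNamespaceSchemaLocation holds a single location, or is absent (None).
    return pairs, root.get(f"{{{XSI_NS}}}noNamespaceSchemaLocation")

# Made-up instance document declaring two schema locations on its root.
DOC = b"""\
<r xmlns="urn:schema1"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="urn:schema1 schema1.xsd urn:schema2 schema2.xsd"/>"""

hints, no_ns = schema_hints(etree.fromstring(DOC))
```

The locations found this way would still have to be resolved, loaded and handed to a parser by the application, which is exactly the bookkeeping described above.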

So the question is: maybe I am missing something completely trivial or doing it wrong? Or are my statements about lxml's limitations regarding schema validation true?

To recap, I'd like to be able to:

  • have the parser use the schema location declarations in the instance document at parse/validation time
  • use multiple schemata to validate an XML document
  • declare schema locations on non-root elements (not of extreme importance)

Maybe I should look for a different library? That'd be a real shame, though: lxml is the de facto XML processing library for Python and is regarded by everyone as the best one in terms of performance/features/convenience (and rightfully so, to a certain extent).

+1  A: 

Caution: this is not the full answer to this, because I don't know all that much about lxml in particular.

I can just tell you that:

  • Ignoring schemaLocation attributes in documents and instead managing a namespace -> schema file mapping in the application is almost always better, unless you can guarantee that the schema will be in a very specific location relative to the file. If you want to move the mapping out of code, use a catalogue or come up with a configuration file.
  • If you do want to use schemaLocation, and want to validate against multiple schemas, just include them all in one schemaLocation attribute, separated by spaces, in namespace URI/location pairs: xsi:schemaLocation="urn:schema1 schema1.xsd urn:schema2 schema2.xsd".
  • Finally, I don't think any processor will find schemaLocation attributes declared on non-root elements. Not that it matters: just put them all on the root.
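
To illustrate the first point, here is a rough sketch of an application-side namespace -> schema mapping, written with lxml since that is what the question is about. The names `make_schema`, `SCHEMAS` and `validate` are hypothetical, and the tiny inline schemas stand in for real schema files. It picks the schema from the namespace of the root element only, so it does not cover documents that mix several namespaces.

```python
from lxml import etree

def make_schema(namespace, element):
    """Build a tiny stand-in schema declaring one integer element (illustration only)."""
    xsd = (
        '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" '
        f'targetNamespace="{namespace}" elementFormDefault="qualified">'
        f'<xs:element name="{element}" type="xs:integer"/>'
        "</xs:schema>"
    )
    return etree.XMLSchema(etree.fromstring(xsd))

# The application, not the instance document, decides which schema governs a namespace.
SCHEMAS = {
    "urn:schema1": make_schema("urn:schema1", "a"),
    "urn:schema2": make_schema("urn:schema2", "b"),
}

def validate(doc_bytes):
    """Validate a document against the schema registered for its root namespace."""
    root = etree.fromstring(doc_bytes)
    namespace = etree.QName(root).namespace
    schema = SCHEMAS.get(namespace)
    if schema is None:
        raise LookupError(f"no schema registered for namespace {namespace!r}")
    # XMLSchema.validate returns True or False instead of raising.
    return schema.validate(root)

ok = validate(b'<a xmlns="urn:schema1">5</a>')
bad_value = validate(b'<b xmlns="urn:schema2">x</b>')
wrong_element = validate(b'<a xmlns="urn:schema2">5</a>')
```

The dictionary could just as well be filled from a configuration file, which keeps the mapping out of the code as suggested above.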
xcut
Your answer most definitely makes sense (+1). While I am quite familiar with the theory behind XML Schema, so to speak, I haven't met the more complex use cases (such as the ones I've described in the question) in practice, so I lack knowledge of the best practices. If your first point is right (I'll do some more research to be safe), then I'll have to look for ways of organizing/cataloguing the namespace -> schema relations and alter the logic to fit this new workflow. Thanks for the information!
shylent
Oh well, I suppose I won't get much advice on this on SO after all. Thanks again for the info.
shylent
I saw you put a bounty on. Sorry nobody managed to get on top of it, I certainly didn't deserve it for the answer I gave before!
xcut
@xcut Oh, it is ok :) I mean, you've made the effort, and I understand that my question probably doesn't have a clear answer. I really appreciate your input, though, as it certainly gave me food for thought. Thanks again!
shylent