High bounty for the following Q:

Hello, here is what I tried on Ubuntu 9.10 using Python 2.6 and Amara2 (by the way, test.xsd was created using the xml2xsd tool):

g@spot:~$ cat test.xml; echo =====o=====; cat test.xsd; echo =====o=====; cat test.py; echo =====o=====; ./test.py; echo =====o=====
<?xml version="1.0" encoding="utf-8"?>
<test>abcde</test>
=====o===== 
<?xml version="1.0" encoding="UTF-8"?> 
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
elementFormDefault="qualified"> 
  <xs:element name="test" type="xs:NCName"/> 
</xs:schema> 
=====o===== 
#!/usr/bin/python2.6 
# I wish to validate an xml file against an external XSD schema. 
from amara import bindery, parse 
source = 'test.xml' 
schema = 'test.xsd' 
#help(bindery.parse) 
#doc = bindery.parse(source, uri=schema, validate=True) # These 2 seem to fail in the same way.
doc = parse(source, uri=schema, validate=True) # So, what is the difference anyway?
# 
=====o===== 
Traceback (most recent call last): 
  File "./test.py", line 14, in <module> 
    doc = parse(source, uri=schema, validate=True) 
  File "/usr/local/lib/python2.6/dist-packages/Amara-2.0a4-py2.6-linux-x86_64.egg/amara/tree.py", line 50, in parse
    return _parse(inputsource(obj, uri), flags, entity_factory=entity_factory)
amara.ReaderError: In file:///home/g/test.xml, line 2, column 0: Missing document type declaration
g@spot:~$ 
=====o===== 

So, why am I seeing this error? Is this functionality not supported? How can I validate an XML file against an XSD while having the flexibility to point to any XSD file? Thanks, and let me know if you have questions.

+2  A: 

If you're open to using another library besides amara, try lxml. It supports what you're trying to do pretty easily:

from lxml import etree

source_file = 'test.xml'
schema_file = 'test.xsd'

# Build a validating parser from the external XSD file.
with open(schema_file) as f_schema:
    schema_doc = etree.parse(f_schema)
    schema = etree.XMLSchema(schema_doc)
    parser = etree.XMLParser(schema=schema)

    with open(source_file) as f_source:
        try:
            doc = etree.parse(f_source, parser)
        except etree.XMLSyntaxError as e:
            # this exception is thrown on schema validation error
            print(e)
ma3
Thanks, I might switch over - amara as is is a hassle. How can I do something similar to `for q in doc.quotes.quote: # The loop will pick up both q elements` from http://wiki.xml3k.org/Amara2/Tutorial ? I was initially sold on the auto-binding, because it is supposedly the pythonic way. But, my discomfort with amara is growing ...
Hamish Grubijan
That's a separate question from the original. (But the way I'd do it is with xpath... `for q in doc.xpath('quotes/quote'): ...`) With lxml you can do pretty much any xml/xsl/xpath/xsd task you'd need.
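A rough sketch of that loop, assuming a document shaped like the one in the Amara tutorial (the sample XML and the `skill` attribute here are made up for illustration). I use an absolute path in this sketch so the result does not depend on which node the expression is evaluated from:

```python
from io import BytesIO
from lxml import etree

# Made-up sample data standing in for the tutorial's quotes document.
XML = b"""<quotes>
  <quote skill="1">first</quote>
  <quote skill="2">second</quote>
</quotes>"""

tree = etree.parse(BytesIO(XML))

# xpath() returns a list of matching elements, so a plain for loop works.
quote_texts = []
for q in tree.xpath('/quotes/quote'):
    print(q.get('skill'), q.text)
    quote_texts.append(q.text)
```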
ma3
+1  A: 

I recommend using the noNamespaceSchemaLocation attribute to bind the XML file to the XSD schema. Your XML file test.xml will then be

<?xml version="1.0" encoding="utf-8"?>
<test xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="test.xsd">abcde</test>

where the file test.xsd

<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">
    <xs:element name="test" type="xs:NCName"/>
</xs:schema>

should be placed in the same directory as test.xml. Referencing the XML schema from the XML file is a general technique, and it should work in Python.

The advantage is that you don't need to know the schema file for every XML file: the schema location travels with the document, so a validating tool that honors the hint can find it when parsing the XML file.

Oleg
But he wants the flexibility to point to any XSD, not just the one given in the xml file (if any).
ma3
@ma3204: If somebody writes an XML document, he writes it to correspond to one schema. You should not try to interpret the document under another schema. XML is a metalanguage; an XSD defines a specific language. If you have a text written in one language, you should not try to interpret it as a text in another language. So only the person **who writes** an XML document can specify the XSD for it.
Oleg
I upvoted, but my use case is different. The XML is auto-generated daily (for testing), but the schema is fixed precisely because the auto-generator can screw up.
Hamish Grubijan
@Hamish Grubijan: OK, but if you generate an XML file, you are trying to follow a schema. Why not include a reference to the schema? The value of `xsi:noNamespaceSchemaLocation` is a path to the XML schema. So you do not have to write the XSD file for every test; you can write only a reference to the existing XSD file whose schema you want to follow. But it is just a suggestion. I have followed this rule for some years and can only recommend it. Mostly I use schemas with a namespace, so I use the `xsi:schemaLocation` attribute instead of `xsi:noNamespaceSchemaLocation`.
Oleg
@Hamish Grubijan: Including information about the schema in the XML file is similar to including the `encoding` attribute in the `<?xml>` declaration: you can create an XML file without it, but it is better to use it. It makes some things clearer.
Oleg
@Oleg, this is no doubt what should be done in most cases. In my case the producers of the xml and xsd will likely reside on different computers, and one computer should be easily replaced with another. Also, I might need to keep the xml and/or xsd files behind a password-protected sftp, or some other means. Is this over-engineered? Perhaps. But, for what I am trying to do, I see the restriction being discussed as unnecessarily rigid. I believe that this answer suggests a good way to go about things and will help many people who read this question later, but right here and now I have something else in mind.
Hamish Grubijan
@Hamish Grubijan: I understand your arguments and agree that for distributed systems the use of `xsi:schemaLocation` or `xsi:noNamespaceSchemaLocation` makes little or no sense. Now I understand what you meant before.
Oleg