Try constructing the parser with no_network=False. As stated in the documentation:
no_network - prevent network access when looking up external documents (on by default)
lxml should retrieve imported DTD modules, but it cannot do so when network access is not allowed. This restriction does not apply to the document itself, only to the external documents it references. In fact, I would expect you to get errors loading the DTD itself, so I assume the document refers to a locally available copy of that DTD, and that it is only the DTD itself that references a remote resource?
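A minimal sketch of such a parser, assuming you validate at parse time with dtd_validation=True. The inline document and its internal DTD subset are made up for illustration and need no network at all; in your case you would pass the parser to etree.parse() with your own file instead.

```python
from lxml import etree

# no_network defaults to True; switching it off lets libxml2 fetch
# remotely referenced DTD modules while the document is parsed.
parser = etree.XMLParser(dtd_validation=True, no_network=False)

# Hypothetical document with an internal DTD subset, used only so the
# sketch runs without touching the network.
xml = b"""<!DOCTYPE note [
<!ELEMENT note (#PCDATA)>
]>
<note>hi</note>"""

root = etree.fromstring(xml, parser)

# With a real file (hypothetical name) this would be:
# tree = etree.parse("document.xml", parser)
```

Keep in mind that allowing network access means every parse may hit the remote server, which is one reason to prefer a local copy.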
You could also use a catalog to point to locally available copies (this not only circumvents the problem, but is also faster and friendlier towards the W3C servers ;-)). Libxml2 (used by lxml) checks for the existence of a catalog in /etc/xml/catalog and in the XML_CATALOG_FILES environment variable (see the Libxml2 docs).
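As a rough sketch of the catalog route: the snippet below writes a small OASIS XML catalog that maps a remote system identifier to a local copy, then points XML_CATALOG_FILES at it. The URL, the local path and the file names are placeholders, not values from your setup. I am also assuming that libxml2 picks up the variable when it is set from inside the running process before any parsing happens; exporting it in the shell before starting Python is the safer option.

```python
import os
import tempfile

# Placeholder values: use the system ID your DTD actually references
# and the location of your local copy.
REMOTE_ID = "http://example.org/dtd/module.dtd"
LOCAL_COPY = "file:///path/to/local/module.dtd"

catalog_xml = f"""<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <system systemId="{REMOTE_ID}" uri="{LOCAL_COPY}"/>
</catalog>
"""

# Write the catalog to a temporary directory for the purpose of this sketch.
catalog_dir = tempfile.mkdtemp()
catalog_path = os.path.join(catalog_dir, "catalog.xml")
with open(catalog_path, "w", encoding="utf-8") as fh:
    fh.write(catalog_xml)

# Tell libxml2 where the catalog lives (assumption: set before lxml parses anything).
os.environ["XML_CATALOG_FILES"] = catalog_path
```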
(It is also possible to write your own resolvers for lxml to intercept and handle requests, but that would probably be overkill in this case.)
Note that there is also another option besides parse-time validation: use the DTD class to load the DTD separately, and use that as a validator.
This will validate the parsed document against the provided DTD regardless of which DTD (if any) is referenced by the doctype declaration (which can be handy: not every valid XML file is necessarily valid according to the DTD you want).
Because the DTD only has to be retrieved and parsed once, this should be faster if you're validating a lot of documents, and (if I'm not mistaken) you won't run into the no_network problem.
Another bonus of this approach: you can even validate your elements/elementtrees before you've serialized them (if your producing tool uses lxml, that is).
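A small sketch of that approach, using a made-up three-element DTD loaded from a string (in practice you would pass etree.DTD() the filename of your local DTD). It validates one tree built in memory, before any serialization, and one parsed document that lacks a required element.

```python
from io import StringIO
from lxml import etree

# Hypothetical DTD, loaded once and reused for every document.
dtd = etree.DTD(StringIO(
    "<!ELEMENT note (to, body)>"
    "<!ELEMENT to (#PCDATA)>"
    "<!ELEMENT body (#PCDATA)>"
))

# Build a tree in memory and validate it before serializing.
root = etree.Element("note")
etree.SubElement(root, "to").text = "Alice"
etree.SubElement(root, "body").text = "Hello"
is_valid = dtd.validate(root)

# A parsed document that is missing the required <to> element.
bad = etree.XML("<note><body>Hello</body></note>")
bad_is_valid = dtd.validate(bad)

# After a failed validation, dtd.error_log describes what went wrong.
errors = dtd.error_log.filter_from_errors()
```

The parsed document carries no doctype declaration at all here, which shows that the validator does not depend on one.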
A final note: some documents can only be parsed if you have access to the DTD at parse time (because of otherwise unresolvable entities, for example). Avoid this if you can. (And, although not everyone would agree: avoid doctype declarations altogether if possible.)