I want to read an XML file that has a schema declaration in it.

And that's all I want to do, read it. I don't care if it's valid, but I want it to be well formed.

The problem is that the reader is trying to read the schema file, and failing.

I don't want it to even try.

I've tried disabling validation, but it still insists on trying to read the schema file.

Ideally, I'd like to do this with a stock Java 5 JDK.

Here's what I have so far, very simple:

    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setValidating(false);
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc = db.parse(file);

and here's the exception I am getting back:

    java.lang.RuntimeException: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

Yes, this HAPPENS to be an XHTML schema, but this isn't an "XHTML" issue, it's an XML issue. Just pointing that out so folks don't get distracted. And, in this case, the W3C is basically saying "don't ask for this thing, it's a silly idea", and I agree. But, again, it's a detail of the issue, not the root of it. I don't want to ask for it AT ALL.

Thanx!

A: 

I've not tested this, but you could try calling setSchema on the factory passing null.

i.e.

    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setValidating(false);
    dbf.setSchema(null);
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc = db.parse(file);

Update: Looking at DocumentBuilderImpl, it looks like this might work: the constructor takes the grammar from the factory and only sets up schema validation if that grammar is non-null.

From DocumentBuilderFactoryImpl:

    public void setSchema(Schema grammar) {
        this.grammar = grammar;
    }

From DocumentBuilderImpl constructor:

    ...
    this.grammar = dbf.getSchema();
    if (grammar != null) {
        XMLParserConfiguration config = domParser.getXMLParserConfiguration();
        XMLComponent validatorComponent = null;
        /** For Xerces grammars, use built-in schema validator. **/
        ...
    }
Rich Seller
Sorry, doesn't seem to work.
jpatokal
+7  A: 

The reference is not for Schema, but for a DTD.

DTD files can contain more than just structural rules; they can also declare entities. The parser loads and parses a referenced DTD even when validation is off, because the DTD could declare entities that affect how the document is parsed and what content it produces (an entity can stand for a single character or even a whole phrase of text).

If you want to avoid loading and parsing the referenced DTD, you can provide your own EntityResolver, test for the referenced DTD, and decide whether to return a local copy of the DTD file or an empty InputSource. Returning null hands the lookup back to the parser, which then opens the system identifier as usual.

Code sample from the referenced answer on custom EntityResolvers:

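    builder.setEntityResolver(new EntityResolver() {
        // On Java 5, drop @Override here: it is only accepted on
        // interface method implementations from Java 6 onwards.
        @Override
        public InputSource resolveEntity(String publicId, String systemId)
                throws SAXException, IOException {
            if (systemId.contains("foo.dtd")) {
                // Hand the parser an empty DTD so it never fetches the real one.
                return new InputSource(new StringReader(""));
            } else {
                // null means "default behaviour": the parser opens systemId itself.
                return null;
            }
        }
    });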
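
For completeness, here is a minimal end-to-end sketch of the same idea using only the stock JDK. The class name `NoFetchXmlReader` and the sample document are made up for illustration; the resolver answers every lookup with an empty stream, so the parser still checks well-formedness but never goes to the network.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name is illustrative only.
class NoFetchXmlReader {

    // Parses for well-formedness only. Every external entity (the DTD
    // included) is replaced by an empty stream, so nothing is fetched.
    static Document parse(InputStream in) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false);
        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                return new InputSource(new StringReader(""));
            }
        });
        return db.parse(in);
    }

    public static void main(String[] args) throws Exception {
        // In-memory sample with the same DOCTYPE as in the question.
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html><body><p>hi</p></body></html>";
        Document doc = parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

One caveat with the blanket empty resolver: entities that are only declared in the DTD (such as `&nbsp;` in XHTML) are then undeclared, and the parser may reject documents that use them.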
Mads Hansen
I was thinking this is what I would have to do. I simply made an "empty" EntityResolver that always returns the empty InputSource for everything. This seemed to do the trick.
Will Hartung
I faced the same issue and I applied this solution. It solved the IOException. My concern is that the DOCTYPE is getting lost with the empty entity resolver. I want this DOCTYPE retained, not removed from the input. Is that possible?
Rachel
+1  A: 

The issue here isn't one of validation. Regardless of validation settings, the parser will still attempt to resolve any references in your document, such as entities, DTDs and (sometimes) schemas. It's only later on that it decides to validate using them (or not). You need to plug in an entity resolver to "intercept" these attempts at de-referencing.

Check out Apache XML Resolver for an easy(ish) way to do this.
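
If pulling in a library is not an option, the interception can be sketched with the plain JDK. This is not Apache XML Resolver and does not read catalog files; `LocalCopyResolver` and its `register` method are hypothetical names, and the one-line DTD in `main` is only a stand-in for a real local copy. Known system IDs are served from memory, and anything else gets an empty stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: serves known DTDs from memory, blanks out the rest.
class LocalCopyResolver implements EntityResolver {
    private final Map<String, String> localCopies = new HashMap<String, String>();

    // Register the text to serve whenever the parser asks for this system ID.
    void register(String systemId, String dtdText) {
        localCopies.put(systemId, dtdText);
    }

    public InputSource resolveEntity(String publicId, String systemId) {
        String dtd = localCopies.get(systemId);
        // Unknown references get an empty stream, so nothing goes to the network.
        InputSource source = new InputSource(new StringReader(dtd != null ? dtd : ""));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    // Validation stays off; the resolver is still consulted for the DTD.
    static Document parse(InputStream in, EntityResolver resolver) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false);
        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setEntityResolver(resolver);
        return db.parse(in);
    }

    public static void main(String[] args) throws Exception {
        String systemId = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";
        LocalCopyResolver resolver = new LocalCopyResolver();
        // Stand-in for a real local DTD copy: declares just one entity.
        resolver.register(systemId, "<!ENTITY nbsp \"&#160;\">");
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \""
                + systemId + "\"><html><body><p>a&nbsp;b</p></body></html>";
        Document doc = parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), resolver);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Serving a local copy rather than an empty stream keeps entity declarations available, which matters for documents that use entities defined only in the DTD.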

skaffman
A: 

The simplest answer is this one-liner, called after creating the DocumentBuilderFactory:

    dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

Shamelessly cribbed from http://stackoverflow.com/questions/155101/make-documentbuilder-parse-ignore-dtd-references.
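
In context it might look like the sketch below. The feature URI is Xerces-specific, so this assumes the factory hands back the Xerces-derived parser bundled with the JDK; another implementation may throw ParserConfigurationException from setFeature. The class name `SkipExternalDtd` is made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class; the name is illustrative only.
class SkipExternalDtd {
    // Xerces-specific feature: when false, a non-validating parse
    // does not load the external DTD at all.
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    static Document parse(InputStream in) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false);
        // Must be set on the factory before the builder is created.
        dbf.setFeature(LOAD_EXTERNAL_DTD, false);
        return dbf.newDocumentBuilder().parse(in);
    }

    public static void main(String[] args) throws Exception {
        // In-memory sample with the same DOCTYPE as in the question.
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html><body><p>hi</p></body></html>";
        Document doc = parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```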

jpatokal