Hello,

I am trying to parse an HTML document whose doctype declares the XHTML 1.0 Transitional DTD, as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

When I call Builder.build on the document, I get the following exception:

  java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
       at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1305)
       at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
       at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
       at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown Source)
       at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown Source)
       at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
       at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
       at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
       at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
       at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
       at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
       at nu.xom.Builder.build(Builder.java:1127)
       at nu.xom.Builder.build(Builder.java:1019)

If I remove the doctype declaration, it parses just fine. I can successfully download the DTD from my browser, which tells me that the URL is valid. I don't want to remove the doctype declaration. Is there a way to tell the builder not to download the DTD, or to provide it with an alternate DTD?

+2  A: 

Taking a quick look at the javadoc for Builder, I guess you could provide an EntityResolver via the constructor that takes an XMLReader. I would avoid letting the parser download files from the internet where possible.

McDowell
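As a minimal sketch of the EntityResolver idea, assuming the JDK's built-in SAX parser rather than the standalone Xerces from the stack trace: the resolver below answers every external-entity request with an empty document, so the parser never contacts www.w3.org. The class name `SkipDtdExample` and the helpers `newOfflineReader` and `rootName` are made up for illustration. One caveat: with an empty DTD, named entities such as `&nbsp;` are no longer declared, so documents that use them would fail to parse.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SkipDtdExample {

    // Hypothetical resolver: every external entity (including the DTD)
    // is replaced by an empty document, so no network request is made.
    static final EntityResolver EMPTY_RESOLVER =
            (publicId, systemId) -> new InputSource(new StringReader(""));

    // Creates a namespace-aware XMLReader with the resolver installed.
    static XMLReader newOfflineReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver(EMPTY_RESOLVER);
        return reader;
    }

    // Parses the markup and returns the local name of the root element.
    static String rootName(String xml) throws Exception {
        final String[] root = new String[1];
        XMLReader reader = newOfflineReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (root[0] == null) {
                    root[0] = localName;
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return root[0];
    }

    public static void main(String[] args) throws Exception {
        String doc = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>";
        System.out.println(rootName(doc));
    }
}
```

With XOM, the reader from `newOfflineReader()` would then be handed to `new Builder(reader)`. Whether XOM leaves a pre-installed resolver untouched is an assumption here, not something this thread confirms.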
org.apache.xerces.parsers.SAXParser xmlReader = new SAXParser();
xmlReader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
Builder xomBuilder = new Builder(xmlReader);
Bala
Why the 503's were happening: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
Bala
Instead of disabling DTD loading, I downloaded the DTD and added it to my software as an embedded resource. When the parser asks for it, I supply my local cached copy instead of fetching it from the internet. I think this is better than completely disabling DTD processing.
ChrisW
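The local-copy approach could look roughly like the sketch below, again assuming the JDK's built-in SAX parser. The class `LocalDtdExample`, its helpers, and the one-line stand-in DTD written to a temp file are all hypothetical. In a real application the URL would more likely come from something like `getResource(...)` pointing at the bundled xhtml1-transitional.dtd and the entity files it references.

```java
import java.io.StringReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class LocalDtdExample {

    static final String XHTML_TRANSITIONAL = "-//W3C//DTD XHTML 1.0 Transitional//EN";

    // Hypothetical resolver: looks up a local copy by the entity's public ID.
    static EntityResolver localCopies(Map<String, URL> byPublicId) {
        return (publicId, systemId) -> {
            URL local = byPublicId.get(publicId);
            if (local == null) {
                return null; // unknown entity: the parser resolves it the usual way
            }
            InputSource source = new InputSource(local.openStream());
            source.setPublicId(publicId);
            // Use the local URL as system ID so that relative references
            // inside the DTD resolve next to the local copy.
            source.setSystemId(local.toExternalForm());
            return source;
        };
    }

    // Parses the markup with the given resolver and returns all character data.
    static String textOf(String xml, EntityResolver resolver) throws Exception {
        StringBuilder text = new StringBuilder();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver);
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    // Demo: a temp file with a single entity declaration stands in for the real DTD.
    static String demo() throws Exception {
        Path dtd = Files.createTempFile("xhtml1-transitional", ".dtd");
        try {
            Files.write(dtd, "<!ENTITY nbsp \"&#160;\">".getBytes(StandardCharsets.UTF_8));
            Map<String, URL> copies =
                    Collections.singletonMap(XHTML_TRANSITIONAL, dtd.toUri().toURL());
            String doc = "<!DOCTYPE html PUBLIC \"" + XHTML_TRANSITIONAL + "\" "
                    + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                    + "<html><body>a&nbsp;b</body></html>";
            return textOf(doc, localCopies(copies));
        } finally {
            Files.deleteIfExists(dtd);
        }
    }

    public static void main(String[] args) throws Exception {
        // The entity declared in the local copy is expanded during parsing.
        System.out.println(demo().length());
    }
}
```

Unlike switching DTD loading off, this keeps entity declarations from the DTD available to the parser, which may be part of why the comment above prefers it.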
+2  A: 

This solves the problem:

// `is` is the InputStream of the document being parsed.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
// Xerces feature: do not load the external DTD when not validating.
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
Document document = factory.newDocumentBuilder().parse(is);
agori