Hi, I am having trouble parsing XHTML that has a DOCTYPE declaration with the DOM parser.

Error: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd%20

Declaration: DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

Is there a way to parse the XHTML into a Document object while ignoring the DOCTYPE declaration?

A: 

The parser is required to download the DTD, but you may get around it by setting the standalone attribute on the <?xml... ?> line.

Note, however, that this particular error is most likely triggered by confusion between XML Schema definitions and DTD URLs. See http://www.w3schools.com/xhtml/xhtml_dtd.asp for details. The right one is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Thorbjørn Ravn Andersen
I used the same DOCTYPE. With the standalone attribute set to "yes" it still gives the same error. This is what I added at the top of my XHTML: <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd "> The error is unchanged: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd%20
Rachel
You have a space between the `.dtd` and the `"`
Thorbjørn Ravn Andersen
This seems to be a common issue, as discussed in this W3C blog post: http://www.w3.org/2005/06/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
Rachel
A: 

The easiest thing to do is to set validating=false on your DocumentBuilderFactory. If you want to do validation, download the DTD and use a local copy. As Rachel commented above, the W3C has discussed this problem on its blog.
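As a minimal sketch of the first suggestion: the snippet below turns validation off and, as an extra step not mentioned in this answer, also switches off a Xerces-specific feature that controls whether the external DTD is loaded at all, because a non-validating parser may still fetch it. The feature URI is understood by Xerces-based parsers such as the JDK's built-in one; other parsers may reject it. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NoDtdParse {
    // Xerces feature name (assumption: the underlying parser is Xerces-based).
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Parses the given XHTML text without validating and without loading the external DTD.
    static Document parse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>t</title></head><body><p>hi</p></body></html>";
        Document doc = parse(xhtml);
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "html"
    }
}
```

With the DTD skipped, named entities that the DTD would have declared (for example `&nbsp;`) are not available to the parser, so this approach suits documents that do not rely on them.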

In short, because the default DocumentBuilderFactory downloads the DTD every time it validates, the W3C's servers were being hit every time a typical programmer tried to parse an XHTML file in Java. They can't afford that much traffic, so they respond with an error.
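The "local copy" option mentioned above could look roughly like this: an EntityResolver that answers requests for the XHTML public ID from a file on disk instead of from www.w3.org. This is only a sketch; the class name, method name and the path of the local DTD are hypothetical, and the real xhtml1-transitional.dtd pulls in further entity files that would also need to sit next to the local copy.

```java
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class LocalDtdResolver {
    static final String XHTML_PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Transitional//EN";

    // Parses XHTML text, serving the XHTML DTD from localDtd (a hypothetical local copy).
    static Document parse(String xhtml, Path localDtd) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver((publicId, systemId) -> {
            if (XHTML_PUBLIC_ID.equals(publicId)) {
                // Point the parser at the local file instead of the W3C URL.
                return new InputSource(localDtd.toUri().toString());
            }
            return null; // anything else falls back to the parser's default resolution
        });
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in DTD for the demo; a real setup would use the downloaded W3C files.
        Path dtd = Files.createTempFile("xhtml1-transitional", ".dtd");
        Files.write(dtd, "<!ENTITY nbsp \"&#160;\">".getBytes("UTF-8"));
        String xhtml =
            "<!DOCTYPE html PUBLIC \"" + XHTML_PUBLIC_ID + "\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html><body>a&nbsp;b</body></html>";
        Document doc = parse(xhtml, dtd);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Matching on the public ID rather than the system ID means the resolver still works if the URL in the DOCTYPE has a stray character, such as the trailing space discussed in the comments above.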

David Leppik
A: 

A solution that works for me is to give the DocumentBuilder a fake EntityResolver that returns an empty stream. There's a good explanation here (see the last message from kdgregory):

http://forums.sun.com/thread.jspa?threadID=5362097

Here's kdgregory's solution:

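import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Resolve every external entity (including the DTD) to an empty stream,
// so the parser never opens a connection to www.w3.org.
documentBuilder.setEntityResolver(new EntityResolver() {
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        return new InputSource(new StringReader(""));
    }
});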
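For context, here is how that resolver might be wired into a complete parse. This is a sketch rather than kdgregory's original code: the class and method names are invented, the resolver is written as a lambda (Java 8+), and the sample XHTML is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EmptyResolverDemo {
    // Parses XHTML text while answering every external entity request with an empty stream.
    static Document parse(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html><head><title>t</title></head><body><p>hi</p></body></html>";
        Document doc = parse(xhtml);
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "html"
    }
}
```

Because the DTD is replaced by an empty one, a document that uses entities declared in the real DTD (such as `&nbsp;`) is likely to fail to parse with this resolver.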
xpmatteo