I'm applying an XSLT stylesheet to an HTML file (already filtered and tidied so that it parses as XML).

My code looks like this:

TransformerFactory transformerFactory = TransformerFactory.newInstance();
this.xslt = transformerFactory.newTransformer(xsltSource);
xslt.transform(sanitizedXHTML, result);

However, I receive an error like this for every DOCTYPE found:

ERROR: 'Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/html4/loose.dtd'

I have no issue accessing the DTDs from my browser.

I have little control over the HTML being parsed, and I can't strip the DOCTYPE declarations since I need them for entities.

Any help is welcome.

EDIT:

I tried to disable DTD validation like this:

private Source getSource(StreamSource sanitizedXHTML) throws ParsingException {
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setNamespaceAware(false);
    spf.setValidating(false);  // Turn off validation

    XMLReader rdr;

    try {
        rdr = spf.newSAXParser().getXMLReader();
    } catch (SAXException e) {
        throw new ParsingException(e);
    } catch (ParserConfigurationException e) {
        throw new ParsingException(e);
    }

    InputSource inputSrc = new InputSource(sanitizedXHTML.getInputStream());
    return new SAXSource(rdr, inputSrc);
}

and then calling it like this:

    Source source = getSource(sanitizedXHTML);
    xslt.transform(source, result);

The error persists.

EDIT 2:

I wrote an entity resolver and saved the HTML 4.01 Transitional DTD to my local disk. However, I now get this error:

ERROR: 'The declaration for the entity "HTML.Version" must end with '>'.'

The DTD is unmodified, exactly as downloaded from w3.org.

+2  A: 

I have some suggestions in an answer to a related question.

In particular, when parsing the XML document, you might want to turn DTD validation off, to prevent the parser from trying to fetch the DTD. Alternatively, you might use your own entity resolver to return a local copy of the DTD instead of fetching it over the network.
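
As a rough illustration of the entity resolver option, here is a minimal sketch. The class name LocalDtdResolver and the tiny STUB_DTD string are made up for the example; in practice you would return a stream over a file on disk. The stub is written in XML DTD syntax and only declares two entities, on the assumption that the entities are all you need from the DTD.

```java
import java.io.StringReader;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: answers requests for the W3C HTML 4 DTDs locally
// instead of letting the parser fetch them over the network.
public class LocalDtdResolver implements EntityResolver {

    // Assumption for this sketch: a stand-in DTD in XML syntax that only
    // declares the entities the documents actually use.
    private static final String STUB_DTD =
        "<!ENTITY nbsp \"&#160;\">\n" +
        "<!ENTITY copy \"&#169;\">\n";

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId != null && systemId.startsWith("http://www.w3.org/TR/html4/")) {
            InputSource local = new InputSource(new StringReader(STUB_DTD));
            local.setPublicId(publicId);
            local.setSystemId(systemId);
            return local;
        }
        // Returning null tells the parser to fall back to its default resolution.
        return null;
    }
}
```

You would install it on the XMLReader before handing the reader to the transformer, e.g. rdr.setEntityResolver(new LocalDtdResolver()) just before building the SAXSource.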


Edit: Just calling setValidating(false) on the SAXParserFactory might not be enough to prevent the parser from loading the external DTD. The parser may need the DTD for other purposes, such as entity definitions. (Perhaps you could change your HTML sanitization/preprocessing phase to replace all entity references with the equivalent numeric character references, eliminating the need for the DTD?)
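
To make the preprocessing idea concrete, here is a minimal sketch of rewriting named entity references as numeric character references. The EntityRewriter class and its three-entry table are invented for the example; a real table would have to cover every HTML entity your input can contain. Names that are not in the table (including XML's own &amp;, &lt; and so on) are left untouched. Note that this is a plain text substitution, so it would also rewrite matches inside comments or CDATA sections.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the sanitization phase.
public class EntityRewriter {

    // Illustrative subset only; extend with the entities you actually need.
    private static final Map<String, Integer> CODE_POINTS = new HashMap<>();
    static {
        CODE_POINTS.put("nbsp", 160);
        CODE_POINTS.put("copy", 169);
        CODE_POINTS.put("eacute", 233);
    }

    private static final Pattern NAMED_REFERENCE =
        Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String toNumericReferences(String markup) {
        Matcher m = NAMED_REFERENCE.matcher(markup);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            // Unknown names are copied through unchanged.
            String replacement = (codePoint != null) ? "&#" + codePoint + ";" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```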

I don't think there is a standard SAX feature flag which would ensure that external DTD loading is completely disabled, so you might have to use something specific to your parser. So if you are using Xerces, for example, you might want to look up Xerces-specific features and call setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false) just to be sure.
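
Applied to a SAXParserFactory, that could look roughly like the sketch below. The class and method names are made up for the example, and it assumes a Xerces-based parser (the parser bundled with the JDK is derived from Xerces and should recognize the feature; other parsers may throw an exception from setFeature). Keep in mind that entities declared only in the external DTD will not be available once it is skipped.

```java
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: builds a SAXSource whose parser skips the external DTD.
public class NoExternalDtd {

    // Xerces-specific feature name, not part of the SAX standard.
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    static SAXSource sourceWithoutExternalDtd(InputSource input) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);  // XSLT processing works on namespace-aware input
        spf.setValidating(false);     // no validation...
        spf.setFeature(LOAD_EXTERNAL_DTD, false); // ...and do not even fetch the DTD

        XMLReader reader = spf.newSAXParser().getXMLReader();
        return new SAXSource(reader, input);
    }
}
```

The resulting source can then be passed to xslt.transform(source, result) as before.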

Jukka Matilainen
Thanks for the advice, but the problem persists. I just edited showing how I attempted to disable DTD validation.
Johnco
Your edit did it!
Johnco
+1  A: 

Assuming you want the DTD loaded (for your entities), you will need to use a resolver. The basic problem you are encountering is that the W3C limits access to the URLs for the DTDs for performance reasons (without the limit, the volume of automated requests would swamp their servers).

You should be working with a local copy of the DTD and using a catalog to handle this. Take a look at the Apache Commons Resolver. If you don't know how to use a catalog, catalogs are well documented in Norm Walsh's article.

Of course, you will have problems if you do validate. That is an SGML DTD and you are trying to use it for XML, which will probably not work.

Nic Gibson
Tried it, got an error when parsing the local DTD from disk. Check my edits please.
Johnco