In my current project I use the JAXB RI with the default XML parser from Sun's JRE (which I believe is Xerces) to unmarshal arbitrary XML.
First I use XJC to compile an XSD of the following form:
<?xml version="1.0" encoding="utf-8" ?>
<xs:schema attributeFormDefault="unqualified"
           elementFormDefault="qualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="foobar">
    ...
  </xs:element>
</xs:schema>
In the "good case" everything works as designed: if I'm passed XML that conforms to this schema, JAXB correctly unmarshals it into an object tree.
The problem comes when I'm passed XML with an external DTD reference, e.g.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foobar SYSTEM "http://blahblahblah/foobar.dtd">
<foobar></foobar>
Upon unmarshalling something like this, the SAX parser attempts to load the remote entity ("http://blahblahblah/foobar.dtd"), despite the fact that this snippet clearly does not conform to the schema I compiled earlier with XJC.
To circumvent this behavior, since I know that any conformant XML (according to the XSD I compiled) will never require the loading of a remote entity, I have to define a custom EntityResolver that short-circuits the loading of all remote entities. So instead of doing something like:
MyClass foo = (MyClass) myJAXBContext.createUnmarshaller().unmarshal(myReader);
I'm forced to do this:
// mySAXParser is a javax.xml.parsers.SAXParser created elsewhere
XMLReader myXMLReader = mySAXParser.getXMLReader();
// Route every external entity lookup through my own resolver
myXMLReader.setEntityResolver(myCustomEntityResolver);
SAXSource mySAXSource = new SAXSource(myXMLReader, new InputSource(myReader));
MyClass foo = (MyClass) myJAXBContext.createUnmarshaller().unmarshal(mySAXSource);
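For reference, here is a minimal sketch of the kind of resolver I mean. The names NoRemoteEntities, EMPTY_RESOLVER, newReader and rootElementOf are made up for illustration and are not part of JAXB or SAX. It leaves JAXB out entirely and only shows the SAX side: the resolver hands back an empty stream for every external entity, so the parser should have nothing to fetch. I'm assuming a namespace-aware parser here, which may not match every setup.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class NoRemoteEntities {

    // Answers every external entity request with an empty stream,
    // so the parser is never asked to open the system identifier itself.
    static final EntityResolver EMPTY_RESOLVER =
            (publicId, systemId) -> new InputSource(new StringReader(""));

    // Builds a namespace-aware XMLReader with the resolver installed.
    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // assumption: namespace-aware parsing
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver(EMPTY_RESOLVER);
        return reader;
    }

    // Parses the XML and returns the local name of the root element.
    static String rootElementOf(String xml) throws Exception {
        final String[] root = new String[1];
        XMLReader reader = newReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Remember only the first element seen, i.e. the root.
                if (root[0] == null) {
                    root[0] = localName;
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return root[0];
    }
}
```

The XMLReader returned by newReader() can be wrapped in a SAXSource and handed to the unmarshaller exactly as in the snippet above.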
So my ultimate question is:
When unmarshalling with JAXB, should the loading of remote entities by the underlying SAX parser be automatically short circuited when the XML in question can be recognized as invalid without the loading of those remote entities?
Also, doesn't this seem like a security issue? Given that JAX-WS relies on JAXB under the hood, it seems like I could pass specially crafted XML to any JAX-WS-based web service and cause the WS host to load an arbitrary URL.
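To make the concern concrete, here is a sketch of a stricter variant that might be worth considering, assuming the underlying parser really is Xerces (or the Xerces-derived parser bundled with the JRE). Since no conformant document needs a DTD, it asks the parser to reject any document containing a DOCTYPE declaration, using the Xerces feature http://apache.org/xml/features/disallow-doctype-decl. The class StrictReaders and its methods are hypothetical names, and I have not checked how JAX-WS configures its own parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper class, for illustration only.
class StrictReaders {

    // Xerces feature id; other parser implementations may not recognize it,
    // in which case setFeature throws.
    static final String DISALLOW_DOCTYPE =
            "http://apache.org/xml/features/disallow-doctype-decl";

    // Builds a namespace-aware XMLReader that refuses DOCTYPE declarations.
    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // assumption: namespace-aware parsing
        factory.setFeature(DISALLOW_DOCTYPE, true);
        return factory.newSAXParser().getXMLReader();
    }

    // Returns true if the document parses, false if the parser rejects it.
    static boolean isAccepted(String xml) throws Exception {
        try {
            newReader().parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // A DOCTYPE (or any other fatal parse error) ends up here.
            return false;
        }
    }
}
```

With this in place, a plain <foobar></foobar> document should still parse, while the DOCTYPE example above should fail with a parse error before any DTD is requested.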
I'm a relative newbie to this, so there's probably something I'm missing. Please let me know if so!