I have a program that needs to parse XML that contains character entities. The program itself doesn't need to have them resolved, and the list of them is large and will change, so I want to avoid explicit support for these entities if I can.

Here's a simple example:

<?xml version="1.0" encoding="UTF-8"?>
<xml>Hello there &something;</xml>

Is there a Java XML API that can parse a document successfully without resolving (non-standard) character entities? Ideally it would translate them into a special event or object that could be handled specially, but I'd settle for an option that would silently suppress them.

Answer & Example:

Skaffman gave me the answer: use a StAX parser with IS_REPLACING_ENTITY_REFERENCES set to false.

Here's the code I whipped up to try it out:

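import java.io.FileInputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.EntityReference;
import javax.xml.stream.events.XMLEvent;

// Ask the parser to report entity references as events instead of expanding them.
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
inputFactory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
XMLEventReader reader = inputFactory.createXMLEventReader(
    new FileInputStream("your file here"));

while (reader.hasNext()) {
    XMLEvent event = reader.nextEvent();
    if (event.isEntityReference()) {
        // Unresolved references arrive as EntityReference events.
        EntityReference ref = (EntityReference) event;
        System.out.println("Entity Reference: " + ref.getName());
    }
}
reader.close();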

For the above XML, it will print "Entity Reference: something".
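
For anyone who wants to try this without a file on disk, here is a self-contained sketch of the same idea. It reads the XML from a string and collects the entity names into a list. The class name EntityRefDemo and the method entityNames are made up for illustration, and the behaviour is what I would expect from the JDK's bundled StAX parser; other implementations may treat the property differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EntityReference;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class, not part of the original post.
final class EntityRefDemo {

    // Returns the names of the entity references the parser reports, in document order.
    static List<String> entityNames(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Report entity references as events instead of expanding them.
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);

        List<String> names = new ArrayList<>();
        try {
            XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    XMLEvent event = reader.nextEvent();
                    if (event.isEntityReference()) {
                        names.add(((EntityReference) event).getName());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped only to keep the demo's signature simple.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(entityNames("<xml>Hello there &something;</xml>"));
    }
}
```

Calling entityNames on the example document from the question should give a one-element list containing "something".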

A: 

A SAX parse with an org.xml.sax.EntityResolver might suit your purpose. You could certainly suppress the entities, and you could probably find a way to leave them unresolved.

This tutorial seems the most relevant: it shows how to resolve entities into strings.

Jim Ferrans
I tried that out. It appears that EntityResolvers are only used for external entities: in this case, the resolveEntity(...) method isn't getting called, and the parser fails with "org.xml.sax.SAXParseException: The entity "something" was referenced, but not declared."
cosmic.osmo
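
A rough way to reproduce the behaviour described in the comment above, assuming the JDK's default SAX parser: register a handler whose resolveEntity(...) records whether it was called, then parse the example document. SaxResolverDemo and parseError are hypothetical names used only for this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original thread.
final class SaxResolverDemo {

    // Parses the XML and returns the parser's error message, or null on success.
    // resolverCalled[0] is set to true if the parser ever consults the resolver.
    static String parseError(String xml, boolean[] resolverCalled) {
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                resolverCalled[0] = true;
                // Suppress the external entity by substituting empty content.
                return new InputSource(new StringReader(""));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (Exception e) {
            // Configuration or I/O problems are not the point of this sketch.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] called = {false};
        String error = parseError("<xml>Hello there &something;</xml>", called);
        System.out.println("resolver called: " + called[0]);
        System.out.println("parse error: " + error);
    }
}
```

If the comment is right, the resolver is never consulted for an undeclared internal entity, and the parse ends with an error message rather than null.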
+2  A: 

The StAX API has support for the notion of not replacing character entity references, by way of the IS_REPLACING_ENTITY_REFERENCES property:

Requires the parser to replace internal entity references with their replacement text and report them as characters

This can be set on an XMLInputFactory, which is then in turn used to construct an XMLEventReader or XMLStreamReader. However, the API is careful to say that this property is only intended to force the implementation to perform the replacement, rather than forcing it not to replace them. Still, it's got to be worth a try.

skaffman
That's EXACTLY what I want. If you set that property to false, you'll see EntityReference events in the stream, from which you can get the entity name via the getName() method.
cosmic.osmo
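
The same property also applies to the cursor-style XMLStreamReader mentioned in the answer. As a sketch of "handling them specially", the code below rebuilds the text content and writes each unresolved reference back out verbatim as &name;. EntityPassThroughDemo and textWithRawEntities are invented names, and this assumes the parser reports the reference as an ENTITY_REFERENCE event, as the JDK's bundled implementation appears to do.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the original thread.
final class EntityPassThroughDemo {

    // Concatenates the document's character data, leaving entity references unexpanded.
    static String textWithRawEntities(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);

        StringBuilder out = new StringBuilder();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    switch (reader.next()) {
                        case XMLStreamConstants.CHARACTERS:
                            out.append(reader.getText());
                            break;
                        case XMLStreamConstants.ENTITY_REFERENCE:
                            // For this event type, getLocalName() returns the entity name.
                            out.append('&').append(reader.getLocalName()).append(';');
                            break;
                        default:
                            break; // Ignore elements, comments, and so on.
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped only to keep the demo's signature simple.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(textWithRawEntities("<xml>Hello there &something;</xml>"));
    }
}
```

The design choice here is simply to defer the decision: the reference survives as text, so whatever consumes the output can resolve it later or ignore it.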
A: 

I am not a Java developer, but I think the Java XML classes support functionality similar to .NET's for accomplishing this. In .NET, on the XmlReaderSettings class, you set the ProhibitDtd property to false and the XmlResolver property to null. This causes the parser to ignore externally referenced entities without throwing an exception when they are read. I just did a Google search for "Java ignore enity" and got lots of hits, some of which appear to address this topic. I realize this is not a complete answer to your question, but it should point you in a useful direction.

bill seacham