I'm parsing a lot of XML files that contain entity references which I don't know in advance (and I can't change that fact).

For example:

String xml = "<tag>I'm content with &funny; &entity; &references;.</tag>";

When I try to parse this using the following code:

final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
final DocumentBuilder db = dbf.newDocumentBuilder();
final InputSource is = new InputSource(new StringReader(xml));
final Document d = db.parse(is);

I get the following exception:

org.xml.sax.SAXParseException: The entity "funny" was referenced, but not declared.

What I want to achieve is that the parser replaces every entity that is not declared (unknown to the parser) with an empty string "". Or, even better, is there a way to pass a map to the parser, like:

Map<String,String> entityMapping = ...
entityMapping.put("funny","very");
entityMapping.put("entity","important");
entityMapping.put("references","stuff");

so that I could do the following:

final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
final DocumentBuilder db = dbf.newDocumentBuilder();
final InputSource is = new InputSource(new StringReader(xml));

db.setEntityResolver(entityMapping);
final Document d = db.parse(is);

If I then obtained the text from the document, I should receive:

I'm content with very important stuff.

Any suggestions? Of course, I would already be happy to just replace the unknown entities with empty strings.

Thanks,

+2  A: 

Since your XML input seems to be available as a String, could you not do a simple pre-processing with regular expression replacement?

xml = "...";

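/* replace entities before parsing (needs java.util.regex.Pattern and Matcher) */
for (Map.Entry<String,String> entry : entityMapping.entrySet()) {
   /* quote both sides so that regex metacharacters in names or values are taken literally */
   xml = xml.replaceAll(Pattern.quote("&" + entry.getKey() + ";"),
                        Matcher.quoteReplacement(entry.getValue()));
}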

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
...

It's quite hacky, and you may want to spend some extra effort to ensure that the regexps only match where they really should (think <entity name="&don't-match-me;"/>), but at least it's something...

Of course, there are more efficient ways to achieve the same effect than calling replaceAll() a lot of times.
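One such way, as a rough sketch: scan the string once with a single pattern and look each entity name up in the map, falling back to an empty string for names that are not in it. The class and method names here are made up for illustration, the name pattern is a simplified approximation of XML names, and the sketch assumes the replacement values contain no markup characters and that the document has no CDATA sections or comments containing `&name;` sequences that should be left alone.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class EntityPreprocessor {
    // Simplified approximation of an XML entity reference such as &funny;
    // Numeric references like &#160; do not match because '#' is not a name character here.
    private static final Pattern ENTITY_REF =
            Pattern.compile("&([A-Za-z_][A-Za-z0-9._-]*);");

    // The five entities every XML parser already knows; these are left untouched.
    private static final Set<String> PREDEFINED =
            Set.of("amp", "lt", "gt", "apos", "quot");

    static String replaceEntities(String xml, Map<String, String> mapping) {
        Matcher m = ENTITY_REF.matcher(xml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(1);
            // Known name: mapped value. Unknown name: empty string.
            String replacement = PREDEFINED.contains(name)
                    ? m.group()
                    : mapping.getOrDefault(name, "");
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The result of `replaceEntities(xml, entityMapping)` can then be handed to the `DocumentBuilder` as before. This needs Java 9 or later because of `Set.of` and the `StringBuilder` overload of `appendReplacement`.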

Thomas
I don't know all entities in advance (as I already mentioned in my question); I only know a subset. And what is the point of using XML, for which mature parsers already exist, if I end up writing my own parser just so I can work with XML data?
Chris
The point is this: those mature parsers are designed to handle well-formed XML. You don't have well-formed XML and you're looking for workarounds to make the parsers handle it anyway.
Paul Clapham
Well, we can argue a lot, but I still have this problem to solve.
Chris
@Chris: I was referring to your original code sample in my use of the 'entityMapping' HashMap. If you add something like `xml = xml.replaceAll("&[^;]*;", "");` after the for loop, my method enables you to deal with unknown entity references. You're right that it's annoying to use such work-arounds, but since you're trying to achieve something XML was not designed to deal with, a hack like the one above might indeed be a way for you to solve this problem. And frankly, it's not much of a "parser" that you have to write; it's rather a (very simple) preprocessor.
Thomas
@Thomas: Thanks for your suggestion.
Chris
A: 

You could add the entities at the beginning of the file. Look here for more info.
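As a rough sketch of what "adding the entities at the beginning" could look like when the names are not known in advance: collect every entity name that occurs in the string, then prepend a DOCTYPE with an internal subset that declares each one, using the mapped value where there is one and an empty string otherwise. The class and method names are made up for illustration. The sketch assumes the input has no XML declaration or DOCTYPE of its own, that the caller knows the root element name, and that the mapped values contain no `"`, `%`, `&` or `<` characters.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class EntityDeclarer {
    // Simplified approximation of an XML entity reference such as &funny;
    private static final Pattern ENTITY_REF =
            Pattern.compile("&([A-Za-z_][A-Za-z0-9._-]*);");

    // Predefined entities need no declaration.
    private static final Set<String> PREDEFINED =
            Set.of("amp", "lt", "gt", "apos", "quot");

    static String withDeclarations(String xml, String rootName, Map<String, String> known) {
        // Collect each referenced name once, in order of first appearance.
        Set<String> names = new LinkedHashSet<>();
        Matcher m = ENTITY_REF.matcher(xml);
        while (m.find()) {
            if (!PREDEFINED.contains(m.group(1))) {
                names.add(m.group(1));
            }
        }
        // Build an internal DTD subset declaring every name that was found.
        StringBuilder doctype = new StringBuilder("<!DOCTYPE ").append(rootName).append(" [");
        for (String name : names) {
            doctype.append("<!ENTITY ").append(name).append(" \"")
                   .append(known.getOrDefault(name, ""))
                   .append("\">");
        }
        return doctype.append("]>").append(xml).toString();
    }

    static Document parse(String xml, String rootName, Map<String, String> known) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        String declared = withDeclarations(xml, rootName, known);
        return db.parse(new InputSource(new StringReader(declared)));
    }
}
```

With the question's sample string and mapping, `EntityDeclarer.parse(xml, "tag", entityMapping).getDocumentElement().getTextContent()` should then give the expanded text, since the parser itself performs the substitution. Names that only appear inside comments or CDATA sections get a harmless extra declaration.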

You could also take a look at this thread, where someone seems to have implemented the EntityResolver interface (you could also implement EntityResolver2), which lets you process the entities on the fly (e.g. with your proposed Map).

WARNING: there is a bug in JDK 6, but you could try it with JDK 5.

Karussell
There is no DTD file that defines how to translate the references, and I also don't know all the entity references that may occur in the document, so I can't create one myself. I only know a small subset of the references that occur frequently, but I can't say how "big" that subset is.
Chris
OK, but either you transform those entities into raw CDATA sections, or you 'skip' or transform them on the fly into valid plain XML text or something like that. Otherwise you will not have a chance.
Karussell
EntityResolver2 -> see updated answer
Karussell
+2  A: 

The StAX API has support for this. Have a look at XMLInputFactory, it has a runtime property which dictates whether or not internal entities are expanded, or left in place. If set to false, then the StAX event stream will contain instances of EntityReference to represent the unexpanded entities.
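If only the text is needed, the entity references can be substituted straight from the stream without building a DOM. The following is a minimal sketch of that idea using XMLStreamReader rather than the event reader; the class and method names are made up for illustration, and it assumes the StAX implementation bundled with the JDK, which as far as I know reports undeclared entities as ENTITY_REFERENCE events once the property is set to false (other implementations may need additional settings).

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
class StaxEntityText {
    static String textWithEntities(String xml, Map<String, String> mapping)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Leave entity references in the stream instead of expanding them.
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);

        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        text.append(reader.getText());
                        break;
                    case XMLStreamConstants.ENTITY_REFERENCE:
                        // For this event type getLocalName() is the entity name;
                        // names missing from the map become an empty string.
                        text.append(mapping.getOrDefault(reader.getLocalName(), ""));
                        break;
                    default:
                        break; // elements, comments etc. are ignored in this sketch
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }
}
```

This collects the text of the whole document in one string, which is enough to show the substitution; a real consumer would also track the surrounding elements.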

If you still want a DOM as the end result, you can chain it together like this:

XMLInputFactory inputFactory = XMLInputFactory.newInstance();
inputFactory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
Transformer transformer = TransformerFactory.newInstance().newTransformer();

String xml = "my xml";
StringReader xmlReader = new StringReader(xml);
XMLEventReader eventReader = inputFactory.createXMLEventReader(xmlReader);
StAXSource source = new StAXSource(eventReader);
DOMResult result = new DOMResult();

transformer.transform(source, result);

Node document = result.getNode();

In this case, the resulting DOM will contain nodes of org.w3c.dom.EntityReference mixed in with the text nodes. You can then process these as you see fit.
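One way to "process these", sketched with plain DOM calls: walk the tree and swap every EntityReference node for a text node whose content comes from a map, using an empty string for unknown names. The class and method names are made up for illustration, and the sketch assumes the replacement values are plain text.

```java
import java.util.Map;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not part of any library.
class EntityReferenceReplacer {
    static void replaceEntityReferences(Node node, Map<String, String> mapping) {
        // A Document node has no owner document; it is its own factory.
        Document doc = (node instanceof Document) ? (Document) node : node.getOwnerDocument();

        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the next sibling first, because 'child' may be replaced.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.ENTITY_REFERENCE_NODE) {
                // The node name of an EntityReference is the entity name.
                String value = mapping.getOrDefault(child.getNodeName(), "");
                node.replaceChild(doc.createTextNode(value), child);
            } else {
                replaceEntityReferences(child, mapping);
            }
            child = next;
        }
    }
}
```

Calling `replaceEntityReferences(document, entityMapping)` on the node returned by `result.getNode()` and then `document.normalize()` would merge the new text nodes with their neighbours.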

skaffman