I am working on a system that should be able to read any (or at least, any well-formed) XML file, manipulate a few nodes and write them back into that same file. I want my code to be as generic as possible and I don't want
- hardcoded references to Schema/Doctype information anywhere in my code. The doctype information is in the source document, I want to keep exactly that doctype information and not provide it again from within my code. If a document has no DocType, I won't add one. I do not care about the form or content of these files at all, except for my few nodes.
- custom EntityResolvers or StreamFilters to omit or otherwise manipulate the source information (It is already a pity that namespace information seems somehow inaccessible from the document file where it is declared, but I can manage by using uglier XPaths)
- DTD validation. I don't have the referenced DTDs, I don't want to include them and Node manipulation is perfectly possible without knowing about them.
The aim is to have the source file entirely unchanged except for the changed Nodes, which are retrieved via XPath. I would like to get away with the standard javax.xml stuff.
My progress so far:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setAttribute("http://xml.org/sax/features/namespaces", true);
factory.setAttribute("http://xml.org/sax/features/validation", false);
factory.setAttribute("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
factory.setAttribute("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
factory.setNamespaceAware(true);
factory.setIgnoringElementContentWhitespace(false);
factory.setIgnoringComments(false);
factory.setValidating(false);
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(inStream));
This loads the XML source into a org.w3c.dom.Document successfully, ignoring DTD validation. I can do my replacements and then I use
Source source = new DOMSource(document);
Result result = new StreamResult(getOutputStream(getPath()));
// Write the DOM document to the file
Transformer xformer = TransformerFactory.newInstance().newTransformer();
xformer.transform(source, result);
to write it back. Which is nearly perfect. But the Doctype tag is gone, no matter what I do. While debugging, I saw that there is a DeferredDoctypeImpl [log4j:configuration: null] object in the Document object after parsing, but it is somehow wrong, empty or ignored. The file I tested on starts like this (but it is the same for other file types):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/" debug="false">
[...]
I think there are a lot of (easy?) ways involving hacks or pulling additional JARs into the project. But I would rather like to have it with the tools I already use.