I am using a third party application and would like to change one of its files. The file is stored in XML but with an invalid doctype.

When I try to read it with a DocumentBuilder, it errors out because the doctype contains "file:///ReportWiz.dtd" (as shown, with quotes) and I get a file-not-found exception. Is there a way to tell the DocumentBuilder to ignore this? I have tried setValidating(false) and setNamespaceAware(false) on the DocumentBuilderFactory.

The only solutions I can think of are

  • copying the file line by line into a new file, omitting the offending line, doing what I need to do, then copying into another new file and inserting the offending line back in, or
  • doing mostly the same as above but working with a stream of some sort (though I am not clear on how I could do this. Help?)

DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
docFactory.setValidating(false);
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.parse(file);
+1  A: 

My first thought was dealing with it as a stream. You could make a new adapter at some level and just copy input to output except for the offending text.

If the file is shortish (under half a gig or so) you could also read the entire thing into a byte array and make your modifications there, then create a new stream from the byte array into your builder.

That's the advantage of the amazingly bulky way Java handles streams: you actually have a lot of flexibility.
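
The byte-array approach could look roughly like the sketch below. The class and method names (DoctypeStripper, stripDoctype, parseWithoutDoctype) are made up for illustration, the file is assumed to be UTF-8, and the regex only handles a simple one-line DOCTYPE with no internal subset in square brackets, which matches the declaration described in the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of any library.
final class DoctypeStripper {

    // Removes the first simple DOCTYPE declaration.
    // Assumption: the declaration has no internal subset ("[ ... ]")
    // and no '>' inside its quoted identifiers.
    static String stripDoctype(String xml) {
        return xml.replaceFirst("<!DOCTYPE[^>\\[]*>", "");
    }

    // Decodes the raw bytes, drops the DOCTYPE, and parses the result
    // from an in-memory stream.
    static Document parseWithoutDoctype(byte[] raw) throws Exception {
        // Assumption: the file is encoded as UTF-8.
        String xml = new String(raw, StandardCharsets.UTF_8);
        byte[] cleaned = stripDoctype(xml).getBytes(StandardCharsets.UTF_8);
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(cleaned));
    }

    public static void main(String[] args) throws Exception {
        // For a real file, the bytes would come from something like
        // java.nio.file.Files.readAllBytes(path) instead.
        String sample = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE report SYSTEM \"file:///ReportWiz.dtd\">\n"
                + "<report><item/></report>";
        Document doc = parseWithoutDoctype(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Note that the parsed Document no longer has a doctype, so if the third-party application needs the declaration, it has to be put back when the file is written out.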

Bill K
Could you maybe help me with some example code (or a link)? This sounds a lot like what I want to do.
Adam Lerman
Bill K
Here is an example that has very simple filtering (it excludes unprintable characters from the stream I believe). http://www.cafeaulait.org/slides/sd2000west/javaio/44.html Your case is harder because you need to recognize a multi-character pattern.
Bill K
A: 

Another thing I was debating was storing the whole file in a String, then doing my manipulations and writing the String out to a file. None of these seem clean or easy, but what is the best way to do this?

Adam Lerman
+2  A: 

Handle resolution of the DTD manually, either by returning a copy of the DTD file (loaded from the classpath) or by returning an empty one. You can do this by setting an entity resolver on your document builder:

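 EntityResolver er = new EntityResolver() {
  @Override
  public InputSource resolveEntity(String publicId, String systemId)
    throws SAXException, IOException {
   if ("file:///ReportWiz.dtd".equals(systemId)) {
    System.out.println(systemId);
    // Hand the parser an empty DTD instead of letting it look for the file
    InputStream zeroData = new ByteArrayInputStream(new byte[0]);
    return new InputSource(zeroData);
   }
   // Any other entity falls back to default resolution
   return null;
  }
 };
 // docBuilder is the DocumentBuilder from the question
 docBuilder.setEntityResolver(er);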
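
To try the resolver idea in isolation, here is a minimal self-contained sketch. The class name EmptyDtdResolverDemo and the method parseIgnoringDtd are made up for illustration, and matching on a ".dtd" suffix (rather than the exact system ID) is an assumption, not something the answer prescribes.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
final class EmptyDtdResolverDemo {

    static Document parseIgnoringDtd(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Assumption: any system ID ending in ".dtd" can be replaced by an
        // empty DTD. Returning null keeps the default behavior for the rest.
        builder.setEntityResolver((publicId, systemId) -> {
            if (systemId != null && systemId.endsWith(".dtd")) {
                return new InputSource(new ByteArrayInputStream(new byte[0]));
            }
            return null;
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String sample = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE report SYSTEM \"file:///ReportWiz.dtd\">\n"
                + "<report><item/></report>";
        Document doc = parseIgnoringDtd(sample);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Unlike stripping the line from the input, this leaves the DOCTYPE declaration in the source untouched; only the lookup of the external file is short-circuited.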
McDowell
More complex than I needed. I didn't try this, but I was really only looking for a way to ignore it completely.
Adam Lerman
+4  A: 

Tell your DocumentBuilderFactory to ignore the DTD declaration like this:

docFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

See here for a list of available features.
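
Put together with the factory setup from the question, this might look like the sketch below. The class name LoadExternalDtdOffDemo is made up for illustration. The feature URI is specific to Xerces; the sketch assumes the underlying parser is Xerces or the JDK's Xerces-derived default, and other parsers may throw ParserConfigurationException from setFeature.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
final class LoadExternalDtdOffDemo {

    // Xerces-specific feature; assumed to be recognized by the parser in use.
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        // Tell a non-validating parser not to fetch the external DTD at all.
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String sample = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE report SYSTEM \"file:///ReportWiz.dtd\">\n"
                + "<report><item/></report>";
        Document doc = parse(sample);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```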

You also might find JDOM a lot easier to work with than org.w3c.dom:

org.jdom.input.SAXBuilder builder = new org.jdom.input.SAXBuilder();
builder.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
org.jdom.Document doc = builder.build(file);
Sophie Tatham
EXACTLY what I needed. THANKS!! Welcome to SO.
Adam Lerman