In Java, I can validate an XML document against an XSD schema using javax.xml.validation.Validator, or against a DTD by simply parsing the document using org.xml.sax.XMLReader.

What I need though is a way of programmatically determining whether the document itself validates against a DTD (i.e. it contains a <!DOCTYPE ...> statement) or an XSD. Ideally I need to do this without loading the whole XML document into memory. Can anyone please help?

(Alternatively, if there's a single way of validating an XML document in Java that works for both XSDs and DTDs - and allows for custom resolving of resources - that would be even better!)

Many thanks,

A

+1  A: 

See the package description for javax.xml.validation. It contains information about, and examples for, validating against both XSDs and DTDs.
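For the XSD side, a minimal sketch of what that package offers might look like the following. The class name XsdValidation and its two helper methods are made up for illustration; only the javax.xml.validation calls come from the JDK. The resolver parameter is where custom resource resolving would plug in, and passing null is assumed here to keep the default behaviour.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.ls.LSResourceResolver;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
final class XsdValidation {

    // Compiles an XSD held in a string. The resolver (may be null) is consulted
    // when the schema imports or includes other resources.
    static Schema compile(String xsd, LSResourceResolver resolver) throws SAXException {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        factory.setResourceResolver(resolver);
        return factory.newSchema(new StreamSource(new StringReader(xsd)));
    }

    // Returns true if the document validates, false if the validator rejects it.
    static boolean isValid(Schema schema, String xml) throws IOException {
        Validator validator = schema.newValidator();
        try {
            // With no ErrorHandler set, the validator throws on the first error.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

In real use the StreamSource would wrap a file or stream rather than a string, so the document does not have to be held in memory as a whole.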

Kevin
A: 

Could you just use string comparisons?

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public enum Type {
    XSD,
    DTD,
    UNKNOWN
}

// Scans the file line by line for a DOCTYPE declaration or a schema hint.
public Type findType(File f) throws IOException {
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(new FileReader(f))) {
        String line;
        // may want to cut this loop off after a certain number of lines
        while ((line = reader.readLine()) != null) {
            line = line.toLowerCase();
            if (line.contains("<!doctype"))
                return Type.DTD;
            // the line was lower-cased above, so the search string must be lower case too
            else if (line.contains("xsi:schemalocation"))
                return Type.XSD;
        }
    }
    return Type.UNKNOWN;
}
Michael Myers
Nice idea! In the end, I did something similar using a StAX XMLStreamReader. Thanks for your help.
Alan Gairey
@mmyers. This method makes many assumptions about the character set of the XML and gives other opportunities for failure - commented out doctype, for example.
McDowell
@McDowell: Yep. Do you have a better way?
Michael Myers
A: 

Hey Alan,

Could you maybe post a code example of how you are validating an XML document against a given DTD? It seems to be easy for a schema, but I am struggling to find out how to do it with a DTD.

Thanks a lot,

Denis.

Ok I found it:

import java.io.IOException;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

XMLReader reader = XMLReaderFactory.createXMLReader();

// try to activate validation
try {
    // Turn on validation
    reader.setFeature("http://xml.org/sax/features/validation", true);
    // Ensure namespace processing is on (the default)
    reader.setFeature("http://xml.org/sax/features/namespaces", true);
} catch (SAXException e) {
    System.err.println("Cannot activate validation.");
}

try {
    reader.parse("testFiasRequest.xml");
} catch (IOException e) {
    System.err.println("I/O exception reading XML document");
} catch (SAXException e) {
    System.err.println("XML exception reading document.");
}
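
One caveat with a validating SAX parse: the XMLReader documentation says that error events are silently ignored when no ErrorHandler is registered, so validity problems may never surface as a SAXException. Below is a rough sketch of the same idea that collects them instead. The class name DtdValidation and the validate method are invented for this example, and it obtains the reader through SAXParserFactory rather than XMLReaderFactory.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
final class DtdValidation {

    // Parses the document with DTD validation on and returns the validity
    // errors that were reported. An empty list means the document is valid.
    static List<String> validate(InputSource source)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true); // DTD validation
        XMLReader reader = factory.newSAXParser().getXMLReader();

        List<String> problems = new ArrayList<>();
        reader.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                // Validity errors arrive here; record them and keep parsing.
                problems.add(e.getLineNumber() + ":" + e.getColumnNumber() + " " + e.getMessage());
            }
            // fatalError is inherited from DefaultHandler and rethrows,
            // so a malformed document still ends the parse with a SAXException.
        });
        reader.parse(source);
        return problems;
    }
}
```

For custom resolving of the DTD itself, reader.setEntityResolver(...) could be called before parse.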
Denis
+1  A: 

There is no 100% foolproof process for determining how to validate an arbitrary XML document.

For example, this version 2.4 web application deployment descriptor specifies a W3C XML Schema to validate the document against:

<?xml version="1.0" encoding="UTF-8"?>
<web-app id="WebApp_ID" version="2.4"
    xmlns="http://java.sun.com/xml/ns/j2ee"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd">

However, this is an equally valid way of expressing the same thing:

<?xml version="1.0" encoding="UTF-8"?>
<web-app id="WebApp_ID" version="2.4"
    xmlns="http://java.sun.com/xml/ns/j2ee">

RELAX NG doesn't seem to have a mechanism that offers any hints in the document that you should use it. Validation mechanisms are determined by document consumers, not producers. If I'm not mistaken, this was one of the impetuses driving the switch from DTD to more modern validation mechanisms.

In my opinion, your best bet is to tailor the mechanism detector to the set of document types you are processing, reading header information and interpreting it as appropriate. The StAX parser is good for this - because it is a pull mechanism, you can just read the start of the file and then quit parsing on the first element.
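
A rough sketch of that StAX approach, assuming the only hints of interest are a DOCTYPE declaration and the xsi:schemaLocation / xsi:noNamespaceSchemaLocation attributes on the root element. ValidationHintDetector, Hint and detect are names invented for this example. The resolver that hands back an empty stream is also an assumption, meant to stop the parser from fetching an external DTD while sniffing.

```java
import java.io.ByteArrayInputStream;
import java.io.Reader;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
final class ValidationHintDetector {

    enum Hint { DTD, XSD, UNKNOWN }

    // Reads only as far as the root element's start tag, then stops.
    static Hint detect(Reader xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        // Assumption: answer any external lookup with an empty stream.
        factory.setXMLResolver((publicId, systemId, baseUri, namespace) ->
                new ByteArrayInputStream(new byte[0]));

        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.DTD) {
                    return Hint.DTD; // a real DOCTYPE, not one inside a comment
                }
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String xsi = XMLConstants.W3C_XML_SCHEMA_INSTANCE_NS_URI;
                    boolean hasSchemaHint =
                            reader.getAttributeValue(xsi, "schemaLocation") != null
                            || reader.getAttributeValue(xsi, "noNamespaceSchemaLocation") != null;
                    return hasSchemaHint ? Hint.XSD : Hint.UNKNOWN;
                }
            }
            return Hint.UNKNOWN;
        } finally {
            reader.close();
        }
    }
}
```

Because the parser tokenizes the prolog, a commented-out DOCTYPE is skipped, and if the reader is created from an InputStream instead of a Reader the parser also handles the character encoding. As noted above, UNKNOWN does not mean "no schema": the second web-app example would come back as UNKNOWN.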

Link to more of the same and sample code and whatnot.

McDowell