Some code snippets.

The Java code doing the JAXB unmarshalling. It is pretty straightforward, copied from tutorials online.

JAXBContext jc = JAXBContext.newInstance( "xmlreadtest" );
Unmarshaller u = jc.createUnmarshaller();

// setting up for validation.
SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
StreamSource schemaSource =  new StreamSource(ReadXml.class.getResource("level.xsd").getFile());
Schema schema = schemaFactory.newSchema(schemaSource);
u.setSchema(schema);

// parsing the xml
URL url = ReadXml.class.getResource("level.xml");
Source sourceRoot = (Source)u.unmarshal(url);

The problem element from the XML file. The element contains nothing but ignorable whitespace. It is badly formatted because it is shown exactly as it appears in the file.

<HashLine _id='FI6'
ppLine='1'
origLine='1'
origFname='level.cpp'>
</HashLine>

The XSD declaration that describes this element.

<xs:element name="HashLine">
  <xs:complexType>
    <xs:attribute name="origLine" type="xs:NMTOKEN" use="required" />
    <xs:attribute name="origFname" type="xs:string" use="required" />
    <xs:attribute name="_id" type="xs:ID" use="required" />
    <xs:attribute name="ppLine" type="xs:NMTOKEN" use="required" />
  </xs:complexType>
</xs:element>

The error is:

[org.xml.sax.SAXParseException: cvc-complex-type.2.1: Element 'HashLine' must have no character or element information item [children], because the type's content type is empty.]

I've verified the error is coming from that element.
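
For readers who want to reproduce this outside JAXB, here is a minimal sketch using only javax.xml.validation. The class name EmptyContentRepro, the validate helper, and the trimmed one-attribute schema are made up for illustration; they are not part of the original project. It assumes the JDK's default validator behaviour, where validation errors are thrown as SAXException when no ErrorHandler is set.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original code.
final class EmptyContentRepro {
    // Trimmed stand-in for the HashLine declaration: attributes only,
    // so the complex type has an empty content type.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='HashLine'><xs:complexType>"
      + "<xs:attribute name='_id' type='xs:ID' use='required'/>"
      + "</xs:complexType></xs:element></xs:schema>";

    /** Returns null if the XML validates, otherwise the validator's message. */
    static String validate(String xml) throws Exception {
        Schema schema = SchemaFactory
            .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
        try {
            schema.newValidator()
                .validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Self-closing element: no children at all.
        System.out.println(validate("<HashLine _id='FI6'/>"));
        // Same element with a line break between the tags.
        System.out.println(validate("<HashLine _id='FI6'>\n</HashLine>"));
    }
}
```

The only difference between the two inputs is the line break between the start and end tags, which is enough to trigger the cvc-complex-type.2.1 error for a type whose content is empty.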

It loads fine with no validation. But I need to use validation as I'm going to be doing heavy changes and additions to the application, and I have to be certain everything gets marshaled/unmarshaled properly.

It also works fine if I change the complexType to include a simpleContent with an xs:string extension. But I'm getting this issue from entities all over, of which there are a lot, and in a lot of XSD files. So it's not feasible to base every element in the XML documents on xs:string just to get around this issue.

Even though J2SE 6 is using the SchemaFactory from Apache Xerces, it doesn't seem to accept the 'ignore-whitespace' feature from Xerces (i.e. via schemaFactory.setFeature()).

+1  A: 

You could use the StAX API to filter out empty character blocks prior to validation using an EventFilter:

class WhitespaceFilter implements EventFilter {
  @Override
  public boolean accept(XMLEvent event) {
    return !(event.isCharacters() && ((Characters) event)
        .isWhiteSpace());
  }
}

This can be used to wrap your input:

// strip unwanted whitespace
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
XMLEventReader eventReader = inputFactory
    .createXMLEventReader(ReadXml.class.getResourceAsStream("level.xml"));
eventReader = inputFactory.createFilteredReader(eventReader,
    new WhitespaceFilter());

// parsing the xml
Source sourceRoot = (Source) unmarshaller.unmarshal(eventReader);

//TODO: proper error + stream handling
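
To see what the filter actually removes, here is a small self-contained sketch that counts whitespace-only character events with and without it. The class WhitespaceFilterDemo and the method countWhitespaceEvents are hypothetical names used only for this illustration, and the lambda form of EventFilter assumes Java 8 or later (on Java 6, use the WhitespaceFilter class above instead).

```java
import java.io.StringReader;
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class, not part of the original answer.
final class WhitespaceFilterDemo {
    /** Counts whitespace-only character events seen by the caller. */
    static int countWhitespaceEvents(String xml, boolean filtered)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLEventReader reader =
            factory.createXMLEventReader(new StringReader(xml));
        if (filtered) {
            // Same predicate as WhitespaceFilter, written as a lambda.
            EventFilter skipWhitespace = event ->
                !(event.isCharacters() && event.asCharacters().isWhiteSpace());
            reader = factory.createFilteredReader(reader, skipWhitespace);
        }
        int count = 0;
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()
                        && event.asCharacters().isWhiteSpace()) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<HashLine _id='FI6'>\n</HashLine>";
        System.out.println(countWhitespaceEvents(xml, false));
        System.out.println(countWhitespaceEvents(xml, true));
    }
}
```

Note that the filter drops every whitespace-only text event in the document, not just those inside empty-content elements, so it may be too aggressive if some elements are meant to carry significant whitespace.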
McDowell
+2  A: 

I would suggest writing a very simple XSLT transform to strip out the empty content from the specific elements that are causing the problem (e.g. only the HashLine elements). Then add a processing step before you pass the data through JAXB, using TransformerFactory, Transformer, and so on, which "cleans" the data with the XSLT transform. You could add all sorts of cleaning logic to the XSLT for cases where you find other non-JAXB-friendly structures in the source XML.
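
A minimal sketch of what such a cleaning step might look like, not a definitive implementation. The class name StripEmptyContent, the clean method, and the inline stylesheet are illustrative assumptions; the stylesheet also assumes HashLine is in no namespace. It is an identity transform plus one rule that drops whitespace-only text nodes directly inside HashLine.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of the original answer.
final class StripEmptyContent {
    // Identity transform, plus an empty template that swallows
    // whitespace-only text children of HashLine (no namespace assumed).
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match=\"HashLine/text()[normalize-space(.)='']\"/>"
      + "</xsl:stylesheet>";

    /** Returns a copy of the XML with HashLine's whitespace content removed. */
    static String clean(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(
            new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(
            clean("<root>\n<HashLine _id='FI6'>\n</HashLine>\n</root>"));
    }
}
```

The cleaned string could then be handed to the unmarshaller as a Source, for example wrapped in a StreamSource over a StringReader. For large documents, transforming into a DOMResult or a temporary file may be preferable to building a String.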

skaffman
I don't think it is so much that the document isn't JAXB-friendly as it isn't validation-friendly. You're probably right about it being better to target specific elements. I imagine you could do something similar with DOM/XPath, but it wouldn't be as elegant as using XSLT.
McDowell
Yeah, I think a declarative approach will be neater than an imperative one in this case. If your XML documents don't conform to the schema, you need to fix that up before passing them through the validator. XSLT is good at that sort of thing.
skaffman
Both of the answers provided worked. I tried the other answer first, as it included some nice sample code. Later I switched to this solution, for various reasons.
DragonFax