Say I have an XML document (represented as text, a W3C DOM, whatever), and also an XML Schema. The XML document has all the right elements as defined by the schema, but in the wrong order.

How do I use the schema to "re-order" the elements in the document to conform to the ordering defined by the schema?

I know that this should be possible, probably using XSOM, since the JAXB XJC code generator annotates its generated classes with the correct serialization order of the elements.

However, I'm not familiar with the XSOM API, and it's pretty dense, so I'm hoping one of you lot has some experience with it, and can point me in the right direction. Something like "what child elements are permitted inside this parent element, and in what order?"


Let me give an example.

I have an XML document like this:

<A>
   <Y/>
   <X/>
</A>

I have an XML Schema which says that the contents of <A> must be an <X> followed by a <Y>. Now clearly, if I try to validate the document against the schema, it fails, since the <X> and <Y> are in the wrong order. But I know my document is "wrong" in advance, so I'm not using the schema to validate just yet. However, I do know that my document has all of the correct elements as defined by the schema, just in the wrong order.

What I want to do is to programmatically examine the Schema (probably using XSOM - which is an object model for XML Schema), and ask it what the contents of <A> should be. The API will expose the information that "you need an <X> followed by a <Y>".

So I take my XML document (using a DOM API) and re-arrange <X> and <Y> accordingly, so that the document now validates against the schema.
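For the DOM half of this, a minimal Java sketch using only the JDK's built-in DOM classes might look like the following. It assumes the correct order has already been obtained from the schema somehow and is passed in as a plain list of element names. `ChildReorderer`, `reorderChildren` and `reorder` are illustrative names, not part of XSOM or any other library.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ChildReorderer {

    /**
     * Moves the child elements named in expectedOrder to the end of parent,
     * in that order. Children that are not named stay where they are.
     */
    static void reorderChildren(Element parent, List<String> expectedOrder) {
        // Snapshot the element children first, because moving nodes changes the sibling chain.
        List<Element> children = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                children.add((Element) n);
            }
        }
        for (String name : expectedOrder) {
            for (Element child : children) {
                if (child.getTagName().equals(name)) {
                    parent.appendChild(child); // appendChild moves an already-attached node
                }
            }
        }
    }

    /** Convenience wrapper: parse the text, reorder the root's children, serialize again. */
    static String reorder(String xml, List<String> expectedOrder) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            reorderChildren(doc.getDocumentElement(), expectedOrder);
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The order list is hard-coded here; in the real problem it would come from the schema.
        System.out.println(reorder("<A><Y/><X/></A>", Arrays.asList("X", "Y")));
    }
}
```

The sketch only touches one level of the tree and leaves whitespace text nodes alone, so it is a starting point rather than a complete answer.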

It's important to understand what XSOM is here - it's a Java API which represents the information contained in an XML Schema, not the information contained in my instance document.

What I don't want to do is generate code from the schema, since the schema is unknown at build time. Furthermore, XSLT is no use, since the correct ordering of the elements is determined solely by the data dictionary contained in the schema.

Hopefully that's now explicit enough.

+2  A: 

Your problem translates to this: you have an XML file that doesn't match the schema and you want to transform it to something that's valid.

With XSOM, you can read the structure in the XSD and perhaps analyze the XML, but you would still need an additional mapping from the invalid form to the valid form. Using a stylesheet would be much easier, because you would walk through the XML, using XPath expressions to handle the elements in the proper order. With an XML where you want apples before pears, the stylesheet would first copy the apple node (/Fruit/Apple) before it copies the pear node. That way, no matter what the order is in the old file, they would be in the correct order in the new file.
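As a rough illustration of the apples-before-pears idea, here is a small Java sketch that runs a hand-written XSLT 1.0 stylesheet through the JDK's built-in `javax.xml.transform` API. The element names `Fruit`, `Apple` and `Pear` come from the example above; the class name `FruitOrderTransform` and the stylesheet itself are assumptions made for the demonstration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class FruitOrderTransform {

    // Copies every Apple child before every Pear child, whatever the input order was.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output omit-xml-declaration='yes'/>"
        + "<xsl:template match='/Fruit'>"
        + "<Fruit>"
        + "<xsl:copy-of select='Apple'/>"
        + "<xsl:copy-of select='Pear'/>"
        + "</Fruit>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<Fruit><Pear/><Apple/></Fruit>"));
    }
}
```

Note that the order is baked into the stylesheet by hand here, which is exactly the part that would have to be derived from the XSD in the general case.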

What you could do with XSOM is to read the XSD and generate the stylesheet that will re-order the data. Then transform the XML using that stylesheet. Once XSOM has generated a stylesheet for the XSD, you can just re-use the stylesheet until the XSD is modified or another XSD is needed.

Of course, you could use XSOM to copy nodes immediately in the right order. But since this means your code has to walk through all nodes and child nodes itself, it might take some time to finish. A stylesheet would do the same, but the transformer will probably be able to process it all faster. It can work directly on the data, while the Java code would have to get/set every node through the XMLDocument properties.


So, I would use XSOM to generate a stylesheet for the XSD which would just copy the XML node by node, and re-use it over and over again. The stylesheet would only need to be regenerated when the XSD changes, and it would perform faster than when the Java API needs to walk through the nodes itself. The stylesheet doesn't care about the order of the input, so the output would always end up in the right order.

To make it more interesting, you could just skip XSOM and try to work with a stylesheet that reads the XSD to generate another stylesheet from it. This generated stylesheet would copy the XML nodes in the exact order defined in the schema. Would it be complex? Actually, the stylesheet would need to generate templates for every element and make sure the child elements in this element are processed in the correct order.

When I think about this, I wonder if this has been done before already. It would be very generic and would be able to handle almost every XSD/XML.

Let's see... Using "//xsd:element/@name" you would get all element names in the schema. Every unique name would need to be translated to a template. Within these templates, you would need to process the child nodes of the specific element, which is slightly more complex to get. Elements can have a reference, which you would need to follow. Otherwise, get all the child xsd:element nodes inside it.
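To make the "read the child element names out of the XSD" step concrete, here is a hedged Java sketch that uses plain JDK XPath over the schema document. It is neither XSOM nor XSLT, just the same lookup idea expressed in Java. It assumes the simplest possible schema shape: an element declared with an inline `complexType` containing a direct `sequence`. Named types, extensions, `choice`, groups and substitution groups are all ignored. `SchemaChildOrder` and `childOrder` are made-up names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SchemaChildOrder {

    /** Returns the child element names of elementName, in the order the xs:sequence lists them. */
    static List<String> childOrder(String xsd, String elementName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so local-name() sees "element", not "xs:element"
            Document schema = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xsd)));

            // Only matches <element name="..."><complexType><sequence><element .../>...
            String expr = "//*[local-name()='element'][@name='" + elementName + "']"
                    + "/*[local-name()='complexType']"
                    + "/*[local-name()='sequence']"
                    + "/*[local-name()='element']";
            NodeList declarations = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, schema, XPathConstants.NODESET);

            List<String> names = new ArrayList<>();
            for (int i = 0; i < declarations.getLength(); i++) {
                Element decl = (Element) declarations.item(i);
                // A child is either declared inline (name=...) or referenced (ref=...).
                String name = decl.hasAttribute("name")
                        ? decl.getAttribute("name")
                        : decl.getAttribute("ref");
                names.add(name.substring(name.indexOf(':') + 1)); // drop any prefix on a ref
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
                + "<xs:element name='A'><xs:complexType><xs:sequence>"
                + "<xs:element name='X'/><xs:element ref='Y'/>"
                + "</xs:sequence></xs:complexType></xs:element>"
                + "<xs:element name='Y'/></xs:schema>";
        System.out.println(childOrder(xsd, "A"));
    }
}
```

For anything beyond this toy shape, a real schema object model is likely the safer route, since following type references by hand gets complicated quickly.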

Workshop Alex
Yep, that's the way to go.
JG
OK, cool, we're both on the same page now :) I agree that an XSL transform would re-arrange my document more efficiently than manually poking around in the DOM, but the initial problem of using the XSOM API to find out what the order *should* be remains, regardless of the mechanism I use to perform the re-ordering itself.
skaffman
I suddenly wonder if it isn't possible to use a stylesheet to transform an XSD into an XML-copying stylesheet. Would make an interesting cross-platform solution. If you're already familiar with XSD and XSLT then this might be easier than having to learn more about XSOM.
Workshop Alex
I dunno, schemas can be fearsomely complex, especially the ones I'm working with.... extended types, substitution groups, all that stuff. Scary.
skaffman
Keep an eye open for this Q: http://stackoverflow.com/questions/1437443/ ;-)
Workshop Alex
+3  A: 

I don't have a good answer to this yet, but I have to note that there is potential for ambiguity there. Consider this schema:

<xs:element name="root">
  <xs:choice>
    <xs:sequence>
      <xs:element name="foo"/>
      <xs:element name="bar">
        <xs:element name="dee"/>
        <xs:element name="dum"/>
      </xs:element>
    </xs:sequence>
    <xs:sequence>
      <xs:element name="bar">
        <xs:element name="dum"/>
        <xs:element name="dee"/>
      </xs:element>
      <xs:element name="foo"/>
    </xs:sequence>
  </xs:choice>
</xs:element>

and this input XML:

<root>
  <foo/>
  <bar>
    <dum/>
    <dee/>
  </bar>
</root>

This could be made to comply with the schema either by reordering <foo> and <bar>, or by reordering <dee> and <dum>. There doesn't seem to be any reason to prefer one over another.

Pavel Minaev
Well spotted, that's a fair point. In my case, however, I know that such an ambiguity wouldn't arise, since every `<bar>` would have the same schema type, with the same child ordering.
skaffman
Good point (+1), but how common would such constructions be? And why would someone use such a construction?
Workshop Alex
+1  A: 

Basically you want to take the root element and from there recursively look at the children in the document and the children defined in the schema and make the order match.

I'll give you a C#-syntax solution, since that's what I code in day and night, and it's pretty close to Java. Note that I'll have to take guesses about XSOM since I don't know its API. I've also made up the XML DOM methods, since giving you the C# ones probably wouldn't help :)

// Assume the first call is:
// SortChildrenIntoNewDocument( sourceDom.DocumentElement, targetDom.DocumentElement, schema.RootElement )

public void SortChildrenIntoNewDocument( XmlElement source, XmlElement target, SchemaElement schemaElement )
{
    // GetChildren() stands in for whatever method you use to ask XSOM for the correct contents, in schema order
    SchemaElement[] orderedChildren = schemaElement.GetChildren();
    for( int i = 0; i < orderedChildren.Length; i++ )
    {
        // SelectChildByName and AddChild are made-up DOM methods
        XmlElement sourceChild = source.SelectChildByName( orderedChildren[ i ].Name );
        if( sourceChild == null )
        {
            continue; // element is absent from the source document (e.g. optional)
        }
        XmlElement targetChild = target.AddChild( sourceChild );
        // recurse into the child that was just copied
        SortChildrenIntoNewDocument( sourceChild, targetChild, orderedChildren[ i ] );
    }
}

I wouldn't recommend a recursive method if the tree is going to be deep; in that case you would have to create some 'tree walker' type objects instead. The advantage of that approach is that you can handle more complex things: when the schema says you can have 0-or-more of an element, you keep processing source nodes until no more of them match, then move the schema walker on from there.
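A non-recursive version could be sketched in Java along these lines, using an explicit stack instead of the call stack and moving every matching occurrence of a name so that repeated (0-or-more) elements stay together. The `orderByParent` map is a hypothetical stand-in for whatever the schema API would report; `IterativeReorderer` and its methods are illustrative names only.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Arrays;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class IterativeReorderer {

    /** Reorders children throughout the tree; orderByParent maps a parent tag to its child order. */
    static void reorderTree(Element root, Map<String, List<String>> orderByParent) {
        Deque<Element> pending = new ArrayDeque<>(); // explicit stack replaces recursion
        pending.push(root);
        while (!pending.isEmpty()) {
            Element parent = pending.pop();
            // Snapshot the element children before any of them are moved.
            List<Element> children = new ArrayList<>();
            for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element) {
                    children.add((Element) n);
                }
            }
            List<String> order =
                    orderByParent.getOrDefault(parent.getTagName(), Collections.emptyList());
            for (String name : order) {
                // Move every occurrence of this name, so repeated elements end up adjacent.
                for (Element child : children) {
                    if (child.getTagName().equals(name)) {
                        parent.appendChild(child);
                    }
                }
            }
            children.forEach(pending::push); // visit each child later
        }
    }

    static String reorder(String xml, Map<String, List<String>> orderByParent) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            reorderTree(doc.getDocumentElement(), orderByParent);
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> order = new HashMap<>();
        order.put("A", Arrays.asList("X", "Y")); // hypothetical: A contains X, then any number of Y
        order.put("Y", Arrays.asList("P", "Q"));
        System.out.println(reorder("<A><Y><Q/><P/></Y><X/><Y/></A>", order));
    }
}
```

Keying the order by tag name only works when every element with a given name has the same content model, so this would not cope with `xs:choice` or with same-named elements of different types.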

Timothy Walters
It's not as simple as that, because there's no `getChildren()` - as I understand it, there could be things like `xs:choice`, or `maxOccurs > 1`, so there may not even be a single specific element as the Nth child - it would be "X or Y or ...", essentially arbitrarily long.
Pavel Minaev
I figured it might be like that (given the nature of XSD), so the 2nd option I mentioned is the only way to go really. If no-one else comes up with a solution I can post an example of how it would work.
Timothy Walters