Hi,

I found a few tools on the web that generate an XML Schema from a given XML instance document.

I'm also thinking of developing one, but I'm not sure how to evaluate the generated schema.

How can the generated schema be evaluated, i.e. how can I check that it conforms to the given data? Is there a way to formulate some kind of accuracy measure that says the generated XML Schema is 80% or 90% correct for the given XML data?

Please help me out with any pointers.

cheers

A: 

A schema generated from an existing XML document will only be as good as the original XML. If your sample XML is a complete example of the XML that will be used, your generated schema will work. If it is incomplete or poorly formed, it won't.

Dave Swersky
I'm wondering whether I can evaluate the generated schema's accuracy when I don't have the original XML Schema, only an XML document and the corresponding XML Schema generated by the tool.
Andriyev
That's simply not true: a sample is just a sample. Even if it is complete, it simply cannot represent all the legal variations of the XML content.
bortzmeyer
@bortzmeyer: I was referring to the schema definition. A complete sample will generate a schema that can be used for validation.
Dave Swersky
+2  A: 

I believe you are asking for the impossible. An automatically generated schema (I use Examplotron) can never be perfectly accurate because the generation tool does not have enough information.

For instance, if there is an element <foobar> in the XML document, how could the generation tool know whether it is mandatory or not? Or whether more than one value is accepted? Without knowing the original schema, you have no way of saying whether the generated schema is accurate or not. (Examplotron solves the problem by allowing the author to put structured comments in the XML file, to guide the program.)
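To make the point concrete, here is a minimal Python sketch (my own illustration, not how Examplotron works). The helper name infer_occurrence is hypothetical. It guesses an occurrence indicator for each child of the root element purely from the samples it is given, so its answer changes as soon as a new sample shows up:

```python
from collections import Counter
import xml.etree.ElementTree as ET

def infer_occurrence(samples):
    """Guess an occurrence indicator for each child of the root element.

    Hypothetical helper for illustration only. Returns a dict mapping
    tag -> "" (exactly one), "?" (optional), "+" (one or more),
    "*" (zero or more), based solely on the sample documents seen.
    """
    # Count how often each child tag appears in each sample document.
    per_doc = [Counter(child.tag for child in ET.fromstring(s)) for s in samples]
    tags = set().union(*per_doc)
    result = {}
    for tag in sorted(tags):
        lo = min(c[tag] for c in per_doc)  # Counter returns 0 for absent tags
        hi = max(c[tag] for c in per_doc)
        if lo == 0:
            result[tag] = "*" if hi > 1 else "?"
        else:
            result[tag] = "+" if hi > 1 else ""
    return result

one_sample = ["<data><foobar/></data>"]
print(infer_occurrence(one_sample))                # <foobar> looks mandatory
print(infer_occurrence(one_sample + ["<data/>"]))  # now it looks optional
```

With a single sample, <foobar> is inferred as mandatory. Add one more sample without it and the guess flips to optional. Neither answer can be called "accurate" without the original schema.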

Here is an example. With this XML file:

<data>
<foo>1</foo>
<bar>text</bar>
<baz/>
</data>

Examplotron generated this schema (a bit edited):

start =
  element data {
    element foo { xsd:integer },
    element bar { text },
    element baz { empty }
  }

Note the xsd:integer in the element <foo>. A nice inference, but is it accurate? Maybe <foo> was supposed to be of a more general type, such as xsd:string...
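The same ambiguity can be shown with a small Python sketch of sample-driven type guessing. This is an assumption about how such a tool might reason, not Examplotron's actual algorithm, and infer_type is a hypothetical name. It picks the narrowest of three candidate types that fits every observed value:

```python
import re

def infer_type(values):
    """Pick the narrowest candidate type that matches every sample value.

    Hypothetical helper for illustration only. Candidates are limited to
    xsd:integer, xsd:decimal and plain text.
    """
    stripped = [v.strip() for v in values]
    if all(re.fullmatch(r"[+-]?\d+", v) for v in stripped):
        return "xsd:integer"
    if all(re.fullmatch(r"[+-]?(\d+(\.\d*)?|\.\d+)", v) for v in stripped):
        return "xsd:decimal"
    return "text"

print(infer_type(["1"]))          # only the value from the sample above
print(infer_type(["1", "2.5"]))   # a second document widens the guess
print(infer_type(["1", "n/a"]))   # and a third kind of value widens it again
```

From the single value 1 the guess is xsd:integer, but every additional document can widen it. The generated type only tells you what the samples had in common, not what the schema author intended.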

bortzmeyer
Thanks for the reply. It makes sense. I had a sneaky feeling that I was hitting the wall. :)
Andriyev