tags:

views:

51

answers:

1

I need to verify that a "simplified" DTD is really a subset of a larger DTD, i.e. that documents that are valid according to the "simplified" DTD will also always be valid according to the larger (or "master") DTD.

The simplified DTD is being written now -- it's derived from the master DTD (were it the other way around, one could simply include the smaller DTD into the larger one).

I've found just one DTD diff tool, which is a Perl script: http://search.cpan.org/~ehood/SGML-DTDParse-2.00/bin/dtddiff

Is it any good?

There is also a paper describing an alortithm but no link to any actual implementation.

What other approaches would there be?

+2  A: 

DTDs are really just context-free grammars in disguise. A grammar G represents the set of possible legal strings that comprise the unstated language L(G) that grammar represents.

What you are asking is tantamount to determining if you have G1 and G2, whether L(G1) is a subset of L(G2). My language theory is getting rusty and I don't remember if this is computable in general or not, but my guess this is really hard, because you have to demonstrate that an arbitrary derivation in G1 always has a derivation in G2.

You might be able to answer the question of whether G1 is structured in such a way that you can demonstrate that L(G1) is a subset of L(G2) by demonstrating that each element of G1 is compatible with each element of G2, essentialy by showing that each grammar rule in G1 has a corresponding rule in G2 with elements dropped. Your idea of diffing DTDs seems to be along this line, with the proviso that if the the diffs are large you are stuck with the general problem rather than the simpler one. At least the way you've characterized the problem (G2 is derived from the master DTD) I think you have chance. The purpose of the diff would be to identify compatible rules by finding the least differences.

If you have grammar rule g2 = A ; and another g1 = A that you'd claim are related and that you'd like to check, you'd first have to demonstrate that the string tokens that A derived in G1 is a superset of the string of tokens A derived in G2. This looks just like the original unconstrained problem of comparing two langauges; we're now just comparing the sublanguages for the two rules g1 and g2.

So now I think you have to insist that each subrule reachable by g1 is compatible structurally with a corresponding subrule in g2 to make this practical. I think you can probably write a recursive procedure to check this. What this procedure mostly needs as help is all the set operators (FirstOf, ..) that you tend to find in an LALR parser generator.

On a different front, my company makes Smart Differencer tools, that compute deltas over language constructs in terms of langauge elements and editing operations on those elements. It is parameterized by language definitions. The SmartDifference presently works for a variety of conventional languages (C, C++, C#, COBOL, Java, PHP, Python, ....). XML (and DTDs) are a language, too, for which we have a language definition, and we've built an experimental XML Smart Differencer tools. It ought to work on DTDs just fine. Contact me offline (see bio) if you have further direct interest.

EDIT: Just for grins, I tried the following two DTDs, one derived from the other:

orderform.xml:

<?xml version='1.0' ?>
<!DOCTYPE orderform [

<!ELEMENT orderform (name,company,address,items) >
<!ELEMENT name ( firstname, lastname )>
<!ELEMENT firstname ( #PCDATA )>
<!ELEMENT lastname ( #PCDATA )>
<!ELEMENT company ( #PCDATA )>
<!ELEMENT address ( street, city, country )>
<!ELEMENT street ( #PCDATA )>
<!ELEMENT city( #PCDATA )>
<!ELEMENT country ( zipcode | nation )>
<!ELEMENT zipcode ( #PCDATA )>
<!ELEMENT nation ( #PCDATA )>
<!ELEMENT items (item)+ >
<!ELEMENT item ( partnumber, quantity, unitprice)>
<!ELEMENT partnumber ( #PCDATA )>
<!ELEMENT quantity ( #PCDATA )>
<!ELEMENT unitprice  ( #PCDATA )>
]>

<done/>

and orderform2.xml:

<?xml version='1.0' ?>
<!DOCTYPE orderform [

<!ELEMENT orderform (name,company,location,item) >
<!ELEMENT name ( firstname, lastname )>
<!ELEMENT firstname ( #PCDATA )>
<!ELEMENT lastname ( #PCDATA )>
<!ELEMENT company ( #PCDATA )>
<!ELEMENT location ( street, city, country )>
<!ELEMENT street ( #PCDATA )>
<!ELEMENT city( #PCDATA )>
<!ELEMENT country ( zipcode | nation )>
<!ELEMENT zipcode ( #PCDATA )>
<!ELEMENT nation ( #PCDATA )>
<!ELEMENT item ( partnumber, unitprice)>
<!ELEMENT partnumber ( #PCDATA )>
<!ELEMENT quantity ( #PCDATA )>
<!ELEMENT unitprice  ( #PCDATA )>
]>

<done/>

[See if you can spot the differences yourself, first :-)

And ran the XML SmartDifferencer:

C:\DMS\Domains\XML\Analyzers\SmartDifferencer\Source>DMSSmartDifferencer XML -SuppressSourceCodeForRenamings C:\DMS\Domains\XML\Tool
s\DTD2COBOL\orderform.xml C:\DMS\Domains\XML\Tools\DTD2COBOL\orderform2.xml
Copyright (C) 2009 Semantic Designs; All Rights Reserved
XML SmartDifferencer Version 1.1.1
Copyright (C) 2009 Semantic Designs, Inc; All Rights Reserved; SD Confidential
Powered by DMS (R) Software Reengineering Toolkit
*** Unregistered SmartDifferencer Version 1.1
*** Operating with evaluation limits.

*** Parsing file C:/DMS/Domains/XML/Tools/DTD2COBOL/orderform.xml ...
*** Parsing file C:/DMS/Domains/XML/Tools/DTD2COBOL/orderform2.xml ...
*** Creating suffix tree ...
*** Determining maximal pairs ...
*** Sorting maximal pairs ...
*** Determining differences ...
*** Printing edits ...
Rename 4.1-9.44 to 4.1-9.45 with 'address'->'location' and 'items'~>'item'
Delete 15.1-15.25 merging 15.18-15.21 into 4.44-4.47
<<!ELEMENT items (item)+ >
Delete 16.30-16.38 merging 16.30-16.38 into 15.18-15.28 with 'quantity'~>'partnumber'
<                             quantity,

Yep, that's what I did to get the derived one. (The notation N.M means "line N, column M").

Ira Baxter
Many thanks for this detailed answer; I have a feeling that what I'm after is either very hard or impossible in general. However, since we're writing the smaller DTD now, the best approach may be to be extra carefull (ie, "manually").And then we'll run documents by the two DTDs; if they match the "smaller" one and not the "larger" one, then something's wrong...There may be a "brute force" approach along this line: generate many random instances of documents that verify the small DTD and run them by the larger one, see if it works?I'm also glad to learn about Smart Differencer! Thanks.