In my current project we have a large repository of content that was originally published in book form. Much of this content was published in both English and many foreign languages, mostly using QuarkXPress and later InDesign. This content was exported into a custom XML structure for storage and future use. The issue is that the English XML was exported and then enhanced over time by editors, in both structure and metadata, which left the structure of the foreign-language XML different from that of the English version. For example:

English XML:

<chapter meta="meta data added">
    <section meta="some meta about the section">
        <paragraph>some english paragraph</paragraph>
        <list>
            <li>some english list item</li>
        </list>
    </section>
</chapter>

Foreign XML:

<chapter>
    <section>
        <paragraph>some original foreign language paragraph</paragraph>
    </section>
</chapter>

As you can see, the foreign-language XML is at times missing elements as well as attributes. The problem is that at this point we want to compare the foreign-language structure to the English, add in the missing metadata attributes and elements, and then report on the non-translated parts of the XML.

The current process for this involves stripping out the element data and placing it into a web application. From there, a user matches a foreign-language paragraph with its English counterpart (using jQuery so they can simply click on an item and then on its match), and the match is saved as an attribute (a unique ID). At that point I know which elements correspond between the two language documents, so I can flow the foreign-language content into the English-structured XML. This leaves me with the foreign-language content (marked by unique ID) inside the English-structured XML, which I can then query for elements without a unique ID to find the items that still need translation.
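For that last step, a simple LINQ to XML query can surface the untranslated elements. Here is a minimal sketch; the attribute name uniqueID, the element names, and the file name are assumptions based on the examples above:

using System;
using System.Linq;
using System.Xml.Linq;

class UntranslatedReport
{
    static void Main()
    {
        // Load the English-structured document that the foreign-language
        // content was flowed into. "merged-chapter.xml" is a placeholder name.
        var doc = XDocument.Load("merged-chapter.xml");

        // Content-bearing elements with no uniqueID attribute were never
        // matched, so they still need translation.
        // (Element and attribute names here are assumptions.)
        var untranslated = doc.Descendants()
            .Where(e => (e.Name == "paragraph" || e.Name == "li")
                        && e.Attribute("uniqueID") == null);

        foreach (var element in untranslated)
            Console.WriteLine($"Needs translation: <{element.Name}> {element.Value}");
    }
}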

This process works fine; however, it is quite manual, requiring someone to go in and hand-click each paragraph. With literally hundreds of thousands of pages of content to go through, I am looking for ways to further automate the process. Are there better ways to compare XML documents for structure so that the above goals can be achieved with less manual intervention?

The current process uses C#, ASP.NET, LINQ to XML, and jQuery, among other things. But the language and tools are irrelevant! I just want to find a more automated solution. If it uses a DB, no problem. If we need to switch platforms, I don't mind. It's a matter of implementation rather than language. Thanks!

+1  A: 

In the past, I have used XSLT to transform two pieces of XML into a common format before comparing them with a textual diff tool (Beyond Compare).

This can work for you even if you require external data to do the conversion: you can pass external data into an XSL transform using the .NET XslCompiledTransform class, where it is accessible as parameters of the transform.
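As a minimal sketch of that approach, external data can be handed to the transform through an XsltArgumentList (the file names and the parameter name "lookup" are placeholders):

using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

class NormalizeXml
{
    static void Main()
    {
        // Stylesheet that maps each XML dialect into the common format.
        var xslt = new XslCompiledTransform();
        xslt.Load("normalize.xslt");

        // Pass a lookup table (an XML document) into the transform as a
        // parameter. Inside the stylesheet it is declared with
        // <xsl:param name="lookup"/> and queried like any other node-set.
        var args = new XsltArgumentList();
        args.AddParam("lookup", "", new XPathDocument("lookup.xml").CreateNavigator());

        using (var writer = XmlWriter.Create("normalized.xml"))
        {
            xslt.Transform("foreign.xml", args, writer);
        }
    }
}

Running the same stylesheet (or a per-dialect variant) over both the English and the foreign-language documents yields two files in the common format that a textual diff tool can compare directly.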

John Saunders
This sounds very interesting to me. What sort of external data do you pass into the transform? I am not sure I follow that part of your response.
Tim C
@TimC: you could pass many different things if you wanted to. I only mention it because you say your current process uses "C#, ASP.NET, LINQ to XML, and jQuery, among other things". I thought you might use them because you needed additional data before you could process the XML. An example would be if you needed one or more lookup tables before you could run the transformation: you could pass in the lookup tables as XML documents that could be referenced during the transformation.
John Saunders