In my current project we have a large repository of content that was originally published in book form. Much of it was published in both English and many foreign languages, mostly using QuarkXPress and later InDesign. The content was exported into a custom XML structure for storage and future use. The issue is that, after export, editors enhanced the English XML over time in both structure and metadata, so the structure of the foreign-language XML now differs from the English version. For example:
English XML:
<chapter meta="meta data added">
<section meta="some meta about the section">
<paragraph>some english paragraph</paragraph>
<list>
<li>some english list item</li>
</list>
</section>
</chapter>
Foreign XML:
<chapter>
<section>
<paragraph>some original foreign language paragraph</paragraph>
</section>
</chapter>
As you can see, the foreign XML is at times missing elements as well as attributes. We now want to compare the foreign-language structure to the English, add the missing metadata attributes and elements, and then report on the non-translated parts of the XML.
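To make the comparison goal concrete, here is a minimal sketch in Python (chosen only for brevity; the language is not important to me). It walks the two sample documents above in parallel, treating the English one as the reference, and lists the attributes and elements the foreign one lacks. The function name `diff_structure` and the pairing rule (first unused sibling with the same tag) are assumptions for illustration, not what our tooling does, and the pairing rule would be too naive for documents where sibling order differs.

```python
import xml.etree.ElementTree as ET

ENGLISH = """<chapter meta="meta data added">
<section meta="some meta about the section">
<paragraph>some english paragraph</paragraph>
<list>
<li>some english list item</li>
</list>
</section>
</chapter>"""

FOREIGN = """<chapter>
<section>
<paragraph>some original foreign language paragraph</paragraph>
</section>
</chapter>"""


def diff_structure(eng, frn, path=""):
    """Return (kind, parent_path, name) tuples for things present in the
    English element but missing from the foreign one."""
    here = f"{path}/{eng.tag}"
    # Attributes the English element has and the foreign element lacks.
    diffs = [("missing-attribute", here, name)
             for name in eng.attrib if name not in frn.attrib]
    remaining = list(frn)  # foreign children not yet paired
    for child in eng:
        # Naive assumption: pair with the first unused foreign sibling
        # that has the same tag name.
        match = next((c for c in remaining if c.tag == child.tag), None)
        if match is None:
            diffs.append(("missing-element", here, child.tag))
        else:
            remaining.remove(match)
            diffs.extend(diff_structure(child, match, here))
    return diffs


diffs = diff_structure(ET.fromstring(ENGLISH), ET.fromstring(FOREIGN))
for d in diffs:
    print(d)
```

For the sample pair this reports the two missing `meta` attributes and the missing `list` element. It says nothing about whether the text inside a matched element is actually a translation of its counterpart, which is the hard part.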
The current process involves stripping the element data out and loading it into a web application. There, a user matches each foreign-language paragraph with its English counterpart (jQuery lets them click an item and then its match), and the match is saved as an attribute holding a unique ID. At that point I know which elements correspond between the two documents, so I can flow the foreign-language content into the English-structured XML. The result is the foreign-language content (marked by unique ID) inside the English structure, which I can query for elements without a unique ID to find the items that still need to be translated.
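The last two steps (flowing the matched content into the English structure and querying for what is left) can be sketched roughly as follows, again in Python only for brevity. The attribute name `uid`, the function name `flow_foreign_into_english`, and the sample ID `p1` are placeholders I made up for this post; the sketch also assumes each element holds plain text only and ignores mixed content.

```python
import copy
import xml.etree.ElementTree as ET

# Sample documents after the manual matching step: matched pairs share
# a hypothetical "uid" attribute.
ENGLISH = """<chapter meta="meta data added">
<section meta="some meta about the section">
<paragraph uid="p1">some english paragraph</paragraph>
<list>
<li>some english list item</li>
</list>
</section>
</chapter>"""

FOREIGN = """<chapter>
<section>
<paragraph uid="p1">some original foreign language paragraph</paragraph>
</section>
</chapter>"""


def flow_foreign_into_english(english_root, foreign_root):
    """Copy the English tree, swap in foreign text wherever a uid matches,
    and collect text-bearing elements that found no match."""
    foreign_text = {el.get("uid"): el.text
                    for el in foreign_root.iter() if el.get("uid")}
    merged = copy.deepcopy(english_root)  # leave the English source untouched
    untranslated = []
    for el in merged.iter():
        has_text = bool(el.text and el.text.strip())
        if el.get("uid") in foreign_text:
            el.text = foreign_text[el.get("uid")]
        elif has_text:
            # No foreign counterpart: drop any uid so the element shows up
            # in a "no unique ID" query, and report it.
            el.attrib.pop("uid", None)
            untranslated.append(el)
    return merged, untranslated


merged, untranslated = flow_foreign_into_english(
    ET.fromstring(ENGLISH), ET.fromstring(FOREIGN))
for el in untranslated:
    print(el.tag, "->", el.text)
```

For the sample pair, the merged tree keeps the English metadata and the `list` element, the paragraph carries the foreign text, and the `li` is reported as still needing translation.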
This process works, but it is quite manual: someone has to go in and click the paragraphs by hand. With literally hundreds of thousands of pages of content to go through, I am looking for ways to automate it further. Are there better ways to compare XML documents for structure so that the above goals can be met with less manual intervention?
The current process uses C#, ASP.NET, LINQ to XML and jQuery, among other things, but the language and tools are irrelevant. I just want to find a more automated solution. If it uses a DB, no problem. If we need to switch platforms, I don’t mind. It’s a matter of implementation rather than language. Thanks!