I am attempting to fix some bilingual XML files by using regular expressions to match known patterns of erroneous content and substitute the correct values. Most of the problems in the files can be considered typos or redundant data.

I do have a text processing tool, but it has no regex support, and the whole situation would be so much easier if I could just use sed or something similar to script up a batch job and leave it running overnight. An example sed script that would solve the problem might look like the following:

#!/bin/sed -f
s/<prop type="Att::Status">New/<prop type="Att::Status">Not Validated/g
s/<prop type="Att::Status">Approved/<prop type="Att::Status">Validated/g
....

I have discovered that sed doesn't like UTF-16 files much, and since we are dealing with bilingual XML in 34 different language combinations, it could be very dangerous to wrap the sed script in a tool like iconv. Most charset conversion tools cause corruption of some kind, and I'd rather not spend the rest of the week working out which languages the script handles correctly.

It is also worth mentioning that the XML is full of a client's accumulated translations from the last few years, so there is going to be plenty of malformed syntax in there that may trip up some tools.

So, in summary: sed + iconv is too risky, I have a basic global text replace tool, I have Notepad++, and I even have a list of replacement expressions in sed syntax. But is there an easier/better way?

A: 

I'd have thought that XSLT is your best bet for this sort of thing.

Tom
+1  A: 

See XMLStarlet. It's a command-line toolset for reading/manipulating XML.

In particular, the xml ed command is probably what you want. You can specify XPaths of what you want to change, and how to change it. It'll respect the specified XML character encoding etc., which your standard command-line tools will not.
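
As a rough, untested sketch (assuming the prop elements look exactly like the ones in the question's sed script), the two sed rules from the question might translate into something like:

xmlstarlet ed \
  -u "//prop[@type='Att::Status'][. = 'New']" -v 'Not Validated' \
  -u "//prop[@type='Att::Status'][. = 'Approved']" -v 'Validated' \
  input.xml > output.xml

(Depending on the packaging, the binary may be installed as xml rather than xmlstarlet.)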

Brian Agnew
Thanks. This looks like a good way forward without having to deal with the complexities of XSLT.
IanGilham
+1  A: 

I don't know if the complexities of XMLStarlet are any less than the complexities of XSLT; most of the complexity is actually in the XPath that you're going to use to find the nodes you want to change.

If you were to use XSLT, you'd simply create an identity transform and then add a template to change the text nodes you're interested in:

<xsl:template match="prop[@type='Att::Status']/text()">
   <xsl:choose>
      <xsl:when test=". = 'New'">Not Validated</xsl:when>
      <xsl:when test=". = 'Approved'">Validated</xsl:when>
      <xsl:otherwise>
         <xsl:copy/>
      </xsl:otherwise>
   </xsl:choose>
</xsl:template>
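
The identity transform itself is just the standard boilerplate that copies everything else through unchanged:

<xsl:template match="@*|node()">
   <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
   </xsl:copy>
</xsl:template>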

Or you could go nuts and specify the mapping in an external XML file, e.g.:

<map>
   <text value="New">Not Validated</text>
   <text value="Approved">Validated</text>
</map>

Then, in your XSLT:

<xsl:variable name="map" select="document('map.xml')/map/text"/>

<xsl:template match="prop[@type='Att::Status']/text()">
   <xsl:choose>
      <xsl:when test="$map[@value=current()]">
         <xsl:copy-of select="$map[@value=current()]/text()"/>
      </xsl:when>
      <xsl:otherwise>
         <xsl:copy/>
      </xsl:otherwise>
   </xsl:choose>
</xsl:template>
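
Either way, running it is a one-liner with any XSLT processor; for example, with xsltproc (assuming you've saved the stylesheet as fix-status.xsl, a name I'm making up here):

xsltproc -o fixed.xml fix-status.xsl input.xml

If the output needs to stay UTF-16, adding <xsl:output encoding="UTF-16"/> to the stylesheet should take care of that, with the processor handling the conversion rather than an external tool like iconv.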
Robert Rossney
Seems fairly straightforward, but the language is just so ugly. At least XPath is relatively succinct and legible. +1 for a nice example.
IanGilham
I think the language is quite elegant, myself, but I probably have Stockholm syndrome.
Robert Rossney