I am trying to fix some bilingual XML files by using regular expressions to match known patterns of erroneous content and substitute the correct values. Most of the problems in the XML files are typos or redundant data.
I do have a text-processing tool that does plain global search and replace, with no regex support, but the whole job would be much easier if I could use sed or something similar to script a batch job and leave it running overnight. An example sed script that should solve the problem might look like the following:
#!/bin/sed -f
s/<prop type="Att::Status">New/<prop type="Att::Status">Not Validated/g
s/<prop type="Att::Status">Approved/<prop type="Att::Status">Validated/g
....
I have discovered that sed does not handle UTF-16 files well, and since we are dealing with bilingual XML in 34 different language combinations, it could be very dangerous to wrap the sed script in a tool like iconv. In my experience, most charset conversion tools cause corruption of some kind, and I'd rather not spend the rest of the week working out which languages the script handles correctly.
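For reference, the kind of iconv wrapper I am wary of would look roughly like the sketch below. The function name fix_file and its arguments are made up for illustration, and it assumes every input file is UTF-16 with a BOM so that iconv can detect the byte order; I have not verified it across all the language pairs.

```shell
#!/bin/sh
# Sketch only: fix_file and its arguments are hypothetical names.
# Assumes each input file is UTF-16 with a BOM, so iconv can detect the byte order.

# usage: fix_file SED_SCRIPT INPUT OUTPUT
fix_file() {
    script=$1
    in=$2
    out=$3
    # Decode to UTF-8, apply the substitutions, then encode back to UTF-16.
    # Note: the pipeline's exit status is that of the last command only,
    # so a failure in the first iconv or in sed is not detected here.
    iconv -f UTF-8 -t UTF-8 /dev/null 2>/dev/null  # no-op check that iconv is callable
    iconv -f UTF-16 -t UTF-8 "$in" \
        | sed -f "$script" \
        | iconv -f UTF-8 -t UTF-16 > "$out"
}
```

It would be called along the lines of fix_file fix_status.sed input.xml output.xml, writing to a separate output file so the original is left untouched for comparison.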
It is also worth mentioning that the XML is full of a client's accumulated translations from the last few years, so there is going to be plenty of malformed syntax in there that may trip up some tools.
So, in summary: sed + iconv is too risky, I have a basic global text-replace tool, I have Notepad++, and I even have a list of replacement expressions in sed syntax. Is there an easier or better way?