I have several thousand XML files generated from Java properties files and prepared for translation in the TTX format. They contain quite a few variables that I need to protect from the translators, who often break such things. The variables take the form of numbers, or occasionally text, between a pair of curly braces, e.g. {0}, {this}.

I need to surround these variables with an xml element if they are not already an attribute and if they are not already part of the inner text of a ut element, like so:

<ut DisplayText="{0}">&lt;{0}&gt;</ut>

My input looks like this:

<ut Type="start"DisplayText="string">&lt;string&gt;</ut> text string {0} 
<ut DisplayText="{1}">&lt;{1}&gt;</ut> in:
<ut DisplayText="\n">&lt;\n/&gt;</ut> {2}.
<ut Type="end" DisplayText="resource">&lt;/resource&gt;</ut>

The correct output should be this:

<ut Type="start"DisplayText="string">&lt;string&gt;</ut> text string <ut DisplayText="{0}">{0}</ut> 
<ut DisplayText="{1}">&lt;{1}&gt;</ut> in:
<ut DisplayText="\n">&lt;\n/&gt;</ut> <ut DisplayText="{2}">{2}</ut>.
<ut Type="end" DisplayText="resource">&lt;/resource&gt;</ut>

My initial approach was to use a regular expression to match the term in the braces and build the XML element around it with pattern substitution. This approach fails when the variable is already protected, that is, when it already appears inside a ut element as in the first code block above.

Previous find and replace patterns (in Notepad++):

Find

({[A-Za-z0-9]*})

Replace

<ut DisplayText="\1">\1</ut>

It is beginning to look like regex is not the right tool for the job, so I would like some suggestions on better approaches to take, different tools, or even just a more complete regex that may allow me to solve this quickly and repeatably.

Update: The problem turned out to be a little more complex than previously envisioned. It seems there are also a couple more things that needed protecting, involving some rather obscure syntax, mixing variables with text in what appears to be some kind of conditional statement. From memory:

{o,choice|1#1  error|1&lt;{0,number,integer} errors}

Where "error" and "errors" are translatable and should not be protected. The simplest solution we have at present is to run the above regex, fix the few errors it creates, and then run a couple more ordinary find & replace passes for the more complex items. It could be abstracted out as a regex, but right now there is not much point in doing that.

I appreciate the pointers to xslt and other editors with better regex support, in addition to the improved expressions offered. I will have a play with some of the options when time allows.

A: 

Let me know if my assumption is wrong, but from your example it seems you want to change text that is in {} and not in a <ut> element. To me this seems like an easy use of XSLT. Simply output UT elements as they are and process any text in between.
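The same idea (leave ut elements untouched, process only the text between them) can also be sketched outside XSLT. Below is a minimal Python illustration, not a tested solution for real TTX files. It assumes ut elements are never nested and that no attribute value contains a literal ">". The function name protect_placeholders is made up for this example, and the placeholder pattern is the one from the question with "+" instead of "*". It does not handle the choice syntax from the update.

```python
import re

# Matches one complete <ut ...>...</ut> element.
# Assumption: ut elements are not nested and attribute values contain no ">".
UT_ELEMENT = re.compile(r"(<ut\b[^>]*>.*?</ut>)", re.DOTALL)

# Matches a simple placeholder such as {0} or {this}.
PLACEHOLDER = re.compile(r"\{[A-Za-z0-9]+\}")


def wrap(match):
    """Wrap a bare placeholder in a ut element, as in the desired output."""
    var = match.group(0)
    return '<ut DisplayText="%s">%s</ut>' % (var, var)


def protect_placeholders(text):
    """Wrap placeholders that are not already inside a ut element."""
    # Because the pattern has one capture group, re.split keeps the matched
    # ut elements in the result, at the odd indices. The even indices hold
    # the text between them, which is the only part that gets rewritten.
    parts = UT_ELEMENT.split(text)
    for i in range(0, len(parts), 2):
        parts[i] = PLACEHOLDER.sub(wrap, parts[i])
    return "".join(parts)


if __name__ == "__main__":
    line = r'<ut DisplayText="{1}">&lt;{1}&gt;</ut> in: {2}.'
    print(protect_placeholders(line))
```

Because already-wrapped variables are skipped, running the function a second time over its own output should leave it unchanged, which makes the pass safe to repeat over a batch of files.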

quadelirus
This is correct, plus the additional item at the bottom of the question with the unusual syntax.
IanGilham
A: 

Why not try using the expression

(?<=.){[A-Za-z0-9]+}(?=.$)

This would find the { with 1 or more letters or numbers and the } when this pattern follows the tag and any number of spaces AND is followed by any number of spaces and a line break.

Mentee
I originally tried something similar, but given that the variables fall within natural language text, with all its ambiguity and bad formatting, this does not cover all permutations in which the variables may appear. Refer to the update for additional annoyances.
IanGilham