tags:

views:

1089

answers:

2

I am in the middle of making a script for doing translation of xml documents. It's actually pretty cool, the idea is (and it is working) to take an xml file (or a folder of xml files) and open it, parse the xml, get whatever is in between some tags and using the google translate api's translate it and replace the content of the xml files.

As I said, I have this working, but only in fairly strict xml formatted documents, now I have to make it compatible with documents formatted differently. So my idea was:

Parse the xml, find a node, e.g:

<template>lorem lipsum dolor mit amet<think><set name="she">Ada</set></think></template>

Save this as a string, do some regex search and replace on this string. But i sadly have no clue on how to proceed. I want to search to the string (xml node) find text that is inbetween tags, in this case "lorem lipsum dolor mit amet" and "Ada", call a function with those text's as a parameter and then insert the result of the function in the same place as it originated from.

The reason i cant just get the text and rebuild the xml formatting is that there will be differently formatted xml nodes so i need it to be identical...

+7  A: 

Don't try to parse XML using regular expressions! XML is not regular and therefore regular expressions are not suited to doing this kind of task.

Use an actual XML parser. Many of these are readily available for Python. A quick search has lead me to this SO question which covers how to use XPath in Python.

Welbog
+3  A: 

ElementTree would be a good choice for this kind of parsing. It is easy to use and lightweight and supports outputting XML after you do operations on it (as simple as calling write()). It comes packaged with Python standard libraries in the latest versions (I believe 2.6+).

Kai