Hello,

I'm trying to write a Python script that takes in one or two XML files and outputs one or two new files based on the contents of the input files. I was trying to write this script using the minidom module. However, the input files contain a number of instances of the escape sequence &#xa; inside node attributes. Unfortunately, in the output files, these sequences have been converted to different characters, which seem to be literal newline characters.

For example, a line in the input file such as:

<Entry text="For English For Hearing Impaired&#xa;Press 3 on Keypad"

Would be output as

<Entry text="For English For Hearing Impaired
Press 3 on Keypad"

I read that minidom is causing this, as it doesn't allow escape characters in XML attributes (I think). Is this true? And, if so, what's the best tool/method to use to parse an XML file into a Python document, manipulate nodes and exchange them with other documents, and write the documents back out to new files?

If it helps, I was also parsing and saving these files using 'utf-8' encoding. I don't know if this is part of the problem or not. Thanks for any help anyone can give.

-Alex Kaiser

A:

I haven't used Python's standard xml modules since discovering lxml. It can do everything you're looking for. For example...

input.xml:

<?xml version="1.0" encoding='utf-8'?>
<root>
  <Button3 yposition="250" fontsize="16" language1="For English For Hearing Impaired&#xa;Press 3 on Keypad" />
</root>

and:

>>> from lxml import etree
>>> with open('input.xml', 'rb') as f:  # binary mode: the parser applies the declared encoding
...     root = etree.parse(f)
...
>>> buttons = root.xpath('//Button3')
>>> buttons
[<Element Button3 at 101071f18>]
>>> buttons[0]
<Element Button3 at 101071f18>
>>> buttons[0].attrib
{'yposition': '250', 'language1': 'For English For Hearing Impaired\nPress 3 on Keypad', 'fontsize': '16'}
>>> buttons[0].attrib['foo'] = 'bar'
>>> s = etree.tostring(root, xml_declaration=True, encoding='utf-8', pretty_print=True)
>>> print(s.decode('utf-8'))  # tostring() returns bytes when encoding='utf-8'
<?xml version='1.0' encoding='utf-8'?>
<root>
  <Button3 yposition="250" fontsize="16" language1="For English For Hearing Impaired&#10;Press 3 on Keypad" foo="bar"/>
</root>
>>> with open('output.xml', 'wb') as f:  # s is bytes, so write in binary mode
...     f.write(s)
>>>

ma3

A: 

&#xa; is the XML numeric character reference for character 0x0a, i.e. a newline (&#10; is the same character written in decimal). The parser is correctly parsing the XML and giving you the character indicated. If you want to forbid or otherwise deal with newlines in attributes, you are free to do whatever you like with them after the parser gives them to you.
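
To make the point above concrete, here is a minimal sketch using only the standard library. It assumes the attribute from the question, held in a hypothetical SOURCE string. The first part shows that minidom hands back a real newline after parsing. The second part swaps in xml.etree.ElementTree (not the lxml used in the other answer, just so the sketch runs without third-party packages) to show one serializer that writes the newline back out as a character reference.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical one-element document built from the attribute in the question.
SOURCE = '<Entry text="For English For Hearing Impaired&#xa;Press 3 on Keypad"/>'

# minidom resolves the character reference while parsing, so the attribute
# value already contains a real newline character.
dom_value = minidom.parseString(SOURCE).documentElement.getAttribute("text")
print(repr(dom_value))

# ElementTree parses the attribute to the same value...
entry = ET.fromstring(SOURCE)
et_value = entry.get("text")

# ...and escapes the newline as a character reference when serializing,
# so it is still an explicit newline when the output is parsed again.
round_tripped = ET.tostring(entry, encoding="unicode")
print(round_tripped)
```

Whether a given serializer escapes the newline on output is the part that differs between libraries, so it is worth checking the written file rather than assuming.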

Ned Batchelder