ansaurus

Question

Reading UTF-8 XML and writing it to a file with Python

Answer 1

+1 A:

You'll need to remove the call to encode() - that is, replace nodeValue.encode("utf-8") with nodeValue - and then change the call to open() to

with open("uiStrings-fi.py", "w", "utf-8") as f:

This uses a "Unicode-aware" version of open() which you will need to import from the codecs module, so also add

from codecs import open

to the top of the file.
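As a rough sketch of what that change looks like in isolation (the name/value pair and the temporary directory are made up for illustration, and the line format is only a guess at what the original script writes):

```python
import codecs
import os
import tempfile


def write_unicode_line(path, name, value):
    """Write one line of Unicode text; codecs.open encodes it to UTF-8 on write."""
    with codecs.open(path, "w", "utf-8") as f:
        # Both arguments are Unicode strings; no manual encode() call is needed.
        f.write(u'%s = "%s"\n' % (name, value))


# Hypothetical sample data, written to a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "uiStrings-fi.py")
write_unicode_line(path, u"greeting", u"Hyv\u00e4\u00e4 p\u00e4iv\u00e4\u00e4")

# Read the file back as raw bytes to see what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()
```

Here raw holds the UTF-8 bytes of the line, with each "ä" stored as the two bytes 0xC3 0xA4. In Python 3 the built-in open() takes an encoding argument directly, so the codecs import is only needed on Python 2.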

The issue is that calling nodeValue.encode("utf-8") converts a Unicode string (Python's text type, which can hold any Unicode character) into a regular byte string (a sequence of bytes, each in the range 0-255). Later on, when you construct the line to write to the output file, names[i] is still a Unicode string but values[i] is a byte string. Python tries to convert the byte string to Unicode, which is the more general type, but because you don't specify an explicit conversion, it uses the default codec, ASCII, which can't handle byte values greater than 127. Several of those occur in values[i], because UTF-8 encodes every non-ASCII character using bytes in that upper range. So Python raises a UnicodeDecodeError as soon as it hits the first such byte. The solution, as I said above, is to defer the conversion from Unicode to bytes until the last possible moment, and you do that by using the Unicode-aware version of open() (which handles the encoding for you).
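To see the failure on its own, here is a small sketch that performs explicitly the ASCII decode that Python 2 attempts implicitly when it mixes the two string types. The sample word and the helper name decode_like_python2 are made up for illustration.

```python
# A Unicode string with non-ASCII characters, encoded the way
# nodeValue.encode("utf-8") would encode it.
value_bytes = u"p\u00e4iv\u00e4".encode("utf-8")


def decode_like_python2(byte_string):
    """Mimic Python 2's implicit byte string -> unicode coercion (ASCII codec)."""
    return byte_string.decode("ascii")


try:
    decode_like_python2(value_bytes)
    error = None
except UnicodeDecodeError as exc:
    # The first byte above 127 cannot be decoded as ASCII.
    error = exc

# Naming the real encoding explicitly decodes the same bytes without trouble.
decoded = value_bytes.decode("utf-8")
```

The explicit decode("utf-8") succeeds where the ASCII one fails, which is why the bytes themselves are fine and only the implicit conversion is the problem.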

Now that I think about it, an alternative to what I said above would be to replace names[i] with names[i].encode("utf-8"). That way names[i] becomes a byte string as well, and Python has no reason to try to convert values[i] back to Unicode. Still, one could argue that it's good practice to keep your strings as Unicode objects until you write them out to the file; if nothing else, Unicode strings are the default in Python 3.
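A minimal sketch of that alternative, assuming the line is built by concatenation (the sample name, value, and line format are hypothetical, not taken from the original script):

```python
name = u"greeting"                                               # still Unicode
value = u"Hyv\u00e4\u00e4 p\u00e4iv\u00e4\u00e4".encode("utf-8")  # already bytes

# Encode the name too, so every piece of the line is a byte string
# and nothing has to be converted back to Unicode.
line = name.encode("utf-8") + b' = "' + value + b'"\n'
```

The resulting line is a byte string that can be written to a file opened in binary mode as-is.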

David Zaslavsky 2010-06-10 06:12:53

Answer 2

A:

The XML parser decodes the UTF-8 encoding of the input when it reads the file and all the text nodes and attributes of the resulting DOM are then unicode objects. When you select the interesting data from the DOM, you re-encode the values as UTF-8, but you don't encode the names. The resulting values array contains encoded byte strings while the names array still contains unicode objects.

In the line where the error is raised, Python tries to concatenate such a unicode name and a byte string value. To do so, both values have to be of the same type, so Python tries to convert the byte string values[i] to unicode. It doesn't know that the bytes are UTF-8 encoded, so it falls back to the ASCII codec and fails on the first non-ASCII byte.

The easiest way to work around this would be to keep all the strings as Unicode objects and just encode them to UTF-8 when they are written to the file:

values.append(value[0].firstChild.nodeValue) # don't encode yet
...
f.write(line.encode('utf-8')) # encode only when writing
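Put together, the flow might look like the sketch below. The XML layout (string elements with a name attribute and a value child), the sample text, and the output line format are assumptions made for illustration, since the original document isn't shown here.

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical UTF-8 input; the real file's structure may differ.
XML_BYTES = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<strings><string name="greeting">'
    u'<value>Hyv\u00e4\u00e4 p\u00e4iv\u00e4\u00e4</value>'
    u'</string></strings>'
).encode("utf-8")

dom = minidom.parseString(XML_BYTES)  # the parser decodes the UTF-8 input

names, values = [], []
for node in dom.getElementsByTagName("string"):
    names.append(node.getAttribute("name"))
    value = node.getElementsByTagName("value")
    values.append(value[0].firstChild.nodeValue)  # keep as Unicode

path = os.path.join(tempfile.mkdtemp(), "uiStrings-fi.py")
with open(path, "wb") as f:  # binary mode: we supply the bytes ourselves
    for name, text in zip(names, values):
        line = u'%s = "%s"\n' % (name, text)  # Unicode throughout
        f.write(line.encode("utf-8"))         # encode only at the boundary

with open(path, "rb") as f:
    written = f.read()
```

Opening the file in binary mode keeps the sketch working on both Python 2 and Python 3, since write() then always receives a byte string.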

sth 2010-06-10 06:16:57
