My aim is to write an XML file with a few tags whose values are in a regional language. I'm using Python and programming in IDLE (the Python GUI).

When I try to write the words to the XML file, I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

For now, I'm not using any XML writer library; instead, I'm opening a file "test.xml" and writing the data into it. The error is raised by the line f.write(data). If I replace that write statement with a print statement, the data prints properly in the Python shell.

I'm reading the data from an Excel file which is not in UTF-8, UTF-16, or UTF-32. It's in some other encoding; reading it as cp1252 works properly.

Any help in getting this data written to an XML file would be highly appreciated.

+5  A: 

You should .decode your incoming cp1252 to get Unicode strings, and .encode them in utf-8 (by far the preferred encoding for XML) at the time you write, i.e.

f.write(unicodedata.encode('utf-8'))

where unicodedata is obtained by .decode('cp1252') on the incoming bytestrings.
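A minimal sketch of that decode-then-encode mechanism, written for Python 3 (where a bytestring is `bytes` and a Unicode string is `str`), which is an assumption, since the thread itself concerns Python 2.6. The names `cp1252_to_utf8` and `raw_cell` are hypothetical, and an in-memory buffer stands in for the real output file:

```python
import io


def cp1252_to_utf8(raw_bytes):
    """Decode cp1252 bytes to text, then encode that text as UTF-8."""
    text = raw_bytes.decode("cp1252")   # bytes -> Unicode text
    return text.encode("utf-8")         # Unicode text -> UTF-8 bytes


# Hypothetical cell value as cp1252 bytes: 0xE9 is e-acute, 0x80 is the euro sign.
raw_cell = b"caf\xe9 \x80 5"

# io.BytesIO stands in for open("test.xml", "wb") so the sketch touches no files.
buffer = io.BytesIO()
buffer.write(cp1252_to_utf8(raw_cell))
```

The decode step only applies to byte strings. If the value is already a Unicode string, only the encode step is needed.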

It's possible to put lipstick on it by using the codecs module of the standard Python library to open the input and output files each with their proper encodings in lieu of plain open, but what I show is the underlying mechanism (and it's often, though not invariably, clearer and more explicit to apply it directly, rather than indirectly via codecs -- a matter of style and taste).
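As a rough illustration of the "open each file with its proper encoding" variant, the sketch below uses `io.open` rather than `codecs.open`; both accept an encoding argument, and the swap is a choice made here, not something the answer specifies. The file names, the sample content, and the helper `copy_as_utf8` are hypothetical:

```python
import io
import os
import tempfile


def copy_as_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Read text in src_encoding and write it back out as UTF-8."""
    # newline="" disables newline translation so line endings pass through unchanged.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
            for line in src:          # each line is already Unicode text
                dst.write(line)       # encoded to UTF-8 by the file object


# Build a small cp1252 input file in a temporary directory.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input.txt")
dst_path = os.path.join(workdir, "test.xml")
with open(src_path, "wb") as f:
    f.write(b"<name>caf\xe9</name>\n")

copy_as_utf8(src_path, dst_path)

with open(dst_path, "rb") as f:
    result = f.read()
```

Here the file objects do the decoding and encoding, so the processing code in between only ever sees Unicode text.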

What does matter is the general principle: translate your input strings to unicode as soon as you can right after you obtain them, use unicode throughout your processing, translate them back to byte strings as late as you can just before you output them. This gives you the simplest, most straightforward life!-)

Alex Martelli
Thanks for such a quick reply. :) I actually did the same operation, and that is when I got the error:
File "C:\test.py", line 64, in main
uData = items.decode('cp1252')
File "C:\Python26\lib\encodings\cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
I don't understand the actual reason why this is happening. Perhaps 'cp1252' is not the right encoding to decode with. What can I do in such a case? :(
Bobby
We can close this thread; my problem is solved. The data I had was already Unicode, so it was not supposed to be decoded. To write it to the XML file, I used the following code, which fixed my problem:
import cgi
dataToWrite = cgi.escape(data).encode("ascii", "xmlcharrefreplace")
Tons of thanks for the help.
Bobby
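For readers on Python 3, the fix in the last comment can be sketched as follows. This is an adaptation, not the commenter's code: `cgi.escape` is no longer available in current Python 3 releases, so `xml.sax.saxutils.escape` is swapped in for the markup escaping, and the helper name `to_ascii_xml_text` and the sample string are hypothetical.

```python
from xml.sax.saxutils import escape


def to_ascii_xml_text(text):
    """Escape XML markup characters, then replace every non-ASCII
    character with a numeric character reference such as &#233;."""
    escaped = escape(text)                              # handles &, < and >
    return escaped.encode("ascii", "xmlcharrefreplace")


# Hypothetical sample: Devanagari text followed by characters that need escaping.
sample = "\u0928\u092e\u0938\u094d\u0924\u0947 <world> & co"
encoded = to_ascii_xml_text(sample)
```

The result is pure ASCII bytes, so it can be written to a file opened in binary mode regardless of the encoding the XML file declares.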