ansaurus

Question

Python: How do I force iso-8859-1 file output?

Answer 1

A:

I think it's just:

outputFile = file( "textbase.tab", "wb" )
for k, v in textData.iteritems():
    complete_line = k + '~~~~~' + v + '~~~~~' + " ENDOFTHELINE"
    outputFile.write((complete_line + "\n").encode("iso-8859-1"))
outputFile.close()  # close once, after the loop

As you alluded to, you need to make sure you are decoding the RTF file correctly too. For this to work, k and v should be unicode objects.
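As a minimal sketch of the same idea in Python 3 syntax (where every str is already a Unicode string), the loop can be wrapped in a function that encodes each line and closes the file only once. The names format_line, write_latin1 and the sample data are hypothetical and only serve as illustration:

```python
import os
import tempfile


def format_line(key, value):
    # Same field layout as complete_line in the answer, plus the newline.
    return key + "~~~~~" + value + "~~~~~" + " ENDOFTHELINE" + "\n"


def write_latin1(path, text_data):
    # Binary mode: we encode each line ourselves. The with-block closes
    # the file exactly once, after the loop has finished.
    with open(path, "wb") as output_file:
        for key, value in text_data.items():
            output_file.write(format_line(key, value).encode("iso-8859-1"))


# Hypothetical sample data standing in for the parsed RTF content.
sample = {"caf\u00e9": "\u00a33"}
demo_path = os.path.join(tempfile.mkdtemp(), "textbase.tab")
write_latin1(demo_path, sample)
```

If a character has no ISO-8859-1 equivalent, encode() raises UnicodeEncodeError in its default strict mode, so bad input fails loudly instead of being written incorrectly.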

Matthew Flaschen 2010-02-03 12:22:10

Thank you. I have just tried this code, but get: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 753: ordinal not in range(128)". I'll now try making sure that k and v are unicode objects, as suggested above.

AP257 2010-02-03 13:01:05

Answer 2

A:

The main problem here is that you don't know what encoding your data is in. If we assume you are correct in that your file ends up being in Mac OS Roman, then you need to decode the data to unicode first, and then encode it as iso-8859-1.

inputFile = open("input.rtf", "rb") # The b flag is just a marker in Python 2.
data = inputFile.read().decode('mac_roman')
textData = yourparsefunctionhere(data)

outputFile = open( "textbase.tab", "wb" ) # don't use file()
for k, v in textData.iteritems():
    complete_line = k + '~~~~~' + v + '~~~~~' + " ENDOFTHELINE"
    outputFile.write((complete_line + "\n").encode("iso-8859-1"))
outputFile.close()  # close once, after the loop

But since it's RTF, I wouldn't be surprised if it's in a Windows encoding, so you might want to try that too. I don't know how RTF specifies the encoding.
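To illustrate why the guess about the source encoding matters, here is a small sketch in Python 3 syntax. The helper name transcode_to_latin1 and the sample bytes are assumptions made for the example; cp1252 is used as one plausible "Windows encoding", not as a claim about what the RTF file actually contains:

```python
def transcode_to_latin1(raw_bytes, source_encoding, errors="strict"):
    # Step 1: bytes -> Unicode text, using the assumed source encoding.
    text = raw_bytes.decode(source_encoding)
    # Step 2: Unicode text -> ISO-8859-1 bytes.
    return text.encode("iso-8859-1", errors)


# Hypothetical input: "caf" followed by the byte 0x8E.
raw = b"caf\x8e"

# Read as Mac OS Roman, 0x8E is e-acute, which ISO-8859-1 can represent.
from_mac = transcode_to_latin1(raw, "mac_roman")

# Read as cp1252, 0x8E is Z-with-caron, which ISO-8859-1 cannot represent.
try:
    transcode_to_latin1(raw, "cp1252")
    cp1252_strict_failed = False
except UnicodeEncodeError:
    cp1252_strict_failed = True

# errors="replace" substitutes "?" instead of raising, at the cost of data loss.
from_cp1252_lossy = transcode_to_latin1(raw, "cp1252", errors="replace")
```

A wrong guess does not always raise an error. Often it just produces the wrong characters, so it is worth checking a few known accented words in the output by eye.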

Lennart Regebro 2010-02-03 12:27:11

If you use r instead of rb, Windows will translate \r\n into \n when reading (including on Python 2.6).

Matthew Flaschen 2010-02-03 12:32:18

From the docs: "Append 'b' to the mode to open the file in binary mode, on systems that differentiate between binary and text files; on systems that don’t have this distinction, adding the 'b' has no effect." Having b or t (or neither) makes no difference at all on Unix. You may be thinking of "U", which is universal newlines. *That* translates line endings (never use U when writing!). Which systems differentiate between text and binary files, I don't know. Unix sure doesn't.

Lennart Regebro 2010-02-03 12:37:46
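The binary versus text mode discussion above concerns Python 2, where the behaviour depends on the platform. As a rough sketch of the same distinction in Python 3 (an assumption about the reader's environment, not something the thread covers), text mode translates line endings on read by default on every platform, while binary mode and newline="" leave them alone. The file name is hypothetical:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "newlines.txt")

# Binary mode writes the bytes exactly as given.
with open(path, "wb") as f:
    f.write(b"one\r\ntwo\n")

# Binary mode reads them back unchanged.
with open(path, "rb") as f:
    raw = f.read()

# Default text mode applies universal newlines: "\r\n" becomes "\n".
with open(path, "r", encoding="iso-8859-1") as f:
    translated = f.read()

# newline="" turns the translation off while still decoding to text.
with open(path, "r", encoding="iso-8859-1", newline="") as f:
    untranslated = f.read()
```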

Answer 3

+4 A:

Simply use the codecs module for writing the file:

import codecs
outputFile = codecs.open("textbase.tab", "w", "ISO-8859-1")

Of course, the strings you write have to be Unicode strings (type unicode); plain str objects (which are basically just arrays of bytes) won't be converted properly. I guess you are reading the RTF file with a normal Python file object too, so you might have to switch that to codecs.open as well.
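A short sketch of writing through an encoding-aware file object follows. Note that it swaps in io.open (available since Python 2.6, and the same function as the built-in open in Python 3) in place of codecs.open, since both accept an encoding and take Unicode strings on write. The sample line is hypothetical:

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "textbase.tab")

# The file object encodes to ISO-8859-1 on write, so no explicit
# .encode() call is needed. newline="\n" keeps "\n" as-is on all platforms.
with io.open(path, "w", encoding="iso-8859-1", newline="\n") as output_file:
    output_file.write(u"caf\u00e9~~~~~\u00a33~~~~~ ENDOFTHELINE\n")

# Read the raw bytes back to see what actually landed on disk.
with io.open(path, "rb") as f:
    written = f.read()
```

The design choice here is to let the file object own the encoding, which avoids scattering .encode() calls around the writing code.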

Torsten Marek 2010-02-03 12:33:57
