I'm writing some code to parse RTF documents, and need to handle the various codepages they can use. Python comes with decoders for all the necessary Windows codepages, but I'm not sure how to handle the Mac ones:

# 77: "10000", # Mac Roman
# 78: "10001", # Mac Shift Jis
# 79: "10003", # Mac Hangul
# 80: "10008", # Mac GB2312
# 81: "10002", # Mac Big5
# 83: "10005", # Mac Hebrew
# 84: "10004", # Mac Arabic
# 85: "10006", # Mac Greek
# 86: "10081", # Mac Turkish
# 87: "10021", # Mac Thai
# 88: "10029", # Mac East Europe
# 89: "10007", # Mac Russian

Does Python have any built-in support for these? If not, is there a cross-platform pure-Python library that will handle them?

+5  A: 

Python ships codecs for several of these, registered under names such as 'mac-roman' and 'mac-turkish'.

>>> 'foo'.decode('mac-turkish')
u'foo'

You'll have to refer to them by name; the numbers in your question don't appear in the source files. For more information, look at $pylib/encodings/mac_*.py.
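As a rough illustration of looking codecs up by name, here is a minimal Python 3 sketch that maps some of the code page numbers from the question to codec names. The `MAC_CODEPAGES` table and the `codec_for_codepage` helper are hypothetical, and the number-to-name pairings are my assumption about which codec matches which code page, so check them against your RTF samples. Code pages with no entry simply return None.

```python
import codecs

# Hypothetical table: code page number (from the question) -> Python codec name.
# The pairings are assumptions; only codecs expected to ship with CPython are listed.
MAC_CODEPAGES = {
    10000: "mac_roman",
    10006: "mac_greek",
    10007: "mac_cyrillic",
    10029: "mac_latin2",
    10081: "mac_turkish",
}


def codec_for_codepage(codepage):
    """Return the normalized codec name for a code page, or None if unsupported."""
    name = MAC_CODEPAGES.get(codepage)
    if name is None:
        return None
    try:
        # codecs.lookup raises LookupError when the codec is not installed.
        return codecs.lookup(name).name
    except LookupError:
        return None


if __name__ == "__main__":
    codec = codec_for_codepage(10000)
    # In Python 3, decode() is a method of bytes, not str.
    print(codec, b"caf\x8e".decode(codec))
```

The lookup goes through `codecs.lookup` so that a missing codec shows up as None at mapping time rather than as an exception in the middle of parsing.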

Jerub
Also, those Mac encodings date back to classic MacOS days and are largely obsolete in Mac OS X.
Ned Deily
+3  A: 

It seems that at least the Mac Roman and Mac Turkish encodings exist in the Python stdlib, under the names macroman and macturkish. See http://svn.python.org/projects/python/trunk/Lib/encodings/aliases.py for the complete list of encoding aliases in the most recent Python.
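The alias table can also be inspected on whatever Python you have installed, instead of reading the file online. A small sketch, assuming Python 3 and that the Mac codec modules keep their `mac_` prefix (the `mac_aliases` name is mine):

```python
from encodings.aliases import aliases

# aliases maps each alias to the codec module name, e.g. 'macroman' -> 'mac_roman'.
mac_aliases = {
    alias: target
    for alias, target in aliases.items()
    if target.startswith("mac_")
}

if __name__ == "__main__":
    for alias, target in sorted(mac_aliases.items()):
        print(alias, "->", target)
```

The exact set printed depends on the Python version, so treat the output as a report on your installation rather than a fixed list.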

Tuure Laurinolli
+2  A: 

No.

However, unicode.org provides codec description files that you can use to generate modules that will decode those encodings. Python source distributions include a script that converts these files: Python-x.x/Tools/unicode/gencodec.py.
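To give an idea of what such a generated module does, here is a hedged Python 3 sketch that builds a 256-entry decoding table from text in the byte/code-point column layout of those mapping files and decodes with `codecs.charmap_decode`. This is not gencodec.py itself: `SAMPLE_MAPPING` is a made-up three-line excerpt, and `build_decoding_table` and `decode_with_table` are hypothetical helpers. It only covers single-byte encodings, so the multi-byte ones (Shift JIS, Hangul, GB2312, Big5) would need a different approach.

```python
import codecs

# Made-up excerpt in the "0xBYTE <tab> 0xCODEPOINT <tab> # comment" layout.
SAMPLE_MAPPING = """\
# comment lines start with '#'
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0x80\t0x00C4\t# LATIN CAPITAL LETTER A WITH DIAERESIS
0x8E\t0x00E9\t# LATIN SMALL LETTER E WITH ACUTE
"""


def build_decoding_table(mapping_text):
    """Build a 256-character table; unmapped bytes stay as U+FFFE (undefined)."""
    table = ["\ufffe"] * 256
    for line in mapping_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # byte listed without a code point: leave it undefined
        table[int(fields[0], 16)] = chr(int(fields[1], 16))
    return "".join(table)


def decode_with_table(data, table, errors="strict"):
    """Decode bytes through the table; charmap_decode returns (text, consumed)."""
    text, _consumed = codecs.charmap_decode(data, errors, table)
    return text


if __name__ == "__main__":
    table = build_decoding_table(SAMPLE_MAPPING)
    print(decode_with_table(b"A\x8e", table))
```

With `errors="strict"`, any byte missing from the table raises UnicodeDecodeError, which makes gaps in a mapping file visible early.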

Aaron Gallagher