I'm running into a problem when converting a UTF-8 file (containing Russian characters) into an ISO-8859-5 file: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>. Does anyone know what's wrong with the following code?

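import codecs
import sys

def convert():
    # Read the whole input file and decode it from UTF-8 to unicode.
    try:
        data = codecs.open('in.txt', 'r', 'utf-8').read()
    except Exception, e:
        print e
        sys.exit(1)

    f = open('out.txt', 'w')

    try:
        # This is where the error is raised: data still starts with u'\ufeff'.
        f.write(data.encode('iso-8859-5'))
    except Exception, e:
        print e
    finally:
        f.close()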

"in.txt": ё!—№%«»(эюпоиуыяафйклж;нцхз

+2  A: 

U+FEFF is the byte order mark (BOM) character. ISO-8859-5 has no representation for it.

You'll need to strip it off your data variable before encoding it into ISO-8859-5.
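
A minimal sketch of that stripping step, assuming the BOM only ever appears as the first character of the decoded text. The helper name strip_bom and the sample bytes are made up for illustration; they are not from the question.

```python
BOM = u'\ufeff'  # the character the error message complains about

def strip_bom(text):
    """Return text without one leading U+FEFF, if it starts with one."""
    if text.startswith(BOM):
        return text[len(BOM):]
    return text

# Hypothetical input: a UTF-8 BOM followed by three Cyrillic letters.
raw = b'\xef\xbb\xbf' + u'\u044d\u044e\u044f'.encode('utf-8')

data = strip_bom(raw.decode('utf-8'))   # plain 'utf-8' keeps the BOM, so drop it by hand
encoded = data.encode('iso-8859-5')     # no longer fails on position 0
```

Only a leading U+FEFF is removed here; one that appears later in the text would still make the encode step fail.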

Douglas Leeder
There's a codec to do it for you, so stripping it manually probably isn't the best idea. See: utf-8-sig.
Devin Jeanpierre
A: 

Thanks for sharing this information! Actually, it just struck me: I want to convert UTF-8 into 8-bit OEM character sets. Or is that the same as ISO 8859-X? I'm not quite sure. Can that be done in Python?

AO
**don't** post "answers" like this to your own question -- edit the question for changes such as the extra questions you're posing here, and/or comment on the specific answer you're being thankful for. Seriously, this behavior is **not** compatible with the way SO works -- for example, now nobody can tell **who** you are thanking out of the two responders! Delete this "answer" and do it right, please.
Alex Martelli
+2  A: 

Python 2.5 and later have the utf-8-sig codec, which automatically strips the BOM from a UTF-8-encoded string or file when reading it:

>>> print '\xef\xbb\xbf\xe3\x81\x82'.decode('utf-8-sig')
あ
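
Applied to the file conversion from the question, this could look roughly like the sketch below. It uses io.open so that it runs on both Python 2.6+ and Python 3; the function signature with two path arguments is an assumption for illustration, not the asker's original code.

```python
import io

def convert(src_path, dst_path):
    """Re-encode a UTF-8 text file (with or without BOM) as ISO-8859-5."""
    # utf-8-sig drops a leading BOM if present and otherwise acts like utf-8.
    with io.open(src_path, 'r', encoding='utf-8-sig') as src:
        data = src.read()

    # Writing raises UnicodeEncodeError if data contains a character
    # that ISO-8859-5 cannot represent.
    with io.open(dst_path, 'w', encoding='iso-8859-5', newline='') as dst:
        dst.write(data)
```

Calling convert('in.txt', 'out.txt') would mirror the original function. Note that the sample "in.txt" also contains a dash and guillemets, which ISO-8859-5 does not include, so encoding that particular text would still fail after the BOM is gone unless those characters are replaced or an error handler such as errors='replace' is passed to the second io.open call.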
Ignacio Vazquez-Abrams