I am working with external data that's encoded in latin1, so I added a sitecustomize.py and put this in it:

sys.setdefaultencoding('latin_1') 

Sure enough, working with latin1 strings now works fine.

But when I encounter a character that cannot be encoded in latin1:

s=str(u'abc\u2013')

I get UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)

What I would like is for the unencodable characters to simply be replaced, i.e. in the above example I would get s=='abc?', and to do that without explicitly calling decode() or encode() each time, i.e. not s.encode(...,'replace') on each call.

I tried doing different things with codecs.register_error but to no avail.

Please help.

+1  A: 

There is a reason scripts can't call sys.setdefaultencoding: site.py deletes it from the sys module at startup. Don't work around that. Some libraries (including standard libraries included with Python) expect the default to be 'ascii'.

Instead, explicitly decode strings to Unicode when read into your program (via file, stdin, socket, etc.) and explicitly encode strings when writing them out.
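As a rough sketch of that pattern (the temporary file and its contents are made up for illustration, standing in for your external latin1 data), decoding once on input and encoding once on output could look like this:

```python
import io
import os
import tempfile

# Hypothetical sample file standing in for the external latin-1 data.
path = os.path.join(tempfile.mkdtemp(), 'external.txt')
with io.open(path, 'wb') as f:
    f.write(b'caf\xe9')  # the bytes for u'caf\xe9' in latin-1

# Decode once, at the point where data enters the program.
with io.open(path, 'r', encoding='latin-1') as f:
    text = f.read()  # a Unicode string from here on

# All internal processing happens on Unicode, not on byte strings.
shout = text.upper()

# Encode once, at the point where data leaves the program.
with io.open(path, 'w', encoding='latin-1') as f:
    f.write(shout)
```

io.open is used here because it accepts an encoding argument on both Python 2.6+ and Python 3; codecs.open would be another option.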

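Explicit decoding and encoding both take an errors parameter ('strict', 'ignore', 'replace') specifying the behavior for bytes that can't be decoded or characters that can't be encoded.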
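For the example in the question, a minimal illustration of the error-handling argument might be the following (variable names are only for illustration). Note that decoding latin1 itself never fails, since all 256 byte values are mapped, so the decode case below uses 'ascii' to show the behavior:

```python
text = u'abc\u2013'  # EN DASH has no latin-1 equivalent

# 'replace' substitutes '?' for each character that can't be encoded.
replaced = text.encode('latin-1', 'replace')

# 'ignore' silently drops such characters instead.
ignored = text.encode('latin-1', 'ignore')

# Decoding accepts the same argument; with 'replace', a byte that can't
# be decoded becomes U+FFFD (the Unicode replacement character).
raw = b'abc\xff'
decoded = raw.decode('ascii', 'replace')
```

Here replaced is the byte string 'abc?', which is the result the question asks for, and ignored is 'abc'.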

Mark Tolonen