Currently I have a simple IRC bot written in Python.

Since I migrated it to Python 3.0, which differentiates between bytes and Unicode strings, I have been running into encoding issues, specifically with other users not sending UTF-8.

Now, I could just tell everyone to send UTF-8 (which they should regardless), but a better solution would be to get Python to fall back to some other encoding when UTF-8 fails.

So far the code looks like this:

data = str(irc.recv(4096), "UTF-8", "replace")

This at least doesn't throw exceptions. However, I want to go further: I want my bot to fall back to another encoding, or to detect "troublesome characters" somehow.

Additionally, I need to figure out what this mysterious encoding that mIRC uses actually is - as other clients appear to work fine and send UTF-8 like they should.

How should I go about doing those things?

+3  A: 

chardet should help - it's the canonical Python library for detecting unknown encodings.

RichieHindle
Trying that now. I'll see where it takes me.
Adi
A: 

OK, after some research, it turns out chardet has trouble with Python 3. The solution turned out to be simpler than I thought. I chose to fall back on CP1252 if UTF-8 doesn't cut it:

data = irc.recv(4096)
try:
    data = data.decode("utf-8")
except UnicodeDecodeError:
    # Not valid UTF-8, so assume Windows-1252
    data = data.decode("cp1252")

This seems to be working. It doesn't actually detect the encoding, though, so if somebody comes in with an encoding that is neither UTF-8 nor CP1252, I will have a problem again.

This is really just a temporary solution.
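
The same fallback idea can be written as a small helper that tries a list of encodings in order. This is only a sketch: the name `decode_line` and the choice of candidate encodings are assumptions for illustration, not something established in this thread. If nothing decodes cleanly, it falls back to UTF-8 with the "replace" error handler so the bot keeps running.

```python
def decode_line(raw, encodings=("utf-8", "cp1252")):
    """Return raw bytes decoded with the first encoding that accepts them.

    decode_line is a hypothetical helper name; the candidate list is an
    assumption and can be changed to suit the channel.
    """
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # No candidate decoded cleanly: mark undecodable bytes with U+FFFD
    # instead of raising.
    return raw.decode("utf-8", errors="replace")
```

With this in place, the receive line would become something like data = decode_line(irc.recv(4096)). UTF-8 should stay first in the list, since it is strict and rejects most non-UTF-8 input, while CP1252 accepts almost anything.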

Adi
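cp1252 will appear to work for almost any byte sequence, because it assigns a character to nearly every byte value (in Python's codec only 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined). A successful decode therefore doesn't tell you the data really was cp1252.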
RichieHindle
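
A quick way to see why a successful CP1252 decode proves little, and why UTF-8 has to be tried first: the UTF-8 bytes for a single accented letter also decode under CP1252 without any error, just into the wrong characters. The variable names below are illustrative only.

```python
utf8_bytes = "é".encode("utf-8")          # two bytes: 0xC3 0xA9

# cp1252 accepts both bytes, so no exception is raised,
# but the result is two unrelated characters.
as_cp1252 = utf8_bytes.decode("cp1252")

# Decoding with the encoding that was actually used gives the original text.
as_utf8 = utf8_bytes.decode("utf-8")

print(repr(as_cp1252), repr(as_utf8))     # 'Ã©' 'é'
```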