3047 views · 4 answers
Any thoughts on why this isn't working? I really thought 'ignore' would do the right thing.

>>> 'add \x93Monitoring\x93 to list '.encode('latin-1','ignore')
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 4: ordinal not in range(128)
+2  A: 

encode is available on unicode strings, but the string you have there does not seem to be unicode (try it with u'add \x93Monitoring\x93 to list ')

>>> u'add \x93Monitoring\x93 to list '.encode('latin-1','ignore')
'add \x93Monitoring\x93 to list '
Roberto Liffredo
Well, the string is coming in that way, as non-unicode. So I need to do something to the string.
Greg
This means that the string you get has already been encoded. In the example below, you simply decode and encode again, assuming a latin-1 encoding (and this may not always be true). I think you can simply go on with your string and let the output handle it correctly.
Roberto Liffredo
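To make Roberto's point concrete, here is a quick sketch in Python 3 terms, where the str/bytes split is explicit (in Python 2, the same fix applies once you treat the incoming string as bytes):

```python
# The incoming string is already-encoded bytes, so decode it first.
data = b'add \x93Monitoring\x93 to list '   # cp1252/latin-1 encoded bytes
text = data.decode('cp1252')                # bytes -> unicode string
# 'ignore' belongs on the encode side, for when the target encoding
# cannot represent a character -- here the curly quotes are dropped:
ascii_only = text.encode('ascii', 'ignore')
```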
A: 

This seems to work:

'add \x93Monitoring\x93 to list '.decode('latin-1').encode('latin-1')

Any issues with that? I wonder when 'ignore', 'replace' and other such encode error handlers come in?

Greg
It comes in when you want to encode a unicode string that contains code points that are not representable in your chosen encoding, e.g. Chinese characters in latin1. You can then specify how the encoding should react to such code points.
unbeknown
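A small illustration of that point (Python 3 syntax; the string contents are made up for the example):

```python
s = 'na\xefve \u03a9'   # 'naïve Ω': ï exists in latin-1, Ω does not
s.encode('latin-1', 'ignore')    # the omega is silently dropped
s.encode('latin-1', 'replace')   # the omega becomes '?'
# s.encode('latin-1')            # with no handler: UnicodeEncodeError
```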
As said above, this is doing nothing. You are passing the string through a function, then through its reverse. The final string is in the best case the very same as the original; in the worst case you have issues like those outlined by Heiko.
Roberto Liffredo
Seems to work?? str_object.decode('latin1').encode('latin1') == str_object FOR ALL STR OBJECTS. In other words, it does exactly nothing.
John Machin
It does nothing for Latin-1. It's different for encodings for which arbitrary byte sequences aren't always valid, or have multiple encodings of the same character.
dan04
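dan04's distinction can be checked directly; a sketch in Python 3 syntax, using the question's bytes:

```python
raw = b'add \x93Monitoring\x93 to list '
# latin-1 maps every byte 0x00-0xFF to a code point, so the
# decode/encode round trip is always the identity:
assert raw.decode('latin-1').encode('latin-1') == raw
# utf-8 rejects the lone byte \x93; decoding with 'replace'
# loses information, so the round trip is NOT the identity:
assert raw.decode('utf-8', 'replace').encode('utf-8') != raw
```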
+31  A: 

…there's a reason they're called "encodings"…

A little preamble: think of Unicode as the norm, or the ideal state. Unicode is just a table of characters. Character №65 is the Latin capital A; №937 is the Greek capital omega. Just that.
In order for a computer to store and/or manipulate Unicode, it has to encode it into bytes. The most straightforward encoding of Unicode is UCS-4; every character occupies 4 bytes, and all ~1.1 million code points are available. The 4 bytes contain the character's number in the Unicode tables as a 4-byte integer. Another very useful encoding is UTF-8, which can encode any Unicode character with one to four bytes. But there are also some limited encodings, like "latin1", which cover a very limited range of characters, mostly used by Western countries. Such encodings use only one byte per character.
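To see those sizes in practice (Python 3 sketch, using the omega mentioned above):

```python
omega = '\u03a9'    # GREEK CAPITAL OMEGA, character number 937
len(omega.encode('utf-32-be'))   # 4 bytes: fixed-width, UCS-4-style
len(omega.encode('utf-8'))       # 2 bytes: variable-width
# omega.encode('latin-1') raises UnicodeEncodeError: 937 is out of range
```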

Basically, Unicode can be encoded with many encodings, and encoded strings can be decoded to Unicode. The thing is, Unicode came quite late, so all of us that grew up using an 8-bit character set learned too late that all this time we worked with encoded strings. The encoding could be ISO8859-1, or windows CP437, or CP850, or, or, or, depending on our system default.

So when, in your source code, you enter the string "add “Monitoring“ to list" (and I think you wanted the string "add “Monitoring” to list", note the second quote), you are actually using a string already encoded according to your system's default codepage (judging by the byte \x93, I assume you use Windows codepage 1252, “Western”). If you want to get Unicode from that, you need to decode the string from the "cp1252" encoding.

So, what you meant to do, was:

"add \x93Monitoring\x94 to list".decode("cp1252", "ignore")

It's unfortunate that Python 2.x includes an .encode method for strings too; this is a convenience function for "special" encodings, like the "zip" or "rot13" or "base64" ones, which have nothing to do with Unicode.
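(As a side note: in Python 3 those byte-transform codecs were removed from the str methods; as a sketch, the same transforms are still reachable through the codecs module:)

```python
import codecs

# rot13 is a str-to-str transform with no Unicode semantics at all;
# in Python 3 it is only reachable via codecs.encode/decode.
codecs.encode('Monitoring', 'rot_13')   # 'Zbavgbevat'
```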

Anyway, all you have to remember for your to-and-fro Unicode conversions is:

  • a Unicode string gets encoded to a Python 2.x string (actually, a sequence of bytes)
  • a Python 2.x string gets decoded to a Unicode string

In both cases, you need to specify the encoding that will be used.
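The two directions above can be sketched in one round trip (Python 3 syntax, where "Python 2.x string" corresponds to bytes):

```python
text = '\u201cMonitoring\u201d'   # a unicode string with curly quotes
data = text.encode('utf-8')       # unicode -> bytes: encode
back = data.decode('utf-8')       # bytes -> unicode: decode
```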

I'm not very clear, I'm sleepy, but I sure hope I help.

PS A humorous side note: Mayans didn't have Unicode; ancient Romans, ancient Greeks, ancient Egyptians didn't too. They all had their own "encodings", and had little to no respect for other cultures. All these civilizations crumbled to dust. Think about it people! Make your apps Unicode-aware, for the good of mankind. :)

PS2 Please don't spoil the previous message by saying "But the Chinese…". If you feel inclined or obligated to do so, though, delay it by thinking that the Unicode BMP is populated mostly by Chinese ideograms, ergo Chinese is the basis of Unicode. I can go on inventing outrageous lies, as long as people develop Unicode-aware applications. Cheers!

ΤΖΩΤΖΙΟΥ
I think this should from now on be the default answer for all Python+Unicode questions.
unbeknown
25 years of programming, 10 years programming Python and it's the first time in my life I'm understanding encodings so clearly.
Oli
Unicode is not just a table of characters e.g., a single abstract character may be represented by a sequence of code points: latin capital letter g with acute (corresponding coded character u"\u01F4" or 'Ǵ') is represented by the sequence u"\u0047\u0301" (or 'Ǵ'). http://is.gd/eTLi-
J.F. Sebastian
@J.F. Sebastian: no, Unicode isn't just a table of characters. I oversimplified things just for the purposes of this answer.
ΤΖΩΤΖΙΟΥ
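J.F. Sebastian's example can be verified with the stdlib unicodedata module (Python 3 sketch):

```python
import unicodedata

composed   = '\u01f4'    # LATIN CAPITAL LETTER G WITH ACUTE: one code point
decomposed = 'G\u0301'   # 'G' + COMBINING ACUTE ACCENT: two code points
composed == decomposed                                # False as sequences
unicodedata.normalize('NFC', decomposed) == composed  # True as characters
```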
Nice answer guy with the Omega in his name. I just [answered a similar question](http://stackoverflow.com/questions/3224427/python-sanitize-a-string-for-unicode/3224799#3224799) but hadn't seen your answer yet.
darkporter
Also, I believe UTF-8 uses 1 to 6 bytes. There are 2^32 characters possible, but the encoding itself has some overhead for tracking multibyte sequence length.
darkporter
@darkporter: yes, UTF-8 could in theory use up to 6 bytes, if the Unicode standard used the complete 32-bit range for characters. Currently, though, the maximum Unicode character is U+10FFFF, and all Unicode characters need at most 4 bytes when encoded as UTF-8.
ΤΖΩΤΖΙΟΥ
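That 4-byte ceiling is easy to check (Python 3 sketch):

```python
top = '\U0010ffff'          # the highest code point Unicode defines
len(top.encode('utf-8'))    # 4 bytes, never more under current Unicode
```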
+1  A: 

I also wrote a long blog about this subject:

The Hassle of Unicode and Getting on With It

Gregg Lind