views:

339

answers:

3

In python, I compressed a string using zlib, and then inserted it into a mysql column that is of type blob, using the utf-8 encoding. The string comes back as utf-8, but it's not clear how to get it back into a format where I can decompress it. Here is some pseduo-output:

valueInserted = zlib.compress('a') = 'x\x9cK\x04\x00\x00b\x00b'

valueFromSqlColumn = u'x\x9cK\x04\x00\x00b\x00b'

zlib.decompress(valueFromSqlColumn) UnicodeEncodeError: 'ascii' codec can't encode character u'\x9c' in position 1: ordinal not in range(128)

if i do this, it inserts some extra characters:

valueFromSqlColumn.encode('utf-8') = 'x\xc2\x9cK\x04\x00\x00b\x00b'

Any suggestions?

+1  A: 

Unicode is designed to be compatible with latin-1, so try:

>>> import zlib
>>> u = zlib.compress("test").decode('latin1')
>>> u
u'x\x9c+I-.\x01\x00\x04]\x01\xc1'

And then

>>> zlib.decompress(u.encode('latin1'))
'test'

EDIT: Fixed typo, latin-1 is isn't designed to be compatible with unicode, it's the other way around.

csl
Latin-1 and UTF-8 aren't compatible - there are differences once you're out of the plain ASCII range, which you'd definitely be once you zlib compressed something
Blair Conrad
@Blair: What is meant is that all latin-1's byte values correspond to a codepoint in Unicode with the same value. UTF-8 and latin-1 will render these differently.
kaizer.se
csl: You must mean Unicode is backwards-compatible with Latin-1 (in one aspect), not UTF-8.
kaizer.se
does this mean latin-1 is the wrong encoding? i'm going to be zlib compressing something that's latin1.
Heinrich Schmetterling
@Heinrich: compress returns bytes (str) that we decode to yield a unicode string (unicode). The Latin-1 encoding has the property that a latin-1 codepoint is exactly the same as the unicode codepoint (not other way around, latin-1 has only 255 codepoints). This makes sure that the bytes `'x\x9cK\x04\x00\x00b\x00b'` are translated to the same codepoints in unicode `u'x\x9cK\x04\x00\x00b\x00b'`
kaizer.se
@csl: "Latin1 is designed to be compatible with UTF-8" is palpably untrue. @Heinrich: the code supplied should work no matter what you believe the encoding to be, and should work even if the uncompressed string is not interpretable as text but represents a bitmap or some other binary blob. The .decode('latin1') is just a "magic trick" to complete the process of ensuring that the trip into the database and back out again reproduces the original compressed data. Note that I say "**should** work" ... you haven't divulged exactly what happens on that trip.
John Machin
@kaiser.se: Umm call me crazy but I thought latin1 had 256 codepoints; which one is missing?
John Machin
@John: You are probably right! In the moment I wrote the sentence I thought that 0 is probably not a real codepoint. But we count it, right?
kaizer.se
@kaiser.se: Of course it's a "real" codepoint. It represents the ASCII NUL control character.
John Machin
Yes guys, I meant the other way around. Thanks.
csl
+1  A: 
valueInserted = zlib.compress('a') = 'x\x9cK\x04\x00\x00b\x00b'

Note that this is an str object. You say that you "inserted it into a mysql column that is of type blob, using the utf-8 encoding". AS the compressed string is binary, not text, "blob" is an appropriate type of column, but ANY encoding or other transformation is a very bad idea. You need to be able to recover from the database EXACTLY right down to the last bit what you inserted, otherwise the decompression will fail, either by raising an error or (less likely, but worse) silently producing garbage.

You say that you get back after whatever process you go through in inserting it and extracting it again is:

valueFromSqlColumn = u'x\x9cK\x04\x00\x00b\x00b'

Note carefully that there is only one tiny visual difference: u'something' instead of 'something'. That makes it a unicode object. Based on your own evidence so far, "comes back as UTF-8" is not correct. A unicode object and a str object encoded in utf8 are not the same thing.

Guess 1: insert as raw string, extract with latin1 decode.

Guess 2: insert as compressed.decode('latin1').encode('utf8'), extract with utf8 decode.

You really need to understand the process of inserting and extracting, including what encodes and decodes happen by default.

Then you really need to fix your code. However in the meantime you can probably kludge up what you've got.

Note this:

>>> valueFromSqlColumn = u'x\x9cK\x04\x00\x00b\x00b'
>>> all(ord(char) <= 255 for char in valueFromSqlColumn)
True

Do some trials with more complicated input than 'a'. If, as I guess, you see that all of the unicode characters have an ordinal in range(256), then you have a simple kludge:

>>> compressed = valueFromSqlColumn.encode('latin1')
>>> compressed
'x\x9cK\x04\x00\x00b\x00b'
>>> zlib.decompress(compressed)
'a'

Why this works is that Latin1 encoding/decoding doesn't change the ordinal. You could recover the original compressed value by:

>>> compressed2 = ''.join(chr(ord(uc)) for uc in valueFromSqlColumn)
>>> compressed2
'x\x9cK\x04\x00\x00b\x00b'
>>> compressed2 == compressed
True

if you think using .encode('latin1') is too much like voodoo.

If the above doesn't work (i.e. some ordinals are not in range(256)), then you will need to produce a small runnable script that shows exactly and reproducibly how you are compressing, inserting into the database, and retrieving from the database ... sprinkle lots of print "variable", repr(variable) around your code so that you can see what is happening.

John Machin
@kaiser.se: You have interesting definitions of "exactly" and "alike". @anonymous_driveby_downvoter: care to leave an explanation?
John Machin
+1  A: 

You have a unicode object that is really encoding bytes. That's unfortunate, since unicode strings should really only be coding text, right?

Anyway, what we want to do is to construct a byte string.. this is a str in Python 2.x. We see by the printed string you gave u'x\x9cK\x04\x00\x00b\x00b' that the byte values are encoded as unicode codepoints. We can get the numerical value of a codepoint by using the function ord(..). Then we can get the byte string representation of that number with the function chr(..). Let's try this:

>>> ord(u"A")
65
>>> chr(_)
'A'

So we can decode the string ourselves:

>>> udata = u'x\x9cK\x04\x00\x00b\x00b'
>>> bdata = "".join(chr(ord(uc)) for uc in udata)
>>> bdata
'x\x9cK\x04\x00\x00b\x00b'

(Wait, what does the above code do? The join stuff? What we first do is create a list of the code points in the string:

>>> [ord(uc) for uc in udata]
[120, 156, 75, 4, 0, 0, 98, 0, 98]

Then we intepret the numbers as bytes, converting them individually:

>>> [chr(ord(uc)) for uc in udata]
['x', '\x9c', 'K', '\x04', '\x00', '\x00', 'b', '\x00', 'b']

Finally, we join them with "" as separator using "".join(list-of-strings)

End of Wait..)

However, cls cleverly notes that the Latin-1 encoding has the property that a character's byte value in the Latin-1 encoding is equal to the character's codepoint in Unicode. Given, of course, that the character is inside the range 0 to 255 where Latin-1 is defined. This means we can do the byte conversion directly with Latin-1:

>>> udata = u'x\x9cK\x04\x00\x00b\x00b'
>>> udata.encode("latin-1")
'x\x9cK\x04\x00\x00b\x00b'

Which as you can see, gives the same result.

kaizer.se