I'd like to include a large compressed string in a JSON packet, but I'm having some difficulty.

import json,bz2
myString = "A very large string"  
zString = bz2.compress(myString)
json.dumps({ 'compressedData' : zString })

which will result in a

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 10-13: invalid data

An obvious solution is bz2'ing the entire JSON structure, but let's just assume I'm using a black-box API that does the JSON encoding and it wants a dict.

Also, I'm just using bz2 as an example; I don't really care what the actual algorithm is, though I noticed the same behavior with zlib.

I can understand why these two compression libraries wouldn't create UTF-8 compatible output, but is there any solution that can effectively compress UTF-8 strings? This page seemed like a gold mine http://unicode.org/faq/compression.html but I couldn't find any relevant Python information.

+4  A: 

Do you mean "compress to UTF-8 strings"? I'll assume that, since any generic compressor will compress UTF-8 strings. However, no real-world compressor is going to compress to a UTF-8 string.

You can't store arbitrary 8-bit data, such as compressor output, directly in JSON, because JSON strings are defined as Unicode. You'd have to base64-encode the data before giving it to JSON:

json.dumps({ 'compressedData' : base64.b64encode(zString) })
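For reference, here is a minimal round-trip sketch of the same idea written for Python 3, where bz2 and base64 both work on bytes and the base64 output has to be decoded to str before json.dumps will accept it. The helper names pack and unpack are made up for illustration, and the sample text is a placeholder.

```python
import base64
import bz2
import json


def pack(text):
    """Compress text with bz2 and return a JSON-safe ASCII string."""
    compressed = bz2.compress(text.encode("utf-8"))
    # b64encode returns bytes; decode to str so json.dumps accepts it
    return base64.b64encode(compressed).decode("ascii")


def unpack(field):
    """Reverse pack(): base64-decode, decompress, decode back to text."""
    return bz2.decompress(base64.b64decode(field)).decode("utf-8")


my_string = "A very large string " * 1000  # placeholder payload
packet = json.dumps({"compressedData": pack(my_string)})

# The receiving side parses the JSON and reverses the two encodings
restored = unpack(json.loads(packet)["compressedData"])
assert restored == my_string
```

The receiver has to know that the field is base64 over bz2; nothing in the JSON itself says so.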

However, base64 inherently causes a 4/3 encoding overhead: every 3 bytes of input become 4 characters of output, so the encoded data is about 33% larger. If you're compressing typical string data you'll probably get enough compression for this to still be a win, but it's a significant overhead. You might find an encoding with a little less overhead, but not much.
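To put rough numbers on that overhead, the sketch below encodes the same bytes with base64 and with base85, which is one example of an encoding with slightly less overhead (5 characters per 4 bytes). It assumes Python 3.4 or later, where base64.b85encode is available; the random bytes merely stand in for compressor output.

```python
import base64
import json
import os

# Random bytes stand in for compressed data, which is similarly incompressible
raw = os.urandom(3000)

b64 = base64.b64encode(raw)  # 4 output characters per 3 input bytes
b85 = base64.b85encode(raw)  # 5 output characters per 4 input bytes

print("base64 ratio:", len(b64) / len(raw))
print("base85 ratio:", len(b85) / len(raw))

# Both results are plain ASCII, so they can be placed in a JSON string
packet = json.dumps({"compressedData": b85.decode("ascii")})
assert base64.b85decode(json.loads(packet)["compressedData"]) == raw
```

The saving over base64 is small, and the receiver needs a matching base85 decoder, so base64 is usually the safer default.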

Note that if you're using this to send data to a browser, you're better off letting HTTP compression do this; it's widely supported and will be much more robust.
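HTTP compression works on the serialized response body as a whole, so no base64 step is involved. The sketch below only imitates that with Python 3's gzip module to show the effect; in practice the web server does this for you, and the payload here is a made-up placeholder.

```python
import gzip
import json

payload = {"data": "A very large string " * 1000}  # placeholder payload

# Serialize first, then compress the whole body, as a server would
# when sending a response with Content-Encoding: gzip
body = json.dumps(payload).encode("utf-8")
compressed_body = gzip.compress(body)

# The client transparently decompresses before parsing the JSON
restored = json.loads(gzip.decompress(compressed_body).decode("utf-8"))
assert restored == payload
```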

Glenn Maynard
+1 for HTTP compression. Whilst you can theoretically get somewhat-more-efficient-than-base64 encoding, it's full of pitfalls. Let your web server (mod_deflate etc) handle it.
bobince
Sadly, it's not going to a browser but this answer was full of wondrous information.
Ralphleon