I have a model form with a file field on it. I have a post_save signal attached to the model so that I can then pass the uploaded file on to a 3rd-party via a web service (using Suds). The web service call is dying when I try to pass it the file contents: it throws "UnicodeDecodeError: 'ascii' codec can't decode byte . . . " (much like in this SO question).

The thing I don't get: when I dump out the file contents to screen during my signal call, it looks like a mess of badly-encoded garbage:

åÉe Qçú>↑ Åû½ΣΘ⌐v^τ  F,K╪Y<▲î°bαⁿ╡ê5  ╜ù  sö╛Aî▲ƒF|04∙f╛@╙We⌡  ╤â╩_α↑└ƒ∙│ßï(è═|←⌂┌▒■µ'£─♂  ¢V↓ⁿq_;εδ▼εb<í╜ƒÅΩN00τó╛‼¥U╫Z─)?¬∞┐Γ╠C4ä▬Il☼Jº╚J╥Ñ├¿öÆi2═♂ïσNù&▐╤╡╔ΩIêµ╬]└@Üα╒→║¶\⌐UÑ╬çµ∟h⌂¼┘ë¢←↕╚↔ùα▌.¢d╖Y¡,♫½qÆ~╞äLX┬ä[┬2≥¥í=<ß▼]Hⁿ↕!b÷ ñÑU┌M╥╦m¼'½ù'∞"'£└►oêu↓q┘ôÉ>i_÷αµ0♥k§w▒c╠═╬6╙N2▀!)`►

When I grab the same object via the command line and call the exact same method on it, it all looks nicely encoded:

\x00F,K\xd8Y<\x1e\x8c\xf8b\xe0\xfc\xb5\x885\xff\x00\xbd\x97\xff\x00s\x07\x94\xbeA\x8c\x1e\x9fF|04\xf9f\xbe@\xd3We\xf5\xff\x00\xd1\x83\xca_\xe0\x18\xc0\x9f\xf9\xb3\xe1\x8b(\x8a\xcd|\x1b\x7f\xda\xb1\xfe\xe6\'\x9c\xc4\x0b\xff\x00\x9bV\x19\x07\xfcq_;\xee\xeb\x1f\xeeb<\xa1\xbd\x9f\x8f\xeaN00\xe7\xa2\xbe\x13\x9dU\xd7Z\xc4)?\xaa\xec\xbf\xe2\xccC4\x84\x16Il\x0fJ\xa7\xc8J\xd2\xa5\xc3\xa8\x94\x92i2\xcd\x0b\x8b\xe5N\x97&\xde\xd1\xb5\xc9\xeaI\x88\xe6\xce]\xc0@\x9a\xe0\xd5\x1a\xba\x14\\\xa9U\xa5\xce\x87\xe6\x1ch\x7f\xac\xd9\x89\x9b\x1b\x12\xc8\x1d\x97\xe0\xdd.\x9bd\xb7Y\xad,\x0e\xabq\x92~\xc6\x84LX\xc2\x84[\xc22\xf2\x9d\xa1=<\xe1\x1f]H\xfc\x12!b\xf6\x00\xa4\xa5U\xdaM\xd2\xcbm\xac\'\xab\x97\'\xec"\'\x9c\xc0\x10o\x88u\x19q\xd9\x93\x90>i_\xf6\xe0\xe60\x03k\x15w\xb1c\xcc\xcd\xce6\xd3N2\xdf!)`\x10\nB\x8a\xaes\x13\xad\xd4a\x19\xa7p?\xff\xd9'

What's happening between the two steps and how can I get the proper contents back? Grabbing a second version of the object during my signal just gives me back the badly-encoded mess again. N.B., this is happening on Windows.

A: 

An ASCII codec obviously cannot decode this because it isn't ASCII. I think you will have to find out the encoding of the data and pass a unicode string to Suds. For example, if the encoding is UTF-16, pass unicode(binarydata, 'utf-16') to Suds.
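To make the failure and the suggested fix concrete, here is a minimal sketch. It uses Python 3 syntax, where `bytes.decode` plays the role of Python 2's `unicode(...)`. The variable names and the UTF-16 sample are made up for illustration and are not the poster's actual file, and the approach only applies if the payload really is text in a known encoding.

```python
# Three bytes from the dump in the question; 0xd8 is outside the ASCII range.
binary_data = b'\xd8Y<'

try:
    binary_data.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    # Same class of error that the web service call reports.
    ascii_error = str(exc)
print(ascii_error)

# Hypothetical sample: the text "hi" encoded as little-endian UTF-16.
utf16_data = 'hi'.encode('utf-16-le')

# Decoding with the correct codec yields a text (unicode) string.
# Python 2 equivalent: unicode(utf16_data, 'utf-16-le')
text = utf16_data.decode('utf-16-le')
print(text)
```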

What you regard as a mess of badly-encoded garbage in your first example is simply what your screen displays when you let it show binary data. The characters that are displayed here depend on the character set configuration of your system.

Your second example is Python's string representation of some binary data. A string representation contains only printable ASCII characters. Non-printable and non-ASCII bytes are displayed as hexadecimal escapes such as \xd8. This string representation just shows the bytes of your data and doesn't tell you whether the data is nicely encoded or not in some character set.
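The two displays can be reproduced from the same bytes. The sketch below is Python 3, and it assumes the Windows console uses code page 437; that is only a guess made for illustration, although it does happen to match the `F,K╪Y` fragment visible in the first dump.

```python
# A few bytes that appear in both dumps in the question.
data = b'F,K\xd8Y'

# repr() keeps printable ASCII and escapes every other byte.
as_repr = repr(data)
print(as_repr)

# A console maps each byte to a glyph through its code page.
# cp437 is an assumption about the poster's console.
as_console = data.decode('cp437')

# ascii() is used so the output is safe on any terminal.
print(ascii(as_console))
```

Both results come from identical bytes, so neither one is "better encoded" than the other.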

I was not able to properly identify the encoding of your second example. The closest I found was 'utf-16-le'. But this still causes decoding errors at surrogate pairs.

Using s.decode('utf-16-le', 'replace') I got a bunch of Chinese characters:

䘀䬬姘Ḽ뗼㖈ÿ鞽ÿݳ뺔豁鼞籆㐰曹䂾埓ÿ菑忊ᣠ鿀돹诡訨糍缛뇚鰧௄ÿ嚛ܙ燼㭟㱢붡辟仪〰ꋧᎾ喝嫗⧄꨿뿬쳢㑃ᚄ汉䨏좧퉊쎥钨榒촲謋以⚗퇞즵䧪巎䃀᫕ᒺ꥜ꕕ蟎᳦罨�鮉ማ᷈⻝撛妷ⲭꬎ鉱왾䲄쉘宄㋂鷲㶡崟ﱈℒꐀ喥䷚쯒걭꬧➗⋬鰧Ⴠ衯᥵�邓椾̰ᕫ녷챣컍팶㉎⇟怩ਐ詂玮괓懔ꜙ㽰�

The interesting thing is that Google translates the third character, 姘, to http.
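The difference between strict decoding and decoding with the 'replace' error handler can be sketched as follows. This is Python 3 syntax, and the `broken` sample is a hypothetical byte string built to contain an unpaired surrogate; it is not taken from the poster's data.

```python
# First six bytes of the dump; they decode cleanly as three UTF-16 code units.
s = b'\x00F,K\xd8Y'
u = s.decode('utf-16-le')
print(ascii(u))  # '\u4600\u4b2c\u59d8', as in the session in this answer

# Hypothetical sample: a lone high surrogate (0xD800) followed by 'A'.
broken = b'\x00\xd8A\x00'

try:
    broken.decode('utf-16-le')
    strict_failed = False
except UnicodeDecodeError:
    # Strict decoding rejects the unpaired surrogate.
    strict_failed = True

# 'replace' substitutes U+FFFD for the part that cannot be decoded.
lenient = broken.decode('utf-16-le', 'replace')
print(strict_failed, ascii(lenient))
```

Getting a result back from 'replace' does not mean the codec is the right one. It only means the undecodable parts were swapped for replacement characters.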

UPDATE: The following interactive Python session may clarify what I mean in my comment below:

>>> s = '\x00F,K\xd8Y'
>>> print(s)
F,K�Y
>>> u = s.decode('utf-16-le')
>>> u
u'\u4600\u4b2c\u59d8'
>>> print(u)
䘀䬬姘
>>> 
Bernd Petersohn
I understand the first is the binary encoded version of the file. What I don't understand is why I get it when I access the file at save time, but get the Unicode representation when I open it at the command line. I'm guessing it has something to do with being on Windows and the encoding set on my code files.
Tom
@Tom: I assume that in the first version you are directly printing the binary content of the file to the screen, whereas in the second version you let Python print a string representation of the data (possibly in an interactive Python shell?). The second version is still not unicode - the string representation just shows the bytes. Python's representation of a unicode string would look like `u'\u4600\u4b2c\u59d8'` (these are the first three Chinese characters from above).
Bernd Petersohn