Stumbled upon some seemingly random character mangling in the Eclipse PyDev console: specific characters are read from stdin as '\xd0?' (first byte correct, second replaced by "?").

Is there some solution to this?

(PyDEV 1.4.6, Python 2.6, console encoding - inherited UTF-8, Eclipse 3.5, WinXP with UK locale)

Code:

import sys
if __name__ == "__main__":
    for l in sys.stdin:
        print 'Byte:   ', repr(l)
        try:
            u = repr(unicode(l))
            print 'Unicode:', u
        except Exception, e:
            print 'Fail:   ', e

Input:

йцукенгшщзхъ
фывапролджэ
ячсмитьбю
ЙЦУКЕНГШЩЗХЪ
ФЫВАПРОЛДЖЭ
ЯЧСМИТЬБЮ

and output:

Byte:    '\xd0\xb9\xd1\x86\xd1\x83\xd0\xba\xd0\xb5\xd0\xbd\xd0\xb3\xd1\x88\xd1\x89\xd0\xb7\xd1\x85\xd1\x8a\r\n'
Unicode: u'\u0439\u0446\u0443\u043a\u0435\u043d\u0433\u0448\u0449\u0437\u0445\u044a\r\n'
Byte:    '\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0\xd0\xbf\xd1\x80\xd0\xbe\xd0\xbb\xd0\xb4\xd0\xb6\xd1?\r\n'
Fail:    'utf8' codec can't decode bytes in position 20-21: invalid data
Byte:    '\xd1?\xd1\x87\xd1?\xd0\xbc\xd0\xb8\xd1\x82\xd1\x8c\xd0\xb1\xd1\x8e\r\n'
Fail:    'utf8' codec can't decode bytes in position 0-1: invalid data
Byte:    '\xd0\x99\xd0\xa6\xd0\xa3\xd0\x9a\xd0\x95\xd0?\xd0\x93\xd0\xa8\xd0\xa9\xd0\x97\xd0\xa5\xd0\xaa\r\n'
Fail:    'utf8' codec can't decode bytes in position 10-11: invalid data
Byte:    '\xd0\xa4\xd0\xab\xd0\x92\xd0?\xd0\x9f\xd0\xa0\xd0\x9e\xd0\x9b\xd0\x94\xd0\x96\xd0\xad\r\n'
Fail:    'utf8' codec can't decode bytes in position 6-7: invalid data
Byte:    '\xd0\xaf\xd0\xa7\xd0\xa1\xd0\x9c\xd0\x98\xd0\xa2\xd0\xac\xd0\x91\xd0\xae\r\n'
Unicode: u'\u042f\u0427\u0421\u041c\u0418\u0422\u042c\u0411\u042e\r\n'
A: 

I'm not too sure about input encoding, but I've found that with output encoding to tty streams, an explicit encoding step was needed for Python 2.x but not for Python 3.x.

So for input you may need an explicit decode step using e.g. l.decode(sys.stdin.encoding).
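A minimal sketch of such an explicit decode step, written for Python 3, where raw lines would come from sys.stdin.buffer. In Python 2 the line read from sys.stdin is already a byte string, so the same .decode call applies. The helper name decode_line and the UTF-8 fallback are assumptions for illustration, not part of the question.

```python
import sys


def decode_line(raw, encoding=None):
    """Decode one raw input line (bytes) into text.

    Hypothetical helper: uses the given encoding, else whatever
    sys.stdin reports, else UTF-8 as an assumed fallback.
    """
    encoding = encoding or getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return raw.decode(encoding)


# A well-formed UTF-8 line decodes cleanly.
print(repr(decode_line(b'\xd0\xb9\xd1\x86\r\n', 'utf-8')))

# A line mangled to '\xd1?' still fails, so decoding alone cannot repair it.
try:
    decode_line(b'\xd1?\r\n', 'utf-8')
except UnicodeDecodeError as error:
    print('Fail:', error)
```

Note that the explicit decode only helps if the bytes arrive intact; it cannot recover a byte that was already replaced by "?" before Python saw it.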

Does it work OK in a vanilla Python console?

Vinay Sajip
+2  A: 

Well, I don't know how to fix it, but I have deduced the pattern in what goes wrong.

The bytes that get replaced with "?" are precisely those bytes that are not defined in windows-1252 - that is, bytes 0x81, 0x8d, 0x8f, 0x90, and 0x9d.
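The list of undefined bytes can be checked against Python's own cp1252 codec. This is a small sketch for Python 3; the function name undefined_in_cp1252 is made up for the example, and it assumes Python's cp1252 table matches the one used by the console.

```python
def undefined_in_cp1252():
    """Return the byte values that Python's cp1252 codec refuses to decode."""
    undefined = []
    for value in range(256):
        try:
            bytes([value]).decode('cp1252')
        except UnicodeDecodeError:
            undefined.append(value)
    return undefined


print([hex(value) for value in undefined_in_cp1252()])
```

Each "?" in the question's output sits where one of these bytes would be in the UTF-8 encoding, for example 0x8d in 'э' (d1 8d) and 0x90 in 'А' (d0 90).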

What this looks like to me is that somehow you're getting this series of translations:

  • unicode input -> series of bytes in utf-8

  • utf-8 bytes -> read by something that expects the input to be Windows-1252, and so translates impossible bytes to "?"

  • the characters are then converted back to bytes via windows-1252, and fed into your variable l.
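The suspected chain of translations can be imitated in a few lines of Python 3. This is only a simulation: it assumes the intermediate layer behaves like Python's 'replace' error handler, which may not be exactly what the console does, and simulate_round_trip is a hypothetical name.

```python
def simulate_round_trip(text):
    """Mimic the suspected UTF-8 -> windows-1252 -> bytes translation."""
    utf8_bytes = text.encode('utf-8')
    # Misread the UTF-8 bytes as windows-1252; undefined bytes become U+FFFD.
    misread = utf8_bytes.decode('cp1252', errors='replace')
    # Convert back to bytes via windows-1252; U+FFFD has no mapping there,
    # so the 'replace' handler emits b'?' in its place.
    return misread.encode('cp1252', errors='replace')


# 'э' is U+044D, encoded in UTF-8 as d1 8d; 0x8d is undefined in windows-1252.
print(repr(simulate_round_trip(u'\u044d')))
# 'й' is U+0439, encoded as d0 b9; both bytes are defined, so it survives.
print(repr(simulate_round_trip(u'\u0439')))
```

Under these assumptions the first call yields the same '\xd1?' pattern seen in the question, while the second passes through unchanged.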

Does this version of pydev give sys.stdin.encoding a decent value? And how does sys.stdin.encoding compare to the result of sys.getdefaultencoding()?
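One way to answer both questions at once is to print the two values side by side from inside the PyDev run. A small sketch follows; the function name encoding_report is an assumption, and getattr is used because sys.stdin may lack an encoding attribute in some environments.

```python
import sys


def encoding_report():
    """Collect the encodings Python reports for stdin and as its default."""
    return {
        'stdin': getattr(sys.stdin, 'encoding', None),
        'default': sys.getdefaultencoding(),
    }


report = encoding_report()
print('sys.stdin.encoding:      ', report['stdin'])
print('sys.getdefaultencoding():', report['default'])
```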

Daniel Martin
Very plausible explanation, thanks. sys.stdin.encoding == sys.getdefaultencoding() == 'utf-8'
ymv