How can I make Python 3 (3.1) print("Some text") to stdout in UTF-8, or output raw bytes?

Test.py

import sys

TestText = "Test - āĀēĒčČ..šŠūŪžŽ" # this is UTF-8
TestText2 = b"Test2 - \xc4\x81\xc4\x80\xc4\x93\xc4\x92\xc4\x8d\xc4\x8c..\xc5\xa1\xc5\xa0\xc5\xab\xc5\xaa\xc5\xbe\xc5\xbd" # just bytes
print(sys.getdefaultencoding())
print(sys.stdout.encoding)
print(TestText)
print(TestText.encode("utf8"))
print(TestText.encode("cp1252","replace"))
print(TestText2)

Output (the console uses cp1257; I replaced the non-ASCII characters with their byte values, written as [xHEX]):

utf-8
cp1257
Test - [xE2][xC2][xE7][xC7][xE8][xC8]..[xF0][xD0][xFB][xDB][xFE][xDE]
b'Test - \xc4\x81\xc4\x80\xc4\x93\xc4\x92\xc4\x8d\xc4\x8c..\xc5\xa1\xc5\xa0\xc5\xab\xc5\xaa\xc5\xbe\xc5\xbd'
b'Test - ??????..\x9a\x8a??\x9e\x8e'
b'Test2 - \xc4\x81\xc4\x80\xc4\x93\xc4\x92\xc4\x8d\xc4\x8c..\xc5\xa1\xc5\xa0\xc5\xab\xc5\xaa\xc5\xbe\xc5\xbd'

print() is just too smart... :D
There's no point in passing encoded text to print(): it only shows the repr of the bytes object, not the bytes themselves.
And it seems impossible to output raw bytes with print() at all, because it always encodes its output in sys.stdout.encoding.

For example:

print(chr(255))

raises an error:

Traceback (most recent call last):
  File "Test.py", line 1, in <module>
    print(chr(255));
  File "H:\Python31\lib\encodings\cp1257.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xff' in position 0: character maps to <undefined>
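
The failure can be reproduced without a console at all, since print() is just encoding the string with the stream's codec. A minimal sketch, assuming cp1257 stands in for the console encoding reported above:

```python
# Assumption: cp1257 is the console encoding, as reported by sys.stdout.encoding above.
ch = chr(255)  # 'ÿ', U+00FF

# Strict encoding (what print() effectively does) fails, because
# cp1257 has no byte assigned to U+00FF.
try:
    ch.encode("cp1257")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With an error handler the character is substituted instead.
replaced = ch.encode("cp1257", errors="replace")

# UTF-8 can represent every code point, so this always succeeds.
utf8_bytes = ch.encode("utf-8")

print(strict_ok)
print(replaced)
print(utf8_bytes)
```

So the error comes from the codec chosen for stdout, not from print() itself.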

By the way, print(TestText == TestText2.decode("utf8")) returns False,
although the printed output looks the same...
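
A quick check of that comparison, using the two literals from Test.py exactly as posted: the bytes literal starts with "Test2" while the string starts with "Test", so the decoded text differs in its prefix even though the accented characters round-trip correctly. The variable `decoded` is only introduced here for illustration.

```python
TestText = "Test - āĀēĒčČ..šŠūŪžŽ"
TestText2 = b"Test2 - \xc4\x81\xc4\x80\xc4\x93\xc4\x92\xc4\x8d\xc4\x8c..\xc5\xa1\xc5\xa0\xc5\xab\xc5\xaa\xc5\xbe\xc5\xbd"

decoded = TestText2.decode("utf8")  # bytes -> str

# The full strings differ because of the "Test" / "Test2" prefix.
print(decoded == TestText)

# Everything after the prefix is identical, so the decoding itself is fine.
print(decoded[len("Test2"):] == TestText[len("Test"):])
```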

EDIT:

How does Python 3 determine sys.stdout.encoding, and how can I change it?

I made a printRAW function which works fine :) (thanks, Zack)
(Actually it encodes the output as UTF-8, so it's not really raw...)

def printRAW(*Text):  
    RAWOut = open(1, 'w', encoding='utf8', closefd=False)  
    print(*Text, file=RAWOut)  
    RAWOut.flush()  
    RAWOut.close()   

printRAW("Cool", TestText)  

Output (now printed in UTF-8):

Cool Test - āĀēĒčČ..šŠūŪžŽ

printRAW(chr(252)) also nicely prints ü (in UTF-8, [xC3][xBC]) without errors :)

Now I'm looking for a better solution, if there is one...

+1  A: 

This is the best I can dope out from the manual, and it's a bit of a dirty hack:

utf8stdout = open(1, 'w', encoding='utf-8', closefd=False) # fd 1 is stdout
print(whatever, file=utf8stdout)

It seems like file objects should have a method to change their encoding, but AFAICT there isn't one.

If you write to utf8stdout and then write to sys.stdout without calling utf8stdout.flush() first, or vice versa, bad things may happen.
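
One way to keep the flush ordering under control is to hide it in a small helper. This is only a sketch: `print_utf8` and its `fd` parameter are made-up names, and it assumes the descriptor is writable and that sys.stdout is the only other writer on it.

```python
import os
import sys

def print_utf8(*args, fd=1):
    """Print args to file descriptor fd as UTF-8 (hypothetical helper)."""
    # Push out anything sys.stdout has buffered, so the output order is kept.
    sys.stdout.flush()
    # closefd=False: closing the wrapper must not close the descriptor itself.
    with open(fd, "w", encoding="utf-8", closefd=False) as out:
        print(*args, file=out)
    # Leaving the with-block flushes and closes the wrapper only.

if __name__ == "__main__":
    print_utf8("Cool", "Test - \u0101\u0100")
```

Opening a fresh wrapper for every call is wasteful, but it means nothing stays buffered in a second file object between calls.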

Zack
+3  A: 

First, a correction:

TestText = "Test - āĀēĒčČ..šŠūŪžŽ" # this is NOT UTF-8; it is a Unicode string in Python 3.x.
TestText2 = TestText.encode('utf8') # THIS is "just bytes" in UTF-8.

Now, to send UTF-8 to stdout, regardless of the console's encoding, use the right tool for the job:

import sys
sys.stdout.buffer.write(TestText2)

"buffer" is a raw interface to stdout.

Mark Tolonen
Thanks :) By the way, when I wrote "Test - āĀēĒčČ..šŠūŪžŽ" # this is UTF-8, I meant that the string is written in UTF-8 by the IDE: the .py file is encoded as UTF-8, and when Python parses the file it converts the string to a Python Unicode string...
davispuh