views:

84

answers:

1

I'm programming in Python 3 and I'm having a small problem which I can't find any reference to it on the net.

As far as I understand the default string in is utf-16, but I must work with utf-8, I can't find the command that will convert from the default one to utf-8. I'd appreciate your help very much.

+6  A: 

In Python 3 there are two different datatypes important when you are working with string manipulation. First there is the string class, an object that represents unicode code points. Important to get is that this string is not some bytes, but really a sequence of characters. Secondly, there is the bytes class, which is just a sequence of bytes, often representing an string stored in an encoding (like utf-8 or iso-8859-15).

What does this mean for you? As far as I understand you want to read and write utf-8 files. Let's make a program that replaces all 'ć' with 'ç' characters

def main():
    # Let's first open an output file. See how we give an encoding to let python know, that when we print something to the file, it should be encoded as utf-8
    with open('output_file', 'w', encoding='utf-8') as out_file:
        # read every line. We give open() the encoding so it will return a Unicode string. 
        for line in open('input_file', encoding='utf-8'):
            #Replace the characters we want. When you define a string in python it also is automatically a unicode string. No worries about encoding there. Because we opened the file with the utf-8 encoding, the print statement will encode the whole string to utf-8.
            print(line.replace('ć', 'ç'), out_file)

So when should you use bytes? Not often. An example I could think of would be when you read something from a socket. If you have this in an bytes object, you could make it a unicode string by doing bytes.decode('encoding') and visa versa with str.encode('encoding'). But as said, probably you won't need it.

Still, because it is interesting, here the hard way, where you encode everything yourself:

def main():
    # Open the file in binary mode. So we are going to write bytes to it instead of strings
    with open('output_file', 'wb') as out_file:
        # read every line. Again, we open it binary, so we get bytes 
        for line_bytes in open('input_file', 'rb'):
            #Convert the bytes to a string
            line_string = bytes.decode('utf-8')
            #Replace the characters we want. 
            line_string = line_string.replace('ć', 'ç')
            #Make a bytes to print
            out_bytes = line_string.encode('utf-8')
            #Print the bytes
            print(out_bytes, out_file)

Good reading about this topic (string encodings) is http://www.joelonsoftware.com/articles/Unicode.html. Really recommended read!

Source: http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

(P.S. As you see, I didn't mention utf-16 in this post. I actually don't know whether python uses this as internal decoding or not, but it is totally irrelevant. At the moment you are working with a string, you work with characters (code points), not bytes.

Peter Smit
Python does use UTF-16 as the internal encoding on Windows. On Linux, it uses UTF-32.
dan04
hi,thanks for your answers.Dan04 do you know how can I tell it to use only utf-8?
idan
@idan Why would you want that? Anyway, it is not possible except when you are modifiying and recompiling Python yourself...
Peter Smit
I want this because I must work with a tokenizer which gives me the word in utf-8 and I'm not sure that if, let's I'll check a character defined by python such as 'א' (a hebrew character) it will compare it correctly with the one it the word. did you get what I said?
idan
Is this tokenizer in Python? Then just transform the bytes utf-8 into the unicode string and comparison will work correctly. There should no need to be guessing or unsure about encodings. Just check what you have and use the methods mentioned in my answer and you will be fine.
Peter Smit