ansaurus

Question

Unicode (utf8) reading and writing to files in python

Answer 1

+1 A:

The \x.. sequence is specific to Python; it is not a universal byte escape sequence.

How you actually enter UTF-8-encoded non-ASCII text depends on your OS and/or your editor. Here's how you do it in Windows. On OS X, to enter an a with an acute accent you can just hit option-e, then a, and almost all text editors on OS X support UTF-8.

ʞɔıu 2009-01-29 15:10:26

Answer 2

+7 A:

In the notation

u'Capit\xe1n\n'

the "\xe1" represents just one character (code point U+00E1). "\x" tells you that "e1" is in hexadecimal. When you write

Capit\xc3\xa1n

into your file, you have the literal text "\xc3" in it. That is 4 bytes, and your code reads all of them. You can see this when you display them:

>>> open('f2').read()
'Capit\\xc3\\xa1n\n'

You can see that the backslash is escaped by a backslash. So you have four bytes in your string: "\", "x", "c" and "3".
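A small sketch of the difference, written for Python 3 (an assumption; the session above is Python 2). It compares the bytes an editor stores when you type the escape sequence literally with the bytes of the same word encoded as UTF-8. The variable names are only illustrative.

```python
# Assumes Python 3. What the editor stored when the escapes were typed
# as plain text: a backslash, "x", "c", "3", and so on.
typed = b"Capit\\xc3\\xa1n\n"

# The same word actually encoded as UTF-8: the accented letter is 2 bytes.
encoded = u"Capit\xe1n\n".encode("utf-8")

print(len(typed))    # 15
print(len(encoded))  # 9

# The four bytes that stand in for what should have been one byte.
four_bytes = typed[5:9]
print(four_bytes == b"\\xc3")  # True
```

The 15 bytes match the hex dump of f2 quoted further down in this thread.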

Edit:

As others pointed out in their answers you should just enter the characters in the editor and your editor should then handle the conversion to UTF-8 and save it.

If you actually have a string in this format you can use the string_escape codec to decode it into a normal string:

In [15]: print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán

The result is a string that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. If you want to have a unicode string you have to decode again with UTF-8.

To your edit: you don't have UTF-8 in your file. To see what UTF-8 would actually look like:

s = u'Capit\xe1n\n'
sutf8 = s.encode('UTF-8')
open('utf-8.out', 'w').write(sutf8)

Compare the content of the file utf-8.out to the content of the file you saved with your editor.
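If you want to make that comparison from Python rather than with a hex editor, something like the following should work. It is a sketch for Python 3 (an assumption), and the scratch directory is a stand-in so the example does not overwrite anything in your working directory.

```python
import binascii
import os
import tempfile

# Assumes Python 3. The directory is a throwaway location for the example.
path = os.path.join(tempfile.mkdtemp(), "utf-8.out")

text = u"Capit\xe1n\n"
with open(path, "wb") as handle:
    handle.write(text.encode("utf-8"))

# Read the raw bytes back and show them as hex.
with open(path, "rb") as handle:
    raw = handle.read()

hex_dump = binascii.hexlify(raw).decode("ascii")
print(hex_dump)  # 4361706974c3a16e0a
```

The "c3a1" pair in the middle is the UTF-8 encoding of the accented letter.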

unbeknown 2009-01-29 15:11:59

So, what's the point of the utf-8 encoded format if python can read in files using it? In other words, is there any ascii representation that python will read in \xc3 as 1 byte?

Gregg Lind 2009-01-29 16:51:22

The answer to your "So, what's the point…" question is "Mu." (since Python can read files encoded in UTF-8). For your second question: \xc3 is not part of the ASCII set. Perhaps you mean "8-bit encoding" instead. You are confused about Unicode and encodings; it's ok, many are.

ΤΖΩΤΖΙΟΥ 2009-01-30 12:16:16

Try reading this as a primer: http://www.joelonsoftware.com/articles/Unicode.html

ΤΖΩΤΖΙΟΥ 2009-01-30 12:16:54

Answer 3

+1 A:

Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but interprets them as text. That's why you get the double backslashes in the last line -- it's now a real backslash + xc3 etc in your file.

If you want to read and write encoded files in Python, best use the codecs module.

Pasting text between the terminal and applications is difficult, because you don't know which program will interpret your text using which encoding. You could try the following:

>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
CapitÃ¡n

Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the roundtrip should work.
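To see why the Latin-1 reading produces that garbled output, and why the round trip is lossless, here is a short sketch for Python 3 (an assumption; the snippet above is Python 2). Latin-1 maps every byte to exactly one character, so decoding and re-encoding with it never loses information.

```python
# Assumes Python 3. The UTF-8 bytes of the word with the accented letter.
raw = u"Capit\xe1n".encode("utf-8")

# Reading those bytes as Latin-1 turns the two bytes 0xC3 0xA1
# into two separate characters.
garbled = raw.decode("latin-1")
print(garbled == u"Capit\xc3\xa1n")  # True

# Going back through Latin-1 recovers the original bytes exactly.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == u"Capit\xe1n")  # True
```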

Torsten Marek 2009-01-29 15:13:11

Answer 4

A:

You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?

Answer: You can't unless the file format provides for this. XML, for example, begins with:

<?xml version="1.0" encoding="utf-8"?>

This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.

As for your editor, you must check if it offers some way to set the encoding of a file.

The point of utf-8 is to be able to encode 21bit characters (Unicode) as an 8bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
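As a rough illustration of how UTF-8 spreads code points over 8-bit units, the sketch below (Python 3 assumed) encodes four sample characters of increasing code point. The samples are my own choice and not from the question.

```python
# Assumes Python 3. Sample characters: ASCII letter, accented letter,
# euro sign, and a character outside the 16-bit range.
samples = [u"A", u"\xe1", u"\u20ac", u"\U0001F600"]

# Number of bytes each one needs in UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]
```

ASCII characters stay one byte each, which is why a pure-ASCII file is also valid UTF-8.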

The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console may only be able to display ASCII. In order to display unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the accented character and save the file).

That said, you can use the Python function eval() to turn an escaped string into a string:

>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1

As you can see, the string "\xc3" has been turned into a single character. This is now an 8bit string, utf-8 encoded. To get unicode:

>>> x.decode('utf-8')
u'Capit\xe1n\n'
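Since eval() will run any expression it is given, a more cautious variant is ast.literal_eval, which only accepts literals. This is a swapped-in technique, not what the answer above uses, and the sketch assumes Python 3 and that the text contains no single quotes.

```python
import ast

# Assumes Python 3. The characters as stored in the file:
# backslash, x, c, 3, and so on (nothing is escaped yet).
file_text = "Capit\\xc3\\xa1n\\n"

# Wrap the text in a bytes literal and let literal_eval interpret the escapes.
# This breaks if file_text itself contains a single quote.
raw = ast.literal_eval("b'" + file_text + "'")
print(len(raw))  # 9

# The result is UTF-8 encoded bytes; decode to get text.
text = raw.decode("utf-8")
print(text == u"Capit\xe1n\n")  # True
```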

[EDIT] Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex:

0000000: 4361 7069 745c 7863 335c 7861 316e  Capit\xc3\xa1n

codecs.open('f2','rb', 'utf-8'), for example, reads them all in a separate chars (expected) Is there any way to write to a file in ascii that would work?

Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special", which is what the sequence "\x" does. It says: the next two characters are the hexadecimal code of a single character. "\u" does the same using four hex digits to encode unicode up to 0xffff (65535).

So you can't directly write unicode to ascii (because ascii simply doesn't contain the same characters). What you can do is write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as utf-8, in which case, you need an 8bit safe stream.
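The two options can be put side by side. The sketch below assumes Python 3 and uses the built-in unicode_escape codec for the escaped form; the answer does not name a codec for this, so treat that choice as mine.

```python
# Assumes Python 3.
text = u"Capit\xe1n"

# Option 1: string escapes. Every byte is plain ASCII.
escaped = text.encode("unicode_escape")
print(escaped == b"Capit\\xe1n")              # True
print(all(b < 128 for b in bytearray(escaped)))  # True

# Option 2: UTF-8. Needs an 8-bit safe stream because of bytes >= 128.
utf8 = text.encode("utf-8")
print(any(b >= 128 for b in bytearray(utf8)))    # True

# The escaped form can be read back without loss.
restored = escaped.decode("unicode_escape")
print(restored == text)  # True
```

Note that the escaped form here carries the code point (e1), not the UTF-8 bytes (c3 a1) that appear in f2.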

Your solution using decode('string-escape') does work, but you must be aware of how much memory it uses: three times the amount needed when using codecs.open().

Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "á" but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.

Aaron Digulla 2009-01-29 16:54:42

I think there are some pieces missing here: the file f2 contains: hex: 0000000: 4361 7069 745c 7863 335c 7861 316e 0a Capit\xc3\xa1n. codecs.open('f2','rb', 'utf-8'), for example, reads them all in as separate chars (expected). Is there any way to write to a file in ascii that would work?

Gregg Lind 2009-01-29 17:21:07

Answer 5

A:

So, I've found a solution for what I'm looking for, which is:

print open('f2').read().decode('string-escape').decode("utf-8")

There are some unusual codecs that are useful here. This particular reading allows one to take utf-8 representations from within python, copy them into an ascii file, and have them be read in to unicode. Under the "string-escape" decode, the slashes won't be doubled.

This allows for the sort of round trip that I was imagining.
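Python 3 has no 'string-escape' codec, so the one-liner above only runs on Python 2. A possible Python 3 equivalent, offered as a sketch and not as the author's method, goes through unicode_escape and Latin-1 to get the raw bytes back before decoding them as UTF-8.

```python
# Assumes Python 3. The bytes as stored in f2: escapes typed out as text.
stored = b"Capit\\xc3\\xa1n\n"

# Step 1: interpret the \xNN escapes. Each one becomes the character
# with that code point, so we get two characters, not two bytes.
step1 = stored.decode("unicode_escape")

# Step 2: Latin-1 maps code points 0-255 straight back to bytes.
step2 = step1.encode("latin-1")
print(step2 == b"Capit\xc3\xa1n\n")  # True

# Step 3: now the bytes really are UTF-8.
text = step2.decode("utf-8")
print(text == u"Capit\xe1n\n")  # True
```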

Gregg Lind 2009-01-29 20:01:27

Answer 6

+7 A:

Rather than mess with the encode, decode methods I find it easier to use the open method from the codecs module.

>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")

Then after calling f's read() function, a decoded unicode object is returned.

>>> f.read()
u'Capit\xe1l\n\n'

If you know the encoding of a file, using the codecs package is going to be much less confusing.

See http://docs.python.org/library/codecs.html#codecs.open
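A similar approach that runs on both Python 2.6+ and Python 3 is io.open, which takes an encoding argument (on Python 3 the built-in open is the same function). The sketch below writes and reads a scratch file; the path is a stand-in for the example.

```python
import io
import os
import tempfile

# The directory is a throwaway location for the example.
path = os.path.join(tempfile.mkdtemp(), "test")

# Write text; the file object encodes it to UTF-8 on the way out.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"Capit\xe1n\n")

# Read it back; the file object decodes UTF-8 on the way in.
with io.open(path, "r", encoding="utf-8") as handle:
    text = handle.read()

# Binary mode shows the bytes that are really on disk.
with io.open(path, "rb") as handle:
    raw = handle.read()

print(text == u"Capit\xe1n\n")  # True
print(b"\xc3\xa1" in raw)        # True
```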

Tim Swast 2009-05-10 00:45:58
