ansaurus

Question

Answer 1

+3 A:

You talk of "raw" Unicode strings. What does that mean? Unicode itself is not an encoding; it is a character set, and there are several encodings (UTF-8, UTF-16, ...) for storing Unicode characters (read this post by Joel).

The open function in Python 3.0 takes an optional encoding argument that lets you specify the encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, have a look at the codecs module, which also provides an open function that allows specifying the encoding of the file.
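A minimal sketch of the Python 3 variant, assuming the file is UTF-8. The helper name `read_text` and the temporary demo file are made up for this illustration; on Python 2.x, `codecs.open(path, "r", encoding="utf-8")` plays the same role.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file, decoding its bytes with the given encoding.

    `read_text` is a hypothetical helper; the default of UTF-8 is an assumption.
    """
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Round-trip demo: write a line containing a capital pi, then read it back.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("capital pi: \u03a0\n")

text = read_text(demo_path)
os.remove(demo_path)
print(repr(text))
```

If the encoding you pass does not match the file, you either get a `UnicodeDecodeError` or silently wrong characters, so it is worth finding out the real encoding first.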

Edit: alternatively, why not just let those poor characters be, and specify the encoding of your LaTeX file at the top:

\usepackage[utf8]{inputenc}

(I have never tried this, but I figure it should work. You may need to replace utf8 with utf8x, though.)

Stephan202 2009-05-26 10:09:09

What I mean by "raw" Unicode is that the sign is not represented by a code; the symbol itself appears in the text, as if you had inserted it in Word via Insert > Symbol. An example would be the symbol for a "capital pi", which is unfortunately not encoded properly as Π (which can easily be displayed in LaTeX using the utf8(x) package). If I open the text with the symbols in LaTeX, they are simply not displayed at all and the information gets lost, so I need to take care of it. But I am going to have a look at the other hint concerning the codecs module ... Thanks :)

2009-05-26 10:33:32

In this case, you need to determine the encoding of the input document. If the document is XML, the encoding should be declared in the first line (encoding="..."; "utf-8" is the default). For HTML, look for "charset".

Aaron Digulla 2009-05-26 10:36:42

Answer 2

A:

You need to determine the "encoding" of the input document. Unicode defines over a million code points, but files can only store 8-bit values (0-255). So the Unicode text must be encoded as a sequence of bytes in some way.
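To make that concrete, here is a small sketch using the capital pi from the question (U+03A0). The variable names are only for illustration.

```python
capital_pi = "\u03a0"  # GREEK CAPITAL LETTER PI

# The same character becomes different byte sequences in different encodings.
utf8_bytes = capital_pi.encode("utf-8")
utf16_bytes = capital_pi.encode("utf-16-le")

print(list(utf8_bytes))   # every value fits in 0-255
print(list(utf16_bytes))

# Decoding with the matching encoding restores the character ...
restored = utf8_bytes.decode("utf-8")

# ... while decoding with the wrong one yields different characters.
garbled = utf8_bytes.decode("latin-1")
print(repr(restored), repr(garbled))
```

This is why the bytes alone are not enough: you also need to know which encoding produced them.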

If the document is XML, the encoding should be declared in the first line (encoding="..."; "utf-8" is the default if there is no "encoding" field). For HTML, look for "charset".
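A rough sketch of that lookup, under some assumptions: `sniff_declared_encoding` is a hypothetical helper, the regular expressions are simple heuristics rather than a real XML/HTML parser, and it only works when the declaration itself is stored in an ASCII-compatible encoding (so not UTF-16).

```python
import re

# Heuristic patterns, not a full parser (assumption for this sketch).
XML_DECL = re.compile(br"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)["']""")
HTML_CHARSET = re.compile(br"""charset\s*=\s*["']?([A-Za-z0-9._-]+)""", re.IGNORECASE)


def sniff_declared_encoding(head, default=None):
    """Return the encoding name declared in the first bytes of a document."""
    match = XML_DECL.search(head)
    if match:
        return match.group(1).decode("ascii")
    if head.lstrip().startswith(b"<?xml"):
        # XML declaration without an "encoding" field: fall back to utf-8.
        return "utf-8"
    match = HTML_CHARSET.search(head)
    if match:
        return match.group(1).decode("ascii")
    return default


print(sniff_declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><doc/>'))
print(sniff_declared_encoding(b'<html><head><meta charset="windows-1252"></head>'))
```

You would typically pass in the first kilobyte or so of the file, read in binary mode (`open(path, "rb")`), and then reopen the file with the encoding that comes back.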

If all else fails, open the document in an editor where you can set the encoding (jEdit, for example). Try different encodings until the text looks right. Then use that encoding as the encoding parameter for codecs.open() in Python.
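The same trial and error can be done from Python. This is only a sketch: `decode_candidates` is a made-up helper and the list of candidate encodings is an assumption you should adapt to your data.

```python
def decode_candidates(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (encoding, text) pairs for every candidate that decodes without error."""
    results = []
    for name in encodings:
        try:
            results.append((name, raw.decode(name)))
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; skip it.
            continue
    return results


# Bytes of a capital pi stored as UTF-8 (sample data for the demo).
sample = "\u03a0".encode("utf-8")
for name, text in decode_candidates(sample):
    print(name, repr(text))
```

Note that a successful decode does not prove the encoding is right: latin-1 accepts any byte sequence, so you still have to look at the results and pick the one where the text looks right.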

Aaron Digulla 2009-05-26 10:39:30

Answer 3

A:

Please, first, read this:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Then, come back and ask questions.

bendin 2009-05-26 10:42:40


Reading "raw" Unicode-strings in Python
