Dear all,

I am quite new to Python, so my question might be silly, but even after reading through a lot of threads I didn't find an answer to it.

I have a mixed source document which contains HTML, XML, LaTeX and other text formats, and which I am trying to convert into a LaTeX-only format.

Therefore, I have used Python to recognise the different commands as regular expressions and replace them with the appropriate LaTeX command. Everything has worked out fine so far.

Now I am left with some "raw-type" Unicode signs, such as the Greek letters. Unfortunately, there are just too many to replace by hand, so I am looking for a smart way to do this too. Is there a way for Python to recognise / read them? And how do I tell Python to recognise / read e.g. Pi written as a Greek letter?

A minimal example of the code I use is:

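import re

# Read the whole source document into memory.
fh = open('SOURCE_DOCUMENT', 'r')
stuff = fh.read()
fh.close()

# Replace every match of the pattern with its LaTeX equivalent.
new_stuff = re.sub('READ', 'REPLACE', stuff)

# Write the converted text to the output document.
fh = open('LATEX_DOCUMENT', 'w')
fh.write(new_stuff)
fh.close()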

I am not sure whether this is important information or not, but I am using Python 2.6 on Windows.

I would be really glad if someone could give me a hint, at least about where to find the relevant information or how this might work. Or tell me whether I am completely wrong and Python can't do this job ...

Many thanks in advance.
Cheers,
Britta

+3  A: 

You talk of ``raw'' Unicode strings. What does that mean? Unicode itself is not an encoding, but there are different encodings to store Unicode characters (read this post by Joel).

The open function in Python 3.0 takes an optional encoding argument that lets you specify the encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, have a look at the codecs module, which also provides an open function that allows specifying the encoding of the file.
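As a rough sketch of how this could look with codecs.open, the snippet below reads a file with an explicit encoding and replaces a few Greek letters with LaTeX commands. The mapping GREEK_TO_LATEX, the helper names greek_to_latex and convert_file, the UTF-8 default and the choice to wrap each command in $...$ are all assumptions made for illustration; the real document may need a different encoding and a longer table.

```python
import codecs
import re

# Hypothetical mapping; extend it with whatever characters the document contains.
GREEK_TO_LATEX = {
    u'\u03a0': u'\\Pi',     # capital pi
    u'\u03c0': u'\\pi',     # small pi
    u'\u03b1': u'\\alpha',  # small alpha
    u'\u03b2': u'\\beta',   # small beta
}

# One character class that matches any key of the mapping.
GREEK_PATTERN = re.compile(u'[' + u''.join(GREEK_TO_LATEX) + u']')


def greek_to_latex(text):
    """Replace each mapped character with its LaTeX command in inline math."""
    return GREEK_PATTERN.sub(
        lambda match: u'$' + GREEK_TO_LATEX[match.group(0)] + u'$', text)


def convert_file(source_path, target_path, encoding='utf-8'):
    """Decode the source file, convert it, and write it back encoded.

    The encoding is an assumption; pass the one the document really uses.
    """
    with codecs.open(source_path, 'r', encoding=encoding) as source:
        text = source.read()
    with codecs.open(target_path, 'w', encoding=encoding) as target:
        target.write(greek_to_latex(text))
```

Because codecs.open returns decoded Unicode text, the regular expression sees one character per Greek letter rather than a sequence of bytes, so the same re.sub approach as in the question keeps working.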

Edit: alternatively, why not just let those poor characters be, and specify the encoding of your LaTeX file at the top:

\usepackage[utf8]{inputenc}

(I never tried this, but I figure it should work. You may need to replace utf8 by utf8x, though)

Stephan202
What I mean by "raw" Unicode is that the sign is not represented by a code; the symbol itself is found in the text, as you would insert it in Word via < Insert Symbol >. An example would be the symbol for a "capital pi", which is unluckily not encoded properly as Π (which can be easily displayed in LaTeX using the utf8(x) package). If I open the text with the symbols in LaTeX, it is simply not displayed at all and the information gets lost, so I need to take care of it. But I am going to have a look at the other hint concerning the codecs module ... Thanks :)
In this case, you need to determine the encoding of the input document. If the document is XML, it should be in the first line (encoding="..."; "utf-8" is the default). For HTML, look for "charset".
Aaron Digulla
A: 

You need to determine the "encoding" of the input document. Unicode can represent more than a million characters, but files can only store 8-bit values (0-255). So the Unicode text must be encoded in some way.

If the document is XML, it should be in the first line (encoding="..."; "utf-8" is the default if there is no "encoding" field). For HTML, look for "charset".
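A minimal sketch of that lookup, assuming the declaration sits within the first kilobyte of the file and is written in plain ASCII. The function name declared_encoding, the two regular expressions and the 1024-byte limit are illustrative choices, not a complete XML or HTML parser.

```python
import re

# Matches encoding="..." inside an XML declaration such as <?xml ... ?>.
XML_DECL = re.compile(br'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')
# Matches charset=... as used in HTML meta tags.
HTML_CHARSET = re.compile(br'charset=["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def declared_encoding(raw_bytes, default='utf-8'):
    """Return the encoding named in the first 1024 bytes, or the default."""
    head = raw_bytes[:1024]
    for pattern in (XML_DECL, HTML_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode('ascii').lower()
    return default
```

The file would be opened in binary mode ('rb') for this check, and the returned name could then be passed as the encoding when reading the text properly.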

If all else fails, open the document in an editor where you can set the encoding (jEdit, for example). Try them until the text looks right. Then use this value as the encoding parameter for codecs.open() in Python.
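The same trial and error can be done from Python. The sketch below decodes the raw bytes with a few candidate encodings and returns every result that decodes without an error, so that you can inspect them by eye. The function name try_encodings and the candidate list are assumptions; note that latin-1 accepts any byte sequence, so a successful decode does not prove the guess is right.

```python
def try_encodings(raw_bytes, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return (encoding, decoded_text) pairs for every candidate that decodes."""
    results = []
    for name in candidates:
        try:
            results.append((name, raw_bytes.decode(name)))
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; skip it.
            continue
    return results
```

Printing a short slice of each decoded text that should contain a known symbol, such as a Greek letter, is usually enough to tell which candidate "looks right".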

Aaron Digulla
A: 

Please, first, read this:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Then, come back and ask questions.

bendin