I am trying to parse a file encoded in UTF-8. Every operation works except writing to a file (or at least I think so). A minimal working example follows:

from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse('example.txt', parser)
tree.write('aaaaaaaaaaaaaaaaa.html')

example.txt:

<html>
    <body>
        <invalid html here/>
        <interesting attrib1="yes">
            <group>
                <line>
                    δεδομένα1
                </line>
            </group>
            <group>
                <line>
                    δεδομένα2
                </line>
            </group>
            <group>
                <line>
                    δεδομένα3
                </line>
            </group>
        </interesting>
    </body>
</html> 

I am already aware of a similar previous question, but I could not solve the problem, whether I left the output encoding unspecified or set it to utf8 or iso-8859-7.

I have concluded that the file is in UTF-8 since it displays correctly in Chrome when this encoding is selected. My editor (Kate) agrees.

I get no runtime error, but the output is not as desired. Example output with tree.write('aaaaaaaaaaaaaaaaa.html', encoding='utf-8'):

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
        <invalid html="" here=""/><interesting attrib1="yes"><group><line>
                    δεδομένα1
                </line></group><group><line>
                    δεδομένα2
                </line></group><group><line>
                    δεδομένα3
                </line></group></interesting></body></html>
+1  A: 

The obvious problem is that HTMLParser treats the input file as ANSI by default, i.e. the UTF-8 bytes are misinterpreted as single-byte (8-bit) character codes. Passing the encoding explicitly fixes this:

parser = etree.HTMLParser(encoding="utf-8")

If you want to see what I mean by the misinterpretation, have Python print repr(tree.xpath("//line")[0].text) with and without HTMLParser's encoding parameter.
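
The kind of misinterpretation described here can be reproduced without lxml at all. The following is only a rough sketch using the standard library: it assumes Latin-1 as a stand-in for the "ANSI" 8-bit encoding (the exact default the parser falls back to may differ), and the variable names are made up for illustration.

```python
# Illustration only: Latin-1 is an ASSUMED stand-in for the "ANSI"
# 8-bit encoding; the parser's actual fallback may be a different one.
text = "δεδομένα1"

# The bytes as they would be stored in a UTF-8 encoded file.
utf8_bytes = text.encode("utf-8")

# Misinterpretation: every single byte is treated as one character.
misread = utf8_bytes.decode("latin-1")

# Correct interpretation: multi-byte sequences are decoded as UTF-8.
correct = utf8_bytes.decode("utf-8")

print(len(text), len(utf8_bytes), len(misread))
print(repr(misread))
print(repr(correct))
```

Each Greek letter occupies two bytes in UTF-8, so the misread string has roughly twice as many characters as the original, and none of them are Greek. As long as nothing else has altered the data, the damage is reversible: misread.encode("latin-1").decode("utf-8") gives back the original text.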

AndiDog