Character encoding is violated

I am trying to parse a file encoded in UTF-8. Every operation works except writing the result to a file (or at least I think so). A minimal working example follows:

from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse('example.txt', parser)
tree.write('aaaaaaaaaaaaaaaaa.html')

example.txt:

<html>
    <body>
        <invalid html here/>
        <interesting attrib1="yes">
            <group>
                <line>
                    δεδομένα1
                </line>
            </group>
            <group>
                <line>
                    δεδομένα2
                </line>
            </group>
            <group>
                <line>
                    δεδομένα3
                </line>
            </group>
        </interesting>
    </body>
</html>

I am aware of a similar previous question, but it did not help me solve the problem: the output is wrong whether I leave the output encoding unspecified or set it to utf8 or iso-8859-7.

I have concluded that the file is in UTF-8 because it displays correctly in Chrome when I choose this encoding. My editor (Kate) agrees.

I get no runtime error, but the output is not as desired. Example output with tree.write('aaaaaaaaaaaaaaaaa.html', encoding='utf-8'):

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
        <invalid html="" here=""/><interesting attrib1="yes"><group><line>
                    Î´ÎµÎ´Î¿Î¼ÎÎ½Î±1
                </line></group><group><line>
                    Î´ÎµÎ´Î¿Î¼ÎÎ½Î±2
                </line></group><group><line>
                    Î´ÎµÎ´Î¿Î¼ÎÎ½Î±3
                </line></group></interesting></body></html>
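
For what it is worth, the garbled text above looks like what you get when UTF-8 bytes are read one byte per character as ISO-8859-1 (Latin-1) and then written out again as UTF-8. The sketch below reproduces that pattern with the standard library only. It is an illustration of the symptom, not of what lxml does internally, and the variable names are made up for this example:

```python
# Hypothetical illustration; these names are not part of the original script.
original = "δεδομένα1"

# The bytes a UTF-8 encoded file would contain for this text.
utf8_bytes = original.encode("utf-8")

# Misread those bytes as Latin-1: every byte becomes one character.
misread = utf8_bytes.decode("latin-1")

# Writing the misread text out as UTF-8 preserves the damage.
written = misread.encode("utf-8")

print(misread)

# Undoing the wrong decoding step recovers the original text.
recovered = written.decode("utf-8").encode("latin-1").decode("utf-8")
print(recovered == original)
```

If the round trip recovers the Greek text, the input was probably decoded with the wrong charset while parsing, which would explain why changing only the output encoding makes no difference.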
