I am trying to parse a file encoded in utf-8. Every operation works except writing the result to a file (or at least I think so). A minimal working example follows:
from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse('example.txt', parser)
tree.write('aaaaaaaaaaaaaaaaa.html')
example.txt:
<html>
<body>
<invalid html here/>
<interesting attrib1="yes">
<group>
<line>
δεδομένα1
</line>
</group>
<group>
<line>
δεδομένα2
</line>
</group>
<group>
<line>
δεδομένα3
</line>
</group>
</interesting>
</body>
</html>
I am already aware of a similar previous question, but it did not solve my problem: the output is wrong whether I leave the output encoding unspecified or set it to utf8 or iso-8859-7.
I have concluded that the file is in utf8 since it displays correctly in Chrome when I choose this encoding. My editor (Kate) agrees.
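The encoding can also be double-checked from Python rather than by eye. The following is only a minimal sketch using the standard library; the helper name is_valid_utf8 is hypothetical and not part of my script. It relies on the fact that Greek text encoded as iso-8859-7 does not form valid utf-8 byte sequences, so a strict decode tells the two apart.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode('utf-8')  # strict decoding raises on invalid sequences
    except UnicodeDecodeError:
        return False
    return True

# Greek sample text taken from example.txt, encoded in two different ways
sample = 'δεδομένα1'
print(is_valid_utf8(sample.encode('utf-8')))
print(is_valid_utf8(sample.encode('iso-8859-7')))
```

To apply the same check to the real input, the bytes can be read with open('example.txt', 'rb') and passed to the helper. A successful decode only shows that the bytes are valid utf-8, not that the parser will assume that encoding.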
I get no runtime error, but the output is not as desired.
Example output with tree.write('aaaaaaaaaaaaaaaaa.html', encoding='utf-8'):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<invalid html="" here=""/><interesting attrib1="yes"><group><line>
δεδομÎνα1
</line></group><group><line>
δεδομÎνα2
</line></group><group><line>
δεδομÎνα3
</line></group></interesting></body></html>