A sever I can't influence sends very broken XML.
Specifically, a Unicode WHITE STAR would get encoded as UTF-8 (E2 98 86) and then translated using a Latin-1 to HTML entity table. What I get is â 98 86
(9 bytes) in a file that's declared as utf-8 with no DTD.
I couldn't configure W3C tidy in a way that doesn't garble this irreversibly. I only found how to make lxml skip it silently. SAX uses Expat, which cannot recover after encountering this. I'd like to avoid BeautifulSoup for speed reasons.
What else is there?