This is somewhat related to my question here.
I process a lot of text (mainly HTML and XML) fetched via HTTP. I'm looking for a Python library that can do smart encoding detection based on several strategies and convert the text to Unicode using the best possible character-encoding guess.
I found that chardet does auto-detection extremely well. However, auto-detecting everything is a problem, because it is SLOW and goes against the standards. As per the chardet FAQ, I don't want to screw the standards.
From the same FAQ, here is the list of places where I want to look for the encoding:

- charset parameter in the HTTP Content-Type header.
- <meta http-equiv="content-type"> element in the <head> of a web page, for HTML documents.
- encoding attribute in the XML prolog, for XML documents.
- Auto-detect the character encoding as a last resort.
Basically, I want to be able to look in all those places and also deal with conflicting information automatically.
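To make it concrete, here is a rough sketch of the lookup order I have in mind (purely illustrative, assuming chardet is installed; the function name and the regexes are mine, and the regexes are deliberately simplistic):

```python
import re
import chardet  # used only as the last-resort fallback


def guess_encoding(raw_bytes, content_type_header=None):
    """Return the best encoding guess for a fetched document.

    Checks declared encodings in priority order and only falls back
    to chardet's auto-detection when nothing is declared.
    """
    # 1. charset parameter in the HTTP Content-Type header
    if content_type_header:
        match = re.search(r'charset=["\']?([\w-]+)', content_type_header, re.I)
        if match:
            return match.group(1)

    # Only inspect the start of the body for declarations
    head = raw_bytes[:4096].decode('ascii', errors='ignore')

    # 2. <meta http-equiv="content-type"> element in the <head> (HTML)
    match = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', head, re.I)
    if match:
        return match.group(1)

    # 3. encoding attribute in the XML prolog (XML)
    match = re.search(r'<\?xml[^>]+encoding=["\']?([\w-]+)', head, re.I)
    if match:
        return match.group(1)

    # 4. Auto-detect as a last resort (slow)
    return chardet.detect(raw_bytes)['encoding']
```

But this doesn't resolve conflicts between the sources, and I'd rather not maintain such logic myself.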
Is there such a library out there, or do I need to write it myself?