Problem
When screen-scraping a webpage with Python, you have to know the character encoding of the page. If you get the character encoding wrong, then your output will be garbled.
People usually use some rudimentary technique to detect the encoding: they take the charset from the HTTP header, or the charset declared in a meta tag, or they run an encoding detector (which ignores both headers and meta tags). Using only one of these techniques, you will sometimes get a different result than a browser would.
Browsers do it this way:
- A charset declared in a meta tag (or an XML declaration) always takes precedence
- The encoding from the HTTP header is used when no charset is declared in a meta tag
- If the encoding is not declared at all, then it is time for encoding detection (see the sketch after this list).
(Well... at least that is the way I believe most browsers do it. Documentation is really scarce.)
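To make that precedence concrete, here is a rough, untested sketch in Python of how I imagine such a check could work. The function name, the regexes, and the 2048-byte limit are my own choices, not anything a browser documents; a real implementation would need a proper HTML parser and a full list of legal charset declarations.

```python
import re

try:
    import chardet  # optional; only used as a last resort
except ImportError:
    chardet = None

# Very rough patterns for a charset in a meta tag or an XML declaration.
META_RE = re.compile(br'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)
XML_RE = re.compile(br'<\?xml[^>]+encoding=["\']([A-Za-z0-9_\-]+)["\']', re.IGNORECASE)

def guess_encoding(raw_bytes, content_type_header=None):
    # 1. A charset declared in the document itself (meta tag or XML declaration)
    for pattern in (META_RE, XML_RE):
        match = pattern.search(raw_bytes[:2048])
        if match:
            return match.group(1).decode('ascii')
    # 2. A charset from the HTTP Content-Type header,
    #    e.g. "text/html; charset=utf-8"
    if content_type_header and 'charset=' in content_type_header:
        return content_type_header.split('charset=')[-1].split(';')[0].strip()
    # 3. Nothing declared: fall back to statistical detection
    if chardet is not None:
        detected = chardet.detect(raw_bytes)
        if detected['encoding']:
            return detected['encoding']
    # 4. Last resort
    return 'utf-8'
```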
What I'm looking for is a library that can decide the character set of a page the way a browser would. I'm sure I'm not the first who needs a proper solution to this problem.
Solution (I have not tried it yet...)
According to Beautiful Soup's documentation (a short usage sketch follows the list):
Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:
- An encoding you pass in as the fromEncoding argument to the soup constructor.
- An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
- An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.
- UTF-8
- Windows-1252
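If that is accurate, then simply handing the raw bytes to the constructor should already give browser-like behavior. A minimal usage sketch, assuming Beautiful Soup 3 (the version whose docs describe fromEncoding) on Python 2, with chardet installed and example.com as a stand-in URL:

```python
import urllib2
from BeautifulSoup import BeautifulSoup

raw = urllib2.urlopen('http://example.com/').read()

# Let the detection cascade described above run on the raw bytes.
soup = BeautifulSoup(raw)
print(soup.originalEncoding)   # e.g. 'utf-8' or 'windows-1252'

# If you already know the charset (e.g. from the HTTP header), you can force it:
soup = BeautifulSoup(raw, fromEncoding='iso-8859-1')
```

soup.originalEncoding reports which encoding the cascade ended up using, so you can at least compare it against what the HTTP header claimed.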