ansaurus

Question

Why does HTML::TreeBuilder show mojibake/weird characters in the output?

Answer 1

You did not get an answer for a whole day because your code is a mess in both shape and content, and you did not even bother to reduce your whole program to a minimal test case. MvanGeest's comment on the question also misdiagnoses the problem.

The problem is that the people who wrote Breitbart's CMS are clueless: they insert the NCR &#151; (which is a non-printable character, and perhaps even an invalid character) when they should have simply inserted the character — (U+2014 EM DASH); after all, the document encoding is declared as UTF-8. (One can clearly see that the encoding was supposed to be Windows-1252, where code point 151 (decimal, 0x97) is allocated to the em dash.)
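To see the mismatch in isolation, here is a minimal sketch that assumes the parser hands you the character U+0097 for that NCR. The variable names are made up for illustration and are not taken from the question's program.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: this is what the parser yields for the reference &#151;,
# namely the C1 control character U+0097.
my $from_ncr = chr(0x97);

# Turn the character back into the single byte 0x97 ...
my $byte = encode('ISO-8859-1', $from_ncr);

# ... and read that byte as Windows-1252, where 0x97 is the em dash.
my $fixed = decode('Windows-1252', $byte);

printf "U+%04X -> U+%04X\n", ord($from_ncr), ord($fixed);
# prints "U+0097 -> U+2014"
```

This only covers the single stray character; it says nothing about how a whole document that also contains characters above U+00FF would behave.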

You can work around the incompetence on their part with an explicit decoding/encoding step.

use Encode qw(encode decode);
⋮
my $string_representation = $dom_tree->as_HTML('<>&', ' ', {});
my $octets = encode('UTF-8', decode('Windows-1252', $string_representation));
⋮
# send the correct Content-Type header in your CGI program before printing the HTTP body
print $octets;

daxim 2010-06-10 13:18:40
