A Perl script that scrapes static HTML pages from a website and writes them to individual files appears to work, but it also prints many instances of `Wide character in print at ./script.pl line n` to the console: one for each page scraped.

However, a brief glance at the generated HTML files does not reveal any obvious mistakes in the scraping. How can I find and fix the problem character(s)? Should I even care about fixing it?

The relevant code:

use WWW::Mechanize;
my $mech = WWW::Mechanize->new;   
...
foreach (@urls) {
    $mech->get($_); 
    print FILE $mech->content;  #MESSAGE REFERS TO THIS LINE
...

This is on OSX with Perl 5.8.8.

+1  A: 

I assume you're crawling images or something of that sort. Either way, you can get around the problem by adding binmode(FILE); or, if they are web pages encoded as UTF-8, try binmode(FILE, ':utf8'). See perldoc -f binmode, perldoc perlopentut, and perldoc PerlIO for more information.

The ":bytes", ":crlf", and ":utf8", and any other directives of the form ":...", are called I/O layers. The "open" pragma can be used to establish default I/O layers. See open.

To mark FILEHANDLE as UTF-8, use ":utf8" or ":encoding(utf8)". ":utf8" just marks the data as UTF-8 without further checking, while ":encoding(utf8)" checks the data for actually being valid UTF-8. More details can be found in PerlIO::encoding.
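
As a minimal sketch of the layer approach described above, the following writes a string containing a character above U+00FF through an encoding layer set directly in the open mode. The helper name `write_utf8` and the temporary file are illustrative assumptions, not part of the original script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write a character string to $path as UTF-8 bytes.
sub write_utf8 {
    my ($path, $text) = @_;
    # Naming the layer in the open mode replaces a separate binmode() call.
    open(my $fh, '>:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
    print {$fh} $text;
    close($fh) or die "Cannot close $path: $!";
    return -s $path;    # size on disk, in bytes
}

# U+263A lies outside Latin-1, so printing it to a handle without an
# encoding layer would trigger "Wide character in print".
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
my $size = write_utf8($path, "<p>smile: \x{263A}</p>\n");
print "wrote $size bytes to $path\n";
```

With the layer in place, the warning should no longer appear, because Perl knows how to turn the characters into bytes for that handle.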

Evan Carroll
@Evan: I am dealing with plain HTML here. Can I do anything to check the files I already have, or is my only option to restart from the beginning with `binmode(FILE, ':utf8')` inserted after `open FILE etc`?
Larry Wang
It isn't your only option; you can do it in one command with `open(my $fh, '>:utf8', ...)`. Many different problems can cause this warning; read [this blog post](http://www.ahinea.com/en/tech/perl-unicode-struggle.html) for the nitty-gritty.
Evan Carroll
@Evan: Thanks for the link, but that article also only deals with how to prevent this problem, rather than how to assess its severity and how to fix it after it has happened, which is the focus of my question.
Larry Wang
+2  A: 

If you want to fix up the files after the fact, then you could pipe them through fix_latin which will make sure they're all UTF-8 (assuming the input is some mixture of ASCII, Latin-1, CP1252 or UTF-8 already).
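
To assess the files already written before deciding whether to repair them, one option is to test whether each file's raw bytes are well-formed UTF-8. When Perl prints a wide character to a handle with no encoding layer, it generally warns and writes the string's UTF-8 form anyway, so the existing files may well pass. The sketch below assumes the file names are given on the command line; `is_valid_utf8` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns 1 if the raw bytes are well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # decode() may modify its input when a CHECK value is given,
    # so decode a copy rather than the caller's data.
    my $copy = $bytes;
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

# Check every file named on the command line.
for my $path (@ARGV) {
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };    # slurp the whole file as bytes
    close($fh);
    printf "%s: %s\n", $path,
        is_valid_utf8($bytes) ? 'valid UTF-8' : 'NOT valid UTF-8';
}
```

A file reported as valid UTF-8 can still be declared with a different charset in its HTML, so this check only answers whether the bytes themselves are consistent.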

For the future, you could use $mech->response->decoded_content, which should give you a decoded character string regardless of what encoding the web server used. Then you would binmode(FILE, ':utf8') before writing to it, to ensure that Perl's internal string representation is converted to UTF-8 bytes on output.
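
A minimal sketch of that approach follows. To stay runnable without network access it builds an HTTP::Response by hand as a stand-in for what `$mech->response` returns; the helper name `save_decoded`, the sample body, and the output file name `page.html` are illustrative assumptions.

```perl
use strict;
use warnings;
use HTTP::Response;

# Stand-in for a fetched page: a Latin-1 response built by hand.
# With WWW::Mechanize, $mech->response returns an HTTP::Response like this.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=ISO-8859-1' ],
    "<p>caf\xE9</p>",                     # \xE9 is e-acute in Latin-1
);

# Hypothetical helper: decode the body and save it as UTF-8.
sub save_decoded {
    my ($res, $path) = @_;
    my $text = $res->decoded_content;     # characters, not raw bytes
    die "Could not decode response body" unless defined $text;
    open(my $fh, '>', $path) or die "Cannot open $path: $!";
    binmode($fh, ':encoding(UTF-8)');     # encode characters on output
    print {$fh} $text;
    close($fh) or die "Cannot close $path: $!";
    return $text;
}

my $text = save_decoded($response, 'page.html');
printf "saved %d characters\n", length $text;
```

The sketch uses ':encoding(UTF-8)' rather than ':utf8' so that the output is checked as well as marked; either layer should silence the warning.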

Grant McLean