Consider the following problem:
A multi-line string $junk contains some lines that are encoded in UTF-8 and some in ISO-8859-1. I don't know a priori which lines are in which encoding, so heuristics will be needed.
I want to turn $junk into pure UTF-8, with the ISO-8859-1 lines properly re-encoded. Also, if errors occur during processing, I want a "best effort result" rather than a thrown error.
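To make the problem concrete, here is a small made-up input of the kind I mean. The strings are purely hypothetical and only illustrate the mix of encodings; $wanted is the byte string I would hope to get back.

```perl
use strict;
use warnings;

# Hypothetical sample: the same "e acute" appears once as UTF-8
# (bytes 0xC3 0xA9) and once as ISO-8859-1 (byte 0xE9).
my $junk = join "\n",
    "caf\xC3\xA9",     # UTF-8 encoded line
    "caf\xE9",         # ISO-8859-1 encoded line
    "plain ASCII";     # identical in both encodings

# Desired result as UTF-8 bytes: both accented lines use 0xC3 0xA9.
my $wanted = join "\n",
    "caf\xC3\xA9",
    "caf\xC3\xA9",
    "plain ASCII";
```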
My current attempt looks like this:
    $junk = force_utf8($junk);

    sub force_utf8 {
        my $input = shift;
        my $output = '';
        foreach my $line (split(/\n/, $input)) {
            if (utf8::valid($line)) {
                utf8::decode($line);
            }
            $output .= "$line\n";
        }
        return $output;
    }
Obviously the conversion will never be perfect since we're lacking information about the original encoding of each line. But is this the "best effort result" we can get?
How would you improve the heuristics/functionality of the force_utf8(...) sub?
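
One direction I have been considering, as a rough sketch rather than a finished solution: as far as I understand the utf8 documentation, utf8::valid only checks that the string is internally consistent, not that its bytes form valid UTF-8, so the real decision in my attempt is made by utf8::decode. The variant below makes that decision explicit by decoding each line strictly with Encode and falling back to ISO-8859-1 only when strict decoding fails. The name force_utf8_strict is made up for this example, and I am assuming the input is a byte string and that UTF-8 bytes are the desired output.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical variant of force_utf8: takes a byte string, returns UTF-8 bytes.
sub force_utf8_strict {
    my ($input) = @_;
    my @decoded;

    # A limit of -1 keeps trailing empty fields, so trailing newlines survive.
    for my $line (split /\n/, $input, -1) {
        # Decode a copy, because decode() may modify its source when CHECK is set.
        my $bytes = $line;
        my $chars = eval { decode('UTF-8', $bytes, Encode::FB_CROAK()) };

        # Strict decoding failed: assume ISO-8859-1, which accepts any byte.
        $chars = decode('ISO-8859-1', $line) unless defined $chars;

        push @decoded, $chars;
    }

    # Re-encode the character strings as UTF-8 bytes.
    return encode('UTF-8', join("\n", @decoded));
}
```

This still cannot be perfect: an ISO-8859-1 line whose bytes happen to form valid UTF-8 (for example the two characters 0xC3 0xA9) would be treated as UTF-8. I would be interested in whether there is a better heuristic for such cases.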