Consider the following problem:

A multi-line string $junk contains some lines which are encoded in UTF-8 and some in ISO-8859-1. I don't know a priori which lines are in which encoding, so heuristics will be needed.

I want to turn $junk into pure UTF-8 with proper re-encoding of the ISO-8859-1 lines. Also, in the event of errors in the processing I want to provide a "best effort result" rather than throwing an error.

My current attempt looks like this:

$junk = force_utf8($junk);

sub force_utf8 {
  my $input = shift;
  my $output = '';
  foreach my $line (split(/\n/, $input)) {
    if (utf8::valid($line)) {
      utf8::decode($line);
    }
    $output .= "$line\n";
  }
  return $output;
}

Obviously the conversion will never be perfect since we're lacking information about the original encoding of each line. But is this the "best effort result" we can get?

How would you improve the heuristics/functionality of the force_utf8(...) sub?

+1  A: 

Looking at a single character, it is hard to tell whether it is ISO-8859-1 or UTF-8 encoded. The problem is that both are 8-bit encodings, so simply looking at the most significant bit is not sufficient. For every line, then, I would transcode the line assuming it is UTF-8. When an invalid UTF-8 sequence is found, re-transcode the line assuming it is really ISO-8859-1. The problem with this heuristic is that you might misread ISO-8859-1 lines that also happen to be well-formed UTF-8; however, without external information about $junk there is no way to tell which interpretation is appropriate.
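
A minimal sketch of this try-UTF-8-then-fall-back idea, assuming $junk holds raw bytes and that the core Encode module is available. The name force_utf8_bytes is made up for illustration, and unlike the sub in the question it returns UTF-8 encoded bytes rather than a decoded character string:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the question): takes a byte string with
# mixed UTF-8 / ISO-8859-1 lines and returns UTF-8 encoded bytes.
sub force_utf8_bytes {
    my ($input) = @_;
    my $output = '';
    foreach my $line (split /\n/, $input) {
        # decode() with a CHECK argument may modify its input, so work on a copy.
        my $copy  = $line;
        my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };

        # Not well-formed UTF-8: treat every byte as one ISO-8859-1 character.
        # This decode cannot fail, so the sub never throws on bad input.
        $chars = decode('ISO-8859-1', $line) unless defined $chars;

        $output .= encode('UTF-8', $chars) . "\n";
    }
    return $output;
}
```

For example, force_utf8_bytes("caf\xC3\xA9\ncaf\xE9\n") returns "caf\xC3\xA9\ncaf\xC3\xA9\n": the first line is already valid UTF-8 and passes through, while the lone \xE9 on the second line fails the UTF-8 check and is re-encoded from ISO-8859-1. Like the original sub, this always ends each line with "\n", and the caveat above still applies to ISO-8859-1 lines that happen to be valid UTF-8.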

fbrereto
UTF-8 is *NOT* an 8-bit encoding. It represents commonly used Western characters in 8 bits ("low" or "7-bit" ASCII), but will use multibyte characters if needed.
DaveE
UTF-8 is an 8-bit encoding that is also 100% compatible with 7-bit ASCII. Whether or not it uses all 8 bits for a given character is orthogonal to the point.
fbrereto
No, it is not an 8-bit encoding. While some UTF-8 strings might only consist of characters using 8 bits, any given UTF-8 character in a string can be up to four bytes (32 bits) in size. See en.wikipedia.org/wiki/UTF-8 or tools.ietf.org/html/rfc3629.
DaveE
I see your point; I still think the heuristic will work for all intensive purposes.
fbrereto
<pedantic>It's properly *for all intents and purposes*, e.g. 'whatever your intent or whatever your purpose, this heuristic should work'.</pedantic>
DaveE
+2  A: 

You might be able to fix it up using a bit of domain knowledge. For example, Ã© is not a likely character combination in ISO-8859-1; it is much more likely to be UTF-8 é.

If your input is limited to a restricted pool of characters, you can also use a heuristic such as assuming Ã will never occur in your input stream.

Without this kind of domain knowledge, your problem is in general intractable.

Philip Potter
The code will handle inputs in a wide variety of languages, so enumerating specific translations is unfortunately not an option.
knorv
+1  A: 

Take a look at this article. UTF-8 represents ASCII characters in a single byte, but it is not limited to 8 bits per character. Every byte of a multibyte character has its high bit set: the bit pattern of the lead byte says how many bytes the character uses, and each following byte starts with the bits 10. If you can safely assume only the two encodings occur in your string, the rest should be simple.
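
To illustrate the bit patterns, here is a rough sketch of a hand-rolled check in Perl. Both sub names are invented for this example, and the check is deliberately simplified: it only looks at lead and continuation byte patterns and does not reject overlong encodings or surrogates the way a strict decoder would.

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes announced by a lead byte,
# or 0 if the byte cannot start a UTF-8 sequence.
sub utf8_seq_len {
    my ($byte) = @_;
    return 1 if ($byte & 0x80) == 0x00;    # 0xxxxxxx: plain ASCII
    return 2 if ($byte & 0xE0) == 0xC0;    # 110xxxxx: 2-byte sequence
    return 3 if ($byte & 0xF0) == 0xE0;    # 1110xxxx: 3-byte sequence
    return 4 if ($byte & 0xF8) == 0xF0;    # 11110xxx: 4-byte sequence
    return 0;                              # 10xxxxxx (continuation) or invalid
}

# Hypothetical helper: true if the byte string is structurally
# plausible UTF-8 (simplified, see the caveats above).
sub looks_like_utf8 {
    my ($bytes) = @_;
    my @b = unpack 'C*', $bytes;
    my $i = 0;
    while ($i < @b) {
        my $len = utf8_seq_len($b[$i]) or return 0;
        return 0 if $i + $len > @b;        # sequence truncated
        for my $j (1 .. $len - 1) {
            # Every continuation byte must look like 10xxxxxx.
            return 0 unless ($b[$i + $j] & 0xC0) == 0x80;
        }
        $i += $len;
    }
    return 1;
}
```

With this, looks_like_utf8("caf\xC3\xA9") is true, while looks_like_utf8("caf\xE9") is false because \xE9 announces a three-byte sequence that never arrives. In practice a strict decoder from a module such as Encode is probably the safer choice; the sketch is only meant to show what the patterns look like.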

DaveE
+1  A: 

I have no useful advice to offer except that I would have tried using Encode::Guess first.

Sinan Ünür