ansaurus

Question

Forcing a mixed ISO-8859-1 and UTF-8 multi-line string into UTF-8 in Perl

Answer 1

+1 A:

It is hard to tell from a single character whether it is ISO-8859-1 or UTF-8 encoded. The problem is that both are 8-bit encodings, so simply looking at the most significant bit is not sufficient. For every line, then, I would transcode the line assuming it is UTF-8. When an invalid UTF-8 sequence is found, re-transcode the line assuming that it is really ISO-8859-1. The problem with this heuristic is that you might misread ISO-8859-1 lines that also happen to be well-formed UTF-8; however, without external information about $junk there is no way to tell which interpretation is correct.
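A minimal sketch of that per-line heuristic, assuming $junk holds raw bytes (not an already-decoded Perl string). The helper name decode_line and the sample input are made up for illustration; the Encode functions and flags are the core module's own.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode one line of raw bytes: strict UTF-8 first, ISO-8859-1 as fallback.
sub decode_line {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8;
    # LEAVE_SRC keeps $bytes untouched so the fallback sees the original.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

# Hypothetical input: a Latin-1 line and a UTF-8 line, both spelling "café".
my $junk = "caf\xE9\ncaf\xC3\xA9\n";

# Split into lines (keeping the newlines), decode each, re-encode as UTF-8.
my $fixed = join '', map { encode('UTF-8', decode_line($_)) } split /^/m, $junk;
```

An ISO-8859-1 line whose high bytes happen to form valid UTF-8 sequences will still be taken as UTF-8, which is the limitation described above.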

fbrereto 2010-03-31 17:55:33

UTF-8 is *NOT* an 8-bit encoding. It represents commonly used Western characters in 8 bits ("low" or "7-bit" ASCII), but will use multibyte characters if needed.

DaveE 2010-03-31 20:10:11

UTF-8 is an 8-bit encoding that is also 100% compatible with 7-bit ASCII. Whether or not it uses all 8 bits for a given character is orthogonal to the point.

fbrereto 2010-03-31 22:06:37

No, it is not an 8-bit encoding. While some UTF-8 strings might only consist of characters using 8 bits, any given UTF-8 character in a string can be up to four bytes (32 bits) in size. See en.wikipedia.org/wiki/UTF-8 or tools.ietf.org/html/rfc3629.

DaveE 2010-03-31 23:35:03

I see your point; I still think the heuristic will work for all intensive purposes.

fbrereto 2010-04-01 00:10:54

<pedantic>It's properly *for all intents and purposes*, i.e. 'whatever your intent or whatever your purpose, this heuristic should work'.</pedantic>

DaveE 2010-04-12 21:40:21

Answer 2

+2 A:

You might be able to fix it up using a bit of domain knowledge. For example, Ã© is not a likely character combination in ISO-8859-1; it is much more likely to be UTF-8 é.

If your input is limited to a restricted pool of characters, you can also use a heuristic such as assuming Ã will never occur in your input stream.

Without this kind of domain knowledge, your problem is in general intractable.
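The observation that a byte pair like Ã© is far more likely to be UTF-8 é can also be applied below the line level. The sketch below is one possible variation, not the answerer's code: it assumes every well-formed UTF-8 multibyte sequence really is UTF-8 and treats any other high byte as ISO-8859-1. The name fix_mixed is hypothetical; the byte ranges follow the well-formedness table in RFC 3629.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Well-formed UTF-8 multibyte sequences (byte ranges from RFC 3629).
my $utf8_seq = qr/
      [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Takes raw bytes, returns UTF-8 bytes.
sub fix_mixed {
    my ($bytes) = @_;
    # Keep valid UTF-8 sequences as they are; re-encode every other
    # high byte on the assumption that it is an ISO-8859-1 character.
    $bytes =~ s{ ($utf8_seq) | ([\x80-\xFF]) }
               { defined $1 ? $1 : encode('UTF-8', decode('ISO-8859-1', $2)) }gex;
    return $bytes;
}
```

This handles lines that mix both encodings, but it fails in exactly the case described above: genuine ISO-8859-1 text containing a pair such as Ã© is silently read as UTF-8.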

Philip Potter 2010-03-31 18:10:37

The code will handle inputs in a wide variety of languages, so enumerating specific translations is unfortunately not an option.

knorv 2010-03-31 19:31:45

Answer 3

+1 A:

Take a look at this article. UTF-8 is optimised to represent the most common Western (ASCII) characters in 8 bits, but it is not limited to 8 bits per character. Multibyte characters use fixed bit patterns in their first byte to indicate that they are multibyte and how many bytes the character uses. If you can safely assume only the two encodings in your string, the rest should be simple.
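As a rough illustration of those bit patterns, the following sketch classifies a single byte by its high bits. The function name utf8_lead_length is made up for this example; the patterns themselves come from RFC 3629.

```perl
use strict;
use warnings;

# Classify a byte value (0..255) by its high bits.
# Returns the total sequence length for a lead byte, 0 for a
# continuation byte, and undef for a byte matching no UTF-8 pattern.
# Note: 0xC0, 0xC1 and 0xF5..0xF7 match a lead pattern here but are
# still never valid in UTF-8; a full validator must reject them.
sub utf8_lead_length {
    my ($byte) = @_;
    return 1 if ($byte & 0x80) == 0x00;   # 0xxxxxxx: single-byte ASCII
    return 0 if ($byte & 0xC0) == 0x80;   # 10xxxxxx: continuation byte
    return 2 if ($byte & 0xE0) == 0xC0;   # 110xxxxx: starts a 2-byte sequence
    return 3 if ($byte & 0xF0) == 0xE0;   # 1110xxxx: starts a 3-byte sequence
    return 4 if ($byte & 0xF8) == 0xF0;   # 11110xxx: starts a 4-byte sequence
    return;                               # 11111xxx: not used by UTF-8
}

# Example: 0xC3 is the first byte of the UTF-8 encoding of "é".
my $len = utf8_lead_length(ord "\xC3");
```

A lead byte that is not followed by the announced number of continuation bytes cannot be UTF-8, which is what makes a fallback to ISO-8859-1 detectable.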

DaveE 2010-03-31 18:16:17

Answer 4

+1 A:

I have no useful advice to offer except that I would have tried using Encode::Guess first.
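For reference, a hedged sketch of what trying Encode::Guess per line might look like. It assumes the module's default suspects (ascii, utf8, and BOM-marked UTF-16/32) and treats a failed guess as ISO-8859-1; the helper name guess_line is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Encode::Guess;    # exports guess_encoding()

# Takes the raw bytes of one line, returns a decoded Perl string.
sub guess_line {
    my ($bytes) = @_;
    my $enc = guess_encoding($bytes);
    # guess_encoding() returns an encoding object on success and a
    # plain error string otherwise; on failure, assume ISO-8859-1.
    return ref $enc ? $enc->decode($bytes) : decode('ISO-8859-1', $bytes);
}

# Hypothetical usage on a Latin-1 line:
my $utf8_bytes = encode('UTF-8', guess_line("caf\xE9"));
```

The module's documentation warns that single-byte encodings such as the ISO-8859 family are hard to guess, so ISO-8859-1 is used here only as the fallback rather than passed as a suspect. The result is still a heuristic with the same ambiguity as any other approach.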

Sinan Ünür 2010-03-31 22:19:05
