views: 91

answers: 3

Hello.. I have been looking thoroughly through the Web and I cannot seem to find a table with that kind of conversion. The ones I find have mistakes and are not very reliable, so I have looked for an official table or something similar, but unfortunately I haven't found one.. so here I am..

As mentioned in the title, what I want to do is, for instance, know what "Ã±" stands for (this one I already know.. it is "ñ"), and not only for Spanish characters but for others too (I already know the Polish ones).

The main problem is that I have a string in PHP which sometimes comes as, for instance, "eñe" (which is ok) and other times as "eÃ±e".. and in the latter case I should be able to change it to "eñe" so it is readable.. but if it is already ok I do not want to change it. In order to do this I was using the utf8_decode function, but when the string is already readable it will still change the "ñ" into an unreadable character (a white box).. so that is why I cannot always decode the string, and if I use the mb_detect_encoding function I always get "UTF-8" as a response.. which is not very helpful..

Once I know all of the UTF-8 byte sequences written out as, for instance, "Ã±" for "ñ", "Å¹" for "Ź", etc., I plan to write a function which will basically replace one with the other.. which is more or less the same thing utf8_decode does.. unless someone here has a better solution!
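
Something like this rough check is what I have in mind, just to illustrate it (the maybe_fix_mojibake name is made up, it relies on the mbstring extension, and it is only a heuristic, so it can guess wrong):

```php
<?php
// Only "undo" the encoding when the string really looks double-encoded,
// i.e. when re-reading its bytes as ISO-8859-1 still yields valid UTF-8.
function maybe_fix_mojibake($s)
{
    // Re-interpret the UTF-8 characters as single ISO-8859-1 bytes
    // (this is essentially what utf8_decode does).
    $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');

    // If the result changed and is still valid UTF-8, the input was most
    // likely double-encoded ("eÃ±e"), so return the repaired "eñe".
    if ($decoded !== $s && mb_check_encoding($decoded, 'UTF-8')) {
        return $decoded;
    }

    // Otherwise the string was already fine; leave it untouched.
    return $s;
}

echo maybe_fix_mojibake("eÃ±e"); // prints "eñe"
echo maybe_fix_mojibake("eñe");  // unchanged: "eñe"
```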

Thanks in advance! Greetings!

+4  A: 

The problem is that once you have mojibake, there's no reliable way to convert it back to what it was supposed to mean. See this paragraph at Wikipedia for an explanation of the problem:

Consider a text file containing the German word für in the ISO-8859-1 encoding. This file is now opened with a text editor that assumes the input is UTF-8. As the first byte (0x66) is within the range 0x00 to 0x7F, UTF-8 correctly interprets it as an f. The second byte (0xFC) is not a legal value for the start of any UTF-8 encoded character. A text editor could therefore replace the byte with the replacement character symbol to warn the user that something went wrong. The last byte (0x72) is also within the range 0x00 to 0x7F and can be decoded correctly. The whole string now displays like this: f�r.

A poorly-implemented text editor might save the replacement in UTF-8 form; the text file data will then look like this: 0x66 0xEF 0xBF 0xBD 0x72, which will be displayed in ISO-8859-1 as fï¿½r. The replacement also destroys the original byte, making it impossible to recover what character was intended.

You need to avoid interpreting text with the wrong encoding in the first place. Fixing it once it's broken is too late.

deceze
+1 because now I know a new, useful term *mojibake*.
alex
+1 for the same reason as alex
kskjon
Thanks for your post, I didn't know the term mojibake either!
+7  A: 

Why do you want to do this? Do you want to recover corrupted data, or something like that?

It should really not be done as part of the usual business code flow. All you need to do is ensure that every layer of your webapp uses UTF-8 properly: the PHP source files, the HTTP response headers and body, the DB table, the DB connection, et cetera. See also the PHP UTF-8 cheatsheet.
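
For example, a minimal setup along these lines (connection details and the utf8mb4 charset are only illustrative; adapt them to your own stack) keeps the main layers on UTF-8:

```php
<?php
// Make PHP's mbstring functions work in UTF-8.
mb_internal_encoding('UTF-8');

// Declare the encoding of the HTTP response.
header('Content-Type: text/html; charset=UTF-8');

// Use a UTF-8 connection on the database side as well (mysqli shown here;
// PDO takes a charset option in its DSN). Credentials are placeholders.
$db = new mysqli('localhost', 'user', 'password', 'mydb');
$db->set_charset('utf8mb4'); // or 'utf8' on older MySQL versions
```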

If you actually want to do this as a one-time task to recover corrupted data, then it's good to know that the corrupted data in your question indicates UTF-8 data which has incorrectly been stored or displayed as ISO-8859-1. You just need to read the data as ISO-8859-1 and write it as UTF-8. One time. Then do it the right way.
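
For instance, if the damage shows up as text like "Ã±" inside a file that is supposed to be UTF-8, a one-off repair could look roughly like this (file names are placeholders; converting the characters back to single ISO-8859-1 bytes restores the original UTF-8 sequence):

```php
<?php
// One-off repair sketch: the file contains double-encoded UTF-8
// ("Ã±" where "ñ" was meant).
$broken = file_get_contents('corrupted.txt');                  // placeholder path
$fixed  = mb_convert_encoding($broken, 'ISO-8859-1', 'UTF-8'); // undo the extra encoding step
file_put_contents('repaired.txt', $fixed);                     // placeholder path
```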

As evidence, ñ (Unicode character 'LATIN SMALL LETTER N WITH TILDE', U+00F1) is represented in UTF-8 (a multi-byte encoding) by the bytes 0xC3 and 0xB1. When those bytes are decoded using a single-byte encoding like ISO-8859-1, the 0xC3 becomes Ã and the 0xB1 becomes ±. See also the ISO-8859-1 codepage layout.
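
A quick way to see those bytes for yourself (assuming the script itself is saved as UTF-8):

```php
<?php
echo bin2hex("ñ"), "\n";                                     // c3b1
echo mb_convert_encoding("\xC3\xB1", 'UTF-8', 'ISO-8859-1'); // Ã± (each byte decoded separately)
```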

BalusC
Addendum: although targeted at Java EE web developers, you may find [this article](http://balusc.blogspot.com/2009/05/unicode-how-to-get-characters-right.html) useful to learn more about character encodings.
BalusC
Indeed, I have some corrupted files I would like to fix. Plus I also have to fix some encoding problems in the mail client of an application I am developing.. but with your answer and the others, I have been able to think about the solution in other ways. The problem is that whenever I receive Russian and other Cyrillic text (I do use UTF-8, but they do not), the function mb_detect_encoding fails: it says the text is UTF-8 when it is actually KOI8-R.. so maybe a regular expression detecting Cyrillic words would be handy, if anyone happens to have one at hand (I haven't found one yet) :)
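Something along these lines is roughly what I was thinking of (just a guess-based sketch using mbstring, so it can easily misfire, especially with single-byte encodings like KOI8-R):
```php
<?php
$text = "Привет"; // example input; in practice this would be the mail body

if (mb_check_encoding($text, 'UTF-8') && preg_match('/\p{Cyrillic}/u', $text)) {
    // Already valid UTF-8 and contains Cyrillic letters: nothing to convert.
} else {
    // Best-effort guess among an explicit list of candidates. For single-byte
    // encodings almost any byte sequence "fits", so this is unreliable.
    $enc = mb_detect_encoding($text, array('UTF-8', 'KOI8-R', 'Windows-1251'), true);
    if ($enc !== false && $enc !== 'UTF-8') {
        $text = mb_convert_encoding($text, 'UTF-8', $enc);
    }
}
```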
@user Whenever you receive *any* text, you should always *know* its encoding, not guess it. Websites (should) have meta tags that denote the encoding; HTTP traffic in general and emails (should) have headers. It must be possible to automatically and reliably retrieve the encoding any given text is in; guessing just doesn't work reliably enough. If whoever is sending you this text doesn't also provide the information which encoding it is in (documentation saying "this is always KOI8-R" is valid too), report this as a problem to them.
deceze
A: 

Your problem is one of interpretation more than of transcoding. On any modern computer, ñ is normally stored as the bytes 0xC3 0xB1, which is its UTF-8 encoding. If you interpret those bytes (without transcoding) as the older ISO-8859-15 (iso-latin-15) encoding, you'll get 0xC3 = Ã followed by 0xB1 = ±. This is why there is no "table": it is a display problem.

The best thing to do is to avoid iso-latin entirely; it will cause you plenty of problems. The real way to fix your program is to use only UTF-8 everywhere: it will save you a lot of time and headaches.

In the meantime, if you really want the ISO-8859-15 equivalent of your UTF-8 input (which you don't, if you got the above right), you can pass your string to any encoding converter and ask it to convert from UTF-8 to ISO-8859-15. One thing you should be careful of is double transcoding. If you had a UTF-8 string and mistakenly asked for a conversion from ISO-8859-15 to UTF-8, you got a UTF-8 string that actually reads "Ã±", which is the bytes 0xC3 0x83 0xC2 0xB1. To get back the correct UTF-8 string, the answer is the same: ask to convert your mangled string from UTF-8 to ISO-8859-15, which will happily take 0xC3 0x83 and turn it into 0xC3, then 0xC2 0xB1 and turn it into 0xB1, giving you a correct UTF-8 string containing a correct ñ.
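
In PHP, the round trip described above could be sketched with iconv (mb_convert_encoding would work just as well):

```php
<?php
$good = "ñ";                                   // correct UTF-8, bytes 0xC3 0xB1

// The mistaken step: treating the UTF-8 bytes as ISO-8859-15 and converting
// them to UTF-8 once more yields the mangled "Ã±" (bytes 0xC3 0x83 0xC2 0xB1).
$mangled = iconv('ISO-8859-15', 'UTF-8', $good);

// The repair is the same conversion in reverse: UTF-8 back to ISO-8859-15.
$repaired = iconv('UTF-8', 'ISO-8859-15', $mangled);

var_dump($repaired === $good);                 // bool(true)
```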

Especially for PHP and web applications, remember that many computers (and more and more in the future) will send you UTF-8 by default.

Jean
Thanks for your reply too!