198 views · 4 answers

I've looked across the web, through SO, through the PHP documentation, and more.

It seems like a ridiculous problem not to have a standard solution to. If you get text in an unknown character set, and it contains strange characters (like English curly quotes), is there a standard way to convert it to UTF-8?

I've seen many messy solutions that use a plethora of functions and checks, and none of them is guaranteed to work.

Has anyone come up with their own function or a solution that always works?


EDIT

Many people have answered saying "it is not solvable" or something of that nature. I understand that now, but none has offered any sort of solution that actually worked, besides utf8_encode, which is very limited. What methods ARE out there to deal with this? What is the best one?

+11  A: 

No. One should always know what character set a string is in. Guessing the character set with a sniffing function is unreliable (although in the western world it's usually a mix-up between ISO-8859-1 and UTF-8).

But why do you have to deal with unknown character sets? There is no general solution for this because the general problem shouldn't exist in the first place. Every web page and data source can and should have a character set definition, and if one doesn't, one should request the administrator of that resource to add one.

(Not to sound like a smartass, but that is the only way to deal with this well.)

Pekka
RSS feeds are a common example of why this needs to be done. So are people uploading files, or copying and pasting from a variety of editors that use different character sets on their computers.
Kerry
@Kerry it's true for RSS feeds, but behind every feed there *is* an administrator who should be doing their job. About users copy+pasting: Point taken. That is a real scenario where it's sometimes impossible to define the encoding.
Pekka
Copy-pasting text into a form is not an issue, because the browser wouldn't know how to show the text anyway if it were - it's converted to Unicode when placed on the clipboard, and the browser knows to convert the text to whatever encoding it's supposed to send. RSS feeds also should not be a problem, due to the XML prolog - but if that's missing, then the feed will probably also fail in many other places, unless the encoding is UTF-8 or UTF-16.
Michael Madsen
@Michael yeah, I had the same afterthought about the clipboard. It is conceivable however that things get garbled sometimes, with content taken from wrongly encoded web sites, external embeds in a web site in a different encoding... Still, copy+pasting may not be as serious an issue as I initially thought.
Pekka
I can't rely on the admins of other RSS feeds to maintain their feeds well - counting on that would be a big mistake. I did not know that about the clipboard, which is good to hear, but I would still like an end-all solution of some kind.
Kerry
@Kerry I understand your situation with the feeds, but if an RSS feed does not declare its character encoding correctly, it is *broken*. If you can't get the administrators to get it right, keeping a manual correction table (feed A is UTF-8, feed B is chinese simplified....) may be the easiest way to go.
Pekka
I'm going to take for granted that it's English in this case and simply translate non-UTF-8 characters to UTF-8. That's the best option I've seen so far.
Kerry
@Kerry how are you going to do that when you don't know the encoding?
Pekka
Use regular expressions to catch anything outside of the normal range of UTF-8, parse through them in PHP using the `ord()` function, and use an ASCII table to either remove them or translate them as best I can. If something does not match a character I can convert, it will be removed.
Kerry
@Kerry wouldn't it be easier to manually switch between encodings until the contents make sense?
Pekka
Quite possibly, but I don't know how to do that. Any effort I've made to change the encoding has ended up with a blank string or more unreadable characters, and a blank string is far less acceptable than strange characters.
Kerry
+1  A: 

Pekka is right about the unreliability, but if you need a solution and are willing to take the risk, and you have the mbstring library available, this snippet should work:

function forceToUtf8($string) {
    // Detect the source encoding strictly; mb_detect_encoding() returns
    // false when none of the candidate encodings matches the bytes.
    $encoding = mb_detect_encoding($string, mb_detect_order(), true);
    if ($encoding === false) {
        return false;
    }
    return mb_convert_encoding($string, 'UTF-8', $encoding);
}
Dereleased
I think that could be a great solution for many (though I'd change that `return false` to `return $string`), but it didn't work for me.
Kerry
mb_detect_encoding only recognizes a small set of encodings - UTF-8, UTF-7, ASCII, and a bunch of Japanese encodings. It won't work for most encodings out there.
Michael Madsen
+6  A: 

The reason you've seen so many complicated solutions for this problem is that, by definition, it is not solvable. Decoding a byte stream without metadata is ambiguous: it is possible to construct different combinations of text and encodings that result in the same byte stream. Therefore, strictly logically speaking, it is not possible to determine the encoding, the character set, and the text from a byte stream alone.

In reality, it is possible to achieve results that are "close enough" using heuristic methods, because there is a finite set of encodings that you'll encounter in the wild, and with a large enough sample a program can determine the most likely encoding. Whether the results are good enough depends on the application.

I do want to comment on the question of user-generated data. All data posted from a web page has a known encoding (the POST comes with an encoding that the developer has defined for the page). If a user pastes text into a form field, the browser will interpret the text based on the encoding of the source data (as known by the operating system) and the page encoding, and transcode it if necessary. It is too late to detect the encoding on the server - the browser may already have modified the byte stream based on the assumed encoding.

For instance, if I type the letter Ä on my German keyboard and post it on a UTF-8 encoded page, there will be 2 bytes (xC3 x84) sent to the server. This is a valid EBCDIC string that represents the letters C and d. It is also a valid ANSI (Windows-1252) string that represents the two characters Ã and „. It is, however, not possible, no matter what I try, to paste an ANSI-encoded string into a browser form and have it interpreted as UTF-8 - because the operating system knows that I am pasting ANSI (I copied the text from TextPad, where I created an ANSI-encoded text file) and will transcode it to UTF-8, resulting in the byte stream xC3 x83 xE2 x80 x9E.

My point is that if a user manages to post garbage, it is arguably because it was already garbage at the time it was pasted into the browser form - because the client did not have proper support for the character set, the encoding, whatever. Since decoding is ambiguous, you cannot expect there to be a trivial method to recover from such a situation.

Unfortunately, for uploaded files the problem remains. The only reliable solution that I see is to show the user a section of the file and ask if it was interpreted correctly, and cycle through a bunch of different encodings until this is the case.

Or we could develop a heuristic method that looks at the occurrence of certain characters in various languages. Say I uploaded my text file that contains the two bytes xC3 x84. There is no other information - just two bytes in the file. This method could find out that the letter Ä is fairly common in German text, but the letters Ã and „ together are uncommon in any language, and thus determine that the encoding of my file is indeed UTF-8. This is roughly the level of complexity that such a heuristic method has to deal with, and the more statistical and linguistic facts it can use, the more reliable its results will be.

cdonner
Thank you for this post -- I've updated my question and I feel that you may have the best answer. You gave me an idea of how it could possibly work; is there any sort of script or function out there that deals with this?
Kerry
There is no generic solution. It depends on your situation. If you can reduce the problem in some form or another, by limiting the number of encodings or the languages for instance, there may be one. Look at this post, for example: http://stackoverflow.com/questions/805418/how-to-find-encoding-of-a-file-in-unix-via-scripts Every suggested solution seems to have limitations.
cdonner
It looks like I'll have to do some sort of find-and-replace on anything that's outside of the range. I am restricting it to the English language.
Kerry
A: 

If I'm not mistaken, there is something called utf8_encode... it works well EXCEPT if the string is already in UTF-8

http://php.net/manual/en/function.utf8-encode.php

Fire-Dragon-DoL
Yeah, I tried it; it also returns an empty string if it fails
Kerry
According to the manual, `utf8_encode` works for ISO-8859-1 strings only, so it's not really helpful for situations of unknown encodings.
Pekka