Hi,

Can PHP convert strings in any character encoding to UTF-8?

Solutions that don't work:

  1. utf8_encode($string) - but it only encodes an ISO-8859-1 string to UTF-8.
  2. iconv($incharset, $outcharset, $text) - but how can I find the string's current encoding? (This seems possible only when the string is part of an HTML DOM document, not for a bare string.)

thanks

+1  A: 

In general, you cannot know which encoding a given string is using.

All you can do is guess. There's mb_detect_encoding(), which doesn't work very well, and then there are more complex heuristics, such as those used by browsers, which rely on language cues.
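To illustrate why this is only a guess, here is a minimal sketch of strict-mode detection. The helper name guessEncoding() and the candidate list are assumptions made for this example. Note that every byte sequence is valid ISO-8859-1, so with these two candidates the function can really only rule UTF-8 out.

```php
<?php
// Candidate list is an assumption for this sketch; detection is a guess, not a fact.
$candidates = ['UTF-8', 'ISO-8859-1'];

// Hypothetical helper: returns an encoding name from $candidates, or false.
function guessEncoding(string $bytes, array $candidates)
{
    // Strict mode (third argument) rejects candidates in which the bytes are invalid.
    return mb_detect_encoding($bytes, $candidates, true);
}

$utf8   = "caf\xC3\xA9"; // "café" encoded as UTF-8
$latin1 = "caf\xE9";     // "café" encoded as ISO-8859-1

var_dump(guessEncoding($utf8, $candidates));
// A lone 0xE9 byte is not valid UTF-8, so only ISO-8859-1 remains.
var_dump(guessEncoding($latin1, $candidates));
```

The UTF-8 sample is also a valid ISO-8859-1 byte sequence (it would read "cafÃ©"), which is exactly the ambiguity that makes detection unreliable.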

Artefacto
+5  A: 

It is possible to convert a string from any encoding supported by iconv() into UTF-8 in PHP.
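A minimal sketch of such a conversion, assuming the source encoding is already known. The wrapper name toUtf8() is made up for this example; only iconv() itself comes from PHP.

```php
<?php
// Hypothetical helper: the caller must supply the source encoding.
function toUtf8(string $text, string $fromCharset): string
{
    $converted = iconv($fromCharset, 'UTF-8', $text);
    if ($converted === false) {
        // iconv() returns false when the input is invalid or the charset is unknown.
        throw new RuntimeException("Cannot convert from $fromCharset to UTF-8");
    }
    return $converted;
}

$latin1 = "caf\xE9";                  // "café" encoded as ISO-8859-1
$utf8   = toUtf8($latin1, 'ISO-8859-1');
echo bin2hex($utf8), "\n";            // 636166c3a9
```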

but how can I find the string's current encoding?

You should never need to "find" the current encoding: your script should always know what it is. Any resource you query, if properly served, will state its encoding in the Content-Type header or through other means.
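For the Content-Type case, reading the charset could look roughly like this. The function name charsetFromContentType() and the regular expression are assumptions for this sketch, and it expects the header value as a plain string.

```php
<?php
// Hypothetical helper: pull the charset parameter out of a Content-Type header value.
function charsetFromContentType(string $header): ?string
{
    // Accepts both charset=utf-8 and charset="utf-8"; the match is case-insensitive.
    if (preg_match('/;\s*charset\s*=\s*"?([^\s;"]+)"?/i', $header, $m)) {
        return strtoupper($m[1]);
    }
    return null; // the header carries no charset parameter
}

var_dump(charsetFromContentType('text/html; charset=iso-8859-1')); // string(10) "ISO-8859-1"
var_dump(charsetFromContentType('text/plain'));                    // NULL
```

The returned name can then be passed to iconv() as the input charset.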

As Artefacto says, there is the possibility of using mb_detect_encoding(), but this is not a reliable method. The program's data flow should always define which encoding each string is in (and preferably work with UTF-8 internally) - that's the way to go.

Pekka
@Pekka. Just curious, if the content-type isn't specified, what encoding should we assume it is and fall back on?
Ben
@Ben I'd say still ISO-8859-1 because the number of its users is probably still the greatest. But it would be a gross misconfiguration for a server not to return *any* character set info.
Pekka
@Ben: some value might be gained with `mb_detect_encoding`, but it is by no means reliable.
Wrikken
Ben
CP-1252 is often mislabeled as ISO-8859-1. If in doubt, it may be a good idea to use `mb_detect_encoding` to distinguish between those two.
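A rough sketch of why the distinction matters. The two encodings differ only in the bytes 0x80 to 0x9F, so the check below is a plain byte-range test rather than mb_detect_encoding. The name looksLikeCp1252() is made up here, and the charset name 'CP1252' is assumed to be accepted by the local iconv build.

```php
<?php
// Heuristic, not a guarantee: ISO-8859-1 maps 0x80-0x9F to rarely used control
// characters, so such bytes in "ISO-8859-1" text hint at CP-1252.
function looksLikeCp1252(string $bytes): bool
{
    return preg_match('/[\x80-\x9F]/', $bytes) === 1;
}

$bytes = "\x80 100"; // 0x80 is the euro sign in CP-1252

echo bin2hex(iconv('CP1252', 'UTF-8', $bytes)), "\n";     // e282ac20313030
echo bin2hex(iconv('ISO-8859-1', 'UTF-8', $bytes)), "\n"; // c28020313030
var_dump(looksLikeCp1252($bytes));                        // bool(true)
```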
troelskn
@Ben text files are indeed an exception because they have no means of specifying their encoding except for the (optional) BOM. Having to work with text files with unspecified encoding is indeed nasty.
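If a BOM is present, checking for it is straightforward. A minimal sketch, with the hypothetical function name encodingFromBom(); it returns null when the file has no BOM, in which case the encoding is still unknown.

```php
<?php
// Hypothetical helper: map a leading byte-order mark to an encoding name, or null if none.
function encodingFromBom(string $bytes): ?string
{
    // The 4-byte UTF-32 marks come first, because FF FE is a prefix of FF FE 00 00.
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $encoding) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null; // no BOM found
}

var_dump(encodingFromBom("\xEF\xBB\xBFhello")); // string(5) "UTF-8"
var_dump(encodingFromBom("hello"));             // NULL
```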
Pekka