ansaurus

Question

PHP function iconv character encoding from iso-8859-1 to utf-8

Answer 1

+1 A:

iso-8859-1 doesn't contain the € sign, so a string that contains it cannot be interpreted as iso-8859-1. Use iso-8859-15 instead.

Raoul Duke 2010-09-02 14:49:25

Answer 2

+1 A:

Those 2 characters are illegal in iso-8859-1 (did you mean iso-8859-15?)

$ php -r 'echo iconv("utf-8","iso-8859-1//TRANSLIT","ter € and • the");'
ter EUR and o the

Wrikken 2010-09-02 14:49:33

Answer 3

+4 A:

I think the encoding you are looking for is Windows code page 1252 (Western European). It is not the same as ISO-8859-1 (or 8859-15 for that matter); the characters in the range 0xA0-0xFF match 8859-1, but cp1252 adds an assortment of extra characters in the range 0x80-0x9F where ISO-8859-1 assigns little-used control codes.

The confusion comes about because when you serve a page as text/html;charset=iso-8859-1, for historical reasons, browsers actually use cp1252 (and will hence submit forms in cp1252 too).

iconv('cp1252', 'utf-8', "\x80 and \x95")
-> "\xe2\x82\xac and \xe2\x80\xa2"
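
As a minimal runnable sketch of the same conversion (the sample string and the variable names are only illustrative, and it assumes your iconv build knows the name CP1252), checking the return value before using it:

```php
<?php
// Bytes 0x80 and 0x95 are the euro sign and the bullet in cp1252.
$cp1252 = "ter \x80 and \x95 the";

// iconv() returns false on failure, so check before using the result.
$utf8 = iconv('CP1252', 'UTF-8', $cp1252);
if ($utf8 === false) {
    throw new RuntimeException('Conversion from cp1252 failed');
}

// Compare against the expected UTF-8 byte sequences.
var_dump($utf8 === "ter \xe2\x82\xac and \xe2\x80\xa2 the"); // bool(true)
```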

bobince 2010-09-02 15:07:05

Thank you, bobince! Now it works. I have another question: how can I check whether sites that declare text/html;charset=iso-8859-1 are really in cp1252 (as you explained in the answer)?

albertopriore 2010-09-03 11:24:35

If you see a byte in the range 0x80–0x9F, you are almost certainly looking at cp1252 rather than 8859-1, since the ‘C1 control codes’ are very rarely used (almost never, on the web). If the source of the “ISO-8859-1” string is web-based, it almost certainly means it's really cp1252, since that's what browsers use.
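
A rough sketch of that byte check, assuming the input is already known to be in some single-byte Western encoding (in UTF-8 the same byte values occur as continuation bytes, so the test would mean nothing there). The function name is hypothetical:

```php
<?php
// Hypothetical helper: true if the string contains any byte in 0x80-0x9F,
// the range where cp1252 has printable characters and ISO-8859-1 has C1 controls.
function has_c1_range_bytes(string $bytes): bool
{
    // No /u modifier: the pattern is matched against raw bytes.
    return preg_match('/[\x80-\x9F]/', $bytes) === 1;
}

var_dump(has_c1_range_bytes("price: \x80 5")); // bool(true): 0x80 is the euro sign in cp1252
var_dump(has_c1_range_bytes("caf\xe9"));       // bool(false): 0xE9 is the same in both encodings
```

A false result does not prove the text is ISO-8859-1; it only means the two encodings agree on every byte present.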

bobince 2010-09-03 21:43:14

I tried mb_detect_encoding($string, 'cp1252') and then, with the same string, mb_detect_encoding($string, 'ISO-8859-1'). The first returns false; the second reports that it is an ISO-8859-1 string, but it isn't. How can I check the charset with certainty?

albertopriore 2010-09-06 10:41:32

You can't make a certain charset check at all. Absolutely any sequence of bytes is a valid ISO-8859-1 string, and most single-byte encodings also map all or most bytes to valid characters. Only multi-byte encodings like UTF-8, where there are many invalid byte sequences, offer any realistic chance of ruling them out. So really you can only go on balance of probabilities, and the balance of probabilities when pitting cp1252 against ISO-8859-1 for text that's come from the web is always cp1252.
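
One way to act on that balance of probabilities, as a sketch rather than a definitive method: rule UTF-8 in or out first, since that check is reliable, and otherwise assume cp1252. The function name is hypothetical, and it assumes the mbstring and iconv extensions are available:

```php
<?php
// Hypothetical helper: normalise web text of unknown encoding to UTF-8.
function to_utf8_guess(string $bytes): string
{
    // UTF-8 has many invalid byte sequences, so passing this check is strong evidence.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // Otherwise guess cp1252. Some iconv builds reject the few bytes cp1252
    // leaves undefined (such as 0x81), so suppress the notice and check for false.
    $converted = @iconv('CP1252', 'UTF-8', $bytes);
    if ($converted === false) {
        // Last resort: every byte sequence is valid ISO-8859-1.
        $converted = iconv('ISO-8859-1', 'UTF-8', $bytes);
    }
    return $converted;
}

var_dump(to_utf8_guess("\x80 5") === "\xe2\x82\xac 5");     // bool(true): treated as cp1252
var_dump(to_utf8_guess("caf\xc3\xa9") === "caf\xc3\xa9");   // bool(true): already valid UTF-8
```

This is still a guess: a cp1252 string whose bytes happen to form valid UTF-8 would be left unconverted, although that is rare in real text.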

bobince 2010-09-06 21:59:32

thanks a lot bobince

albertopriore 2010-09-07 08:02:29
