I'm trying to convert a string from ISO-8859-1 to UTF-8. But when it contains either of the two characters € and •, the function returns a character that looks like a square with two numbers inside.

How can I solve this issue?

+1  A: 

ISO-8859-1 doesn't contain the € sign, so a string that contains it cannot be interpreted as ISO-8859-1. Use ISO-8859-15 instead.

Raoul Duke
+1  A: 

Those 2 characters are illegal in iso-8859-1 (did you mean iso-8859-15?)

$ php -r 'echo iconv("utf-8","iso-8859-1//TRANSLIT","ter € and • the");'
ter EUR and o the
Wrikken
+4  A: 

I think the encoding you are looking for is Windows code page 1252 (Western European). It is not the same as ISO-8859-1 (or 8859-15 for that matter); the characters in the range 0xA0-0xFF match 8859-1, but cp1252 adds an assortment of extra characters in the range 0x80-0x9F where ISO-8859-1 assigns little-used control codes.

The confusion comes about because when you serve a page as text/html;charset=iso-8859-1, for historical reasons, browsers actually use cp1252 (and will hence submit forms in cp1252 too).

iconv('cp1252', 'utf-8', "\x80 and \x95")
-> "\xe2\x82\xac and \xe2\x80\xa2"
bobince
Thank you bobince! Now it works. I have another question: how can I check whether a site that is served as text/html;charset=iso-8859-1 is really in cp1252, as you explained in the answer?
albertopriore
If you see a byte in the range 0x80–0x9F, you are almost certainly looking at cp1252 rather than 8859-1, since the ‘C1 control codes’ are very rarely used (almost never, on the web). If the source of the “ISO-8859-1” string is web-based, it almost certainly means it's really cp1252, since that's what browsers use.
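A minimal sketch of that check, assuming the input is already known to be single-byte Western text. The function name guessLatinEncoding is hypothetical, not part of PHP or any library:

```php
<?php
// Hypothetical helper: choose between cp1252 and ISO-8859-1 for a single-byte string.
// A byte in 0x80-0x9F is a C1 control code in ISO-8859-1 but usually a printable
// character (€, •, curly quotes, ...) in cp1252, so its presence points to cp1252.
function guessLatinEncoding(string $bytes): string
{
    // No /u modifier: the pattern is matched byte by byte.
    return preg_match('/[\x80-\x9F]/', $bytes) === 1 ? 'CP1252' : 'ISO-8859-1';
}

$input = "\x80 and \x95";                 // "€ and •" as cp1252 bytes
$encoding = guessLatinEncoding($input);   // 'CP1252'
echo bin2hex(iconv($encoding, 'UTF-8', $input)), "\n";
// prints e282ac20616e6420e280a2
```

Note that cp1252 leaves a few bytes unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D), so iconv may return false for input containing them; this is a guess based on probabilities, not a proof.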
bobince
I tried mb_detect_encoding($string, 'cp1252'); and then, with the same string, mb_detect_encoding($string, 'ISO-8859-1');. The first returns false; the second reports that it is an ISO-8859-1 string, but it isn't. How can I check the charset with certainty?
albertopriore
You can't make a certain charset check at all. Absolutely any sequence of bytes is a valid ISO-8859-1 string, and most single-byte encodings also map all or most bytes to valid characters. Only multi-byte encodings like UTF-8, where there are many invalid byte sequences, offer any realistic chance of ruling them out. So really you can only go on balance of probabilities, and the balance of probabilities when pitting cp1252 against ISO-8859-1 for text that's come from the web is always cp1252.
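One way to act on that balance of probabilities, as a rough sketch: accept the input as UTF-8 if it validates, and otherwise assume cp1252. The helper name toUtf8 is hypothetical, and the cp1252 fallback is an assumption that only makes sense for text that came from the web:

```php
<?php
// Hypothetical helper: normalise bytes of unknown origin to UTF-8.
function toUtf8(string $bytes): string
{
    // UTF-8 has many invalid byte sequences, so passing this check is strong
    // evidence that the input really is UTF-8 (pure ASCII also passes).
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Assumption: anything else is cp1252, the usual case for web text.
    $converted = iconv('CP1252', 'UTF-8', $bytes);
    if ($converted === false) {
        // iconv could not map some byte, e.g. one that cp1252 leaves unassigned.
        throw new InvalidArgumentException('Input could not be converted from cp1252');
    }
    return $converted;
}

echo bin2hex(toUtf8("\x80 and \x95")), "\n";
// prints e282ac20616e6420e280a2
```

This needs the mbstring and iconv extensions. It cannot tell cp1252 apart from other single-byte encodings; it only rules UTF-8 in or out.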
bobince
thanks a lot bobince
albertopriore