I'm having some issues with using PHP to convert ISO-8859-1 database content to UTF-8. I am running the following code to test:

// Connect to a latin1 charset database 
// and retrieve "Georgia O’Keeffe", which contains a "’" character
$connection = mysql_connect('*****', '*****', '*****');
mysql_select_db('*****', $connection);
mysql_set_charset('latin1', $connection);
$result = mysql_query('SELECT notes FROM categories WHERE id = 16', $connection);
$latin1Str = mysql_result($result, 0);
$latin1Str = substr($latin1Str, strpos($latin1Str, 'Georgia'), 16);

// Try to convert it to UTF-8
$utf8Str = iconv('ISO-8859-1', 'UTF-8', $latin1Str);

// Output both
var_dump($latin1Str);
var_dump($utf8Str);

When I view the output in Firefox's source view, with Firefox's encoding setting set to "Western (ISO-8859-1)", I get this:

[screenshot of the output]

So far, so good. The first output contains that weird quote and I can see it correctly because it's in ISO-8859-1 and so is Firefox.

After I change Firefox's encoding setting to "UTF-8", it looks like this:

[screenshot of the output]

Where did the quote go? Wasn't iconv() supposed to convert it to UTF-8?

+2  A: 

U+2019 RIGHT SINGLE QUOTATION MARK is not a character in ISO-8859-1. It is a character in windows-1252, as 0x92. The actual ISO-8859-1 character 0x92 is a rarely-used C1 control character called "Private Use 2".

It is very common to mislabel Windows-1252 text data with the charset label ISO-8859-1. Many web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 in order to accommodate such mislabeling, but this is not standard behaviour, and care should be taken to avoid generating these characters in ISO-8859-1 labeled content.

It appears that this is what's happening here. Change "ISO-8859-1" to "windows-1252".
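
A minimal sketch of the difference, assuming the database hands back the apostrophe as the single byte 0x92 (the hard-coded string below stands in for the query result and is only an illustration):

```php
<?php
// Hypothetical stand-in for the value read from the latin1 column:
// the apostrophe is stored as the single byte 0x92.
$raw = "Georgia O\x92Keeffe";

// Labelled as ISO-8859-1, byte 0x92 becomes the C1 control U+0092,
// which is encoded in UTF-8 as C2 92 and has no visible glyph.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $raw);

// Labelled as windows-1252, byte 0x92 becomes U+2019,
// which is encoded in UTF-8 as E2 80 99.
$asCp1252 = iconv('windows-1252', 'UTF-8', $raw);

var_dump(bin2hex($asLatin1));
var_dump(bin2hex($asCp1252));
```

Comparing the two hex dumps shows that only the bytes for the apostrophe differ; the rest of the string is plain ASCII and is unchanged by either conversion.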

dan04
Wow, I did that and I see the U+2019 in UTF-8 mode! But is it safe to use "windows-1252" to convert a large amount of data from "ISO-8859-1" to "UTF-8"? In other words, will all of the ISO-8859-1 characters still convert correctly?
mattalexx
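Only the bytes 0x80-0x9F are affected: they will come out as windows-1252 printable characters instead of the ISO-8859-1 C1 control characters. But those control characters are almost never used, and every other byte (0x00-0x7F and 0xA0-0xFF) converts identically.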
dan04
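
One way to check this on a given system is to convert every possible byte under both labels and list the ones that disagree. This is only a sketch: the exact result depends on the iconv implementation PHP was built against, and some implementations reject a few bytes in the 0x80-0x9F range that windows-1252 leaves unassigned, which is why the call is wrapped with `@` and may return false.

```php
<?php
// Collect every byte value whose UTF-8 result differs between the two labels.
$differing = array();
for ($byte = 0x00; $byte <= 0xFF; $byte++) {
    $char = chr($byte);
    $fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $char);
    // @ hides the notice iconv may raise for a byte it treats as unassigned
    // in windows-1252; in that case the call returns false.
    $fromCp1252 = @iconv('windows-1252', 'UTF-8', $char);
    if ($fromLatin1 !== $fromCp1252) {
        $differing[] = $byte;
    }
}

printf(
    "%d bytes differ, from 0x%02X to 0x%02X\n",
    count($differing),
    min($differing),
    max($differing)
);
```

If the data contains none of the bytes in the reported range, the two conversions give identical output for it.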