Hello,

I have a MySQL database with a UTF-8 column that contains records like "Ð¢ÐµÑ". PHP's mb_detect_encoding() tells me that this is UTF-8. How can I transform this "horror" into something readable?

Thank you

+1  A: 

Output it to a page with the UTF-8 encoding specified. The browser will then show it in readable form.

Andrey
+8  A: 

I'm guessing you've got the byte string "\xd0\xa2\xd0\xb5\xd1", then, which would be the UTF-8 encoded form of the characters Те (plus one following byte which is half a character).

If you merely echo() that on a page that you have declared as being UTF-8, it should display correctly in the browser:

 <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
 ...

 something: <?php echo htmlspecialchars($something); ?>

This naturally also means you will need to save the .php file itself using the UTF-8 encoding, if it has any non-ASCII characters in it. (Many Windows text editors tend not to save as UTF-8 by default, sadly.)

If you must have a non-UTF-8 page, you would have to use iconv() to convert the string to whatever encoding you were using, presumably Windows code page 1251 for Russian ('cp1251'). But I would strongly recommend using UTF-8 for everything all the way through.
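
As a rough illustration of that iconv() route, here is a minimal sketch. The helper name utf8_to_cp1251() is made up for this example, and it assumes the iconv extension is loaded and that the input really is valid UTF-8:

```php
<?php
// Hypothetical helper (not part of PHP): convert a UTF-8 string to cp1251
// so it can be echoed on a page declared as windows-1251.
function utf8_to_cp1251($utf8)
{
    $converted = iconv('UTF-8', 'CP1251', $utf8);
    if ($converted === false) {
        // iconv() returns false on invalid input or unrepresentable characters
        throw new RuntimeException('Could not convert string to cp1251');
    }
    return $converted;
}

$name = "\xd0\xa2\xd0\xb5";            // "Те" encoded as UTF-8 (4 bytes)
$legacy = utf8_to_cp1251($name);       // same text, one byte per character
echo bin2hex($legacy), "\n";           // d2e5
```

The page that echoes $legacy would then need to declare charset=windows-1251 instead of utf-8.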

edit re comment:

if I'm doing mysql_set_charset("utf8", $db) before selecting row - I'm getting this "horror"

mysql_set_charset('utf8') is indeed the right thing to do. Check you are including the meta as above, and that the browser is seeing it (check View->Encoding is UTF-8).

If you are getting Ð¢ÐµÑ even with UTF-8 correctly getting sent, then I'm afraid the current contents of your database are messed up. Perhaps data had been inserted previously without the correct mysql_set_charset call, or maybe you did an SQL import that used the wrong charset.

If this is the case, you're likely going to have to go through each row of the database and ‘fix’ it by using iconv() to convert UTF-8 to ISO-8859-1. This should undo the double-UTF-8-encoding.
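
A minimal sketch of that per-value fix, assuming the damage really is one extra round of ISO-8859-1 to UTF-8 encoding. The function name undo_double_utf8() is hypothetical, and the validity check is an added safety net rather than something the approach strictly needs:

```php
<?php
// Hypothetical helper: undo one round of double UTF-8 encoding.
// Returns the input unchanged when it does not look double-encoded.
function undo_double_utf8($value)
{
    // Map each character back to the single byte it was misread from.
    // @ hides the notice iconv() raises for characters outside ISO-8859-1.
    $bytes = @iconv('UTF-8', 'ISO-8859-1', $value);

    // Accept the result only if it is itself well-formed UTF-8;
    // preg_match() with the /u flag returns false on malformed input.
    if ($bytes === false || preg_match('//u', $bytes) !== 1) {
        return $value;
    }
    return $bytes;
}

$double = "\xc3\x90\xc2\xa2\xc3\x90\xc2\xb5";  // "Ð¢Ðµ" stored as UTF-8
echo bin2hex(undo_double_utf8($double)), "\n";  // d0a2d0b5, i.e. "Те"
```

One caveat: if the bad round trip went through Windows-1252 rather than true ISO-8859-1, bytes in the 0x80-0x9F range will not map back this way and those rows would need separate handling.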

[edit:2]

iconv("UTF-8", "ISO-8859-1", $row['name']) saying Notice: iconv(): Detected an illegal character in input string.

OK, so the input isn't a valid UTF-8 sequence. That could either be because you're not getting UTF-8 out of the database after all, or because a UTF-8 sequence has become truncated. For example your string "\xd0\xa2\xd0\xb5\xd1" (which, read as ISO-8859-1, looks like "Ð¢ÐµÑ"), is not valid, as the final "Ñ" is only half of a two-byte UTF-8 sequence. As UTF-8 in a browser it would render as Те�.

If that's what you have in your database you'll need to fix the data in there before you can proceed.
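
To check whether a given value is in that state, something like the following sketch could be used. The helper name is_valid_utf8() is invented here; if the mbstring extension is loaded, mb_check_encoding($value, 'UTF-8') should serve the same purpose:

```php
<?php
// Hypothetical helper: true when $value is a well-formed UTF-8 byte string.
// With the /u flag, preg_match() returns false if the subject is malformed.
function is_valid_utf8($value)
{
    return preg_match('//u', $value) === 1;
}

var_dump(is_valid_utf8("\xd0\xa2\xd0\xb5"));      // bool(true): complete "Те"
var_dump(is_valid_utf8("\xd0\xa2\xd0\xb5\xd1"));  // bool(false): dangling lead byte
```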

it's ok if I echo $row['name'] without doing mysql_set_charset("utf8", $db)

You haven't confirmed that you are correctly sending UTF-8 and that the browser knows this (by checking View->Encoding), so it's not really meaningful what you see on-screen when you echo(); we can't work out what the original byte string was from that.

Tell us what you see when you echo bin2hex($row['name']);. This will convert each byte in the string into hex digits, so "\xd0\xa2\xd0\xb5\xd1" would come out as d0a2d0b5d1, if that's what you've got.
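
For comparison, this sketch prints the hex dump for three possible states of the same text. The sample byte strings are illustrative, built from the "Те" example above, and are not taken from your database:

```php
<?php
// Illustrative samples: the same two characters in three states.
$samples = array(
    'correct UTF-8'        => "\xd0\xa2\xd0\xb5",
    'truncated UTF-8'      => "\xd0\xa2\xd0\xb5\xd1",
    'double-encoded UTF-8' => "\xc3\x90\xc2\xa2\xc3\x90\xc2\xb5",
);

$hex = array();
foreach ($samples as $label => $bytes) {
    // chunk_split() adds a space after every two hex digits; rtrim() drops the last one
    $hex[$label] = rtrim(chunk_split(bin2hex($bytes), 2, ' '));
    echo $label, ': ', $hex[$label], "\n";
}
```

Pairs starting with d0 or d1 are what properly encoded Cyrillic looks like, while runs of c3 and c2 pairs suggest the data has been encoded twice.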

bobince
Alternatively, you can set the encoding using the header() function, i.e. header('Content-type: text/html; charset=utf-8')
quantumSoup
iconv("UTF-8", "ISO-8859-1", $row['name']) saying Notice: iconv(): Detected an illegal character in input string. But it's ok if I echo $row['name'] without doing mysql_set_charset("utf8", $db) (by default charset is latin1)
Kirzilla