I know this sounds really silly, but what character encoding should I use for something that looks like this in UTF-8?

�� Ã�¼Ã��Ã�½Ã�±Ã�¼Ã�Â

The website is in English. This is user-generated content that is stored in a database using the utf8_general_ci collation and displayed on the screen. I just want to display it properly. What do I have to do?

OK, the original text was something like this:

I αм iиvisibłє łiкє αiя--- I αм αs iмρøяŧαиŧ αs øxygєи--- I αм łiviиg iи ŧЋє wøяłd øƒ мy dяєαмz I αм αłwαys ŧЋєяє ŧø Ћєłρ øŧЋєяz--- I αм busy buŧ иєvєя igиøяє αиy øиє I αм ŧЋє øиє wЋø cαяєz--- I łøvє ŧø sєє øŧЋєя łαugЋiиg I αм ŧЋє øиє wЋø bøяяøw øŧЋєяz søяяøw I αм ŧЋє øиє wЋøz иαugЋŧy buŧ иicє I αм łøsŧ iи мy ŧЋøugЋŧs--- I łøvє ŧø ŧαłк--- I łøvє ŧø sЋαяє--- I αм яєαdy ŧø gø αиy wЋєяє--- I łøvє ŧø ƒły buŧ døи’ŧ Ћαvє wiиgs— I wαиŧ ŧøø ŧøucЋ ŧЋє sкy łiмiŧs--- I αм єvił buŧ иøŧ dєvił--- I иєvєя ƒøłłøw αиy ŧяєиd--- I αм ƒuиłøviиg--- suм ŧiмє łøvє ŧø bє αłøиє--- I łøvє ŧø łivє---

A: 

Just keep it in UTF-8.

Oded
But what's wrong with that string? I mean, why doesn't it display properly? It's something submitted by a user of the website, so I don't know what it actually is.
atif089
Impossible to say without knowing the original encoding. BTW, if you explain the reasons for wanting to do something, you are more likely to get an answer that will suit your needs.
Oded
+1  A: 

Using UTF-8 is just fine, but here are a few checkpoints.

If you are using MySQL, set the database, table, and field collations to utf8_unicode_ci.

If you are using PHP, run mysql_query('SET NAMES utf8'); before your query.

In the HTML output, use

<meta http-equiv="content-type" content="text/html; charset=utf-8" />
S.Mark
It works halfway: SELECT statements are fine, but when I INSERT the same value everything goes wrong and this is shown: Æ
atif089
BTW, mysql_query('SET NAMES utf8'); needs to run before the INSERT too. If you use phpMyAdmin, you can check there whether the value was stored properly in the database after you inserted it.
S.Mark
+1  A: 

It might be more than a problem of choosing a display character set. That string unfortunately has a lot of replacement characters (�), which indicates that it's already gone through a process where characters have been lost because the incoming encoding wasn't understood. Even the fragment "�" is probably the replacement character in utf8 viewed through a single-byte encoding.
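A minimal Python sketch of the difference between a reversible misreading and real loss. The sample string, the function names, and the choice of Latin-1 as the "wrong" single-byte encoding are assumptions for illustration; nothing here is taken from the site in question.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode replacement character


def misread(text: str, wrong_encoding: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def undo_misread(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Reverse misread(); only works if no bytes were dropped or replaced."""
    return garbled.encode(wrong_encoding).decode("utf-8")


def lossy_decode(raw: bytes, encoding: str) -> str:
    """Decode, substituting U+FFFD for every byte the decoder rejects."""
    return raw.decode(encoding, errors="replace")


original = "αм"                      # two characters, four UTF-8 bytes
garbled = misread(original)          # four Latin-1 characters, still reversible
recovered = undo_misread(garbled)    # the original string again
lost = lossy_decode(original.encode("utf-8"), "ascii")

print(recovered == original)         # the misreading alone loses nothing
print(REPLACEMENT in lost)           # the lossy path leaves U+FFFD behind
```

Once U+FFFD has been written to storage, the original bytes are gone, so no choice of display encoding can bring the text back.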

To check the quality of the information in the database, can you append the output of, say, select charset(colname), hex(left(colname, 20)) to the question?
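One way to read the hex output of such a query is sketched below in Python. The function name classify_hex and its category labels are hypothetical, and the double-encoding check assumes Latin-1 was the intermediate encoding, which may not match a real setup (cp1252 is another common culprit).

```python
def classify_hex(hex_string: str) -> str:
    """Roughly classify bytes given as a hex string (e.g. from MySQL HEX())."""
    raw = bytes.fromhex(hex_string)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"
    if "\ufffd" in text:
        return "contains replacement characters"
    if text.isascii():
        return "plain ASCII"
    try:
        # If the characters map back to bytes that are themselves valid
        # UTF-8, the text was probably encoded twice.
        text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "plain UTF-8"
    return "possibly double-encoded UTF-8"


print(classify_hex("CEB1"))       # the two UTF-8 bytes of the letter alpha
print(classify_hex("C38EC2B1"))   # the same letter after one extra encoding pass
print(classify_hex("EFBFBD"))     # the UTF-8 bytes of U+FFFD itself
```

This is a heuristic only: short strings can pass the double-encoding check by coincidence, so it is best used on a handful of rows rather than a single value.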

d__
A: 

Users on your site could be entering characters in an encoding other than UTF-8, such as Big5 or JIS. This is a problem: you need to either enforce that they're entering UTF-8, or somehow detect the character set they've used and then convert it to UTF-8. Every locale has a default character set. For example, if a user tells you that they want a Japanese interface, it's likely they're using something like JIS, and you might be able to convert JIS to UTF-8 on the way in, and then UTF-8 to JIS on the way out. If you can't convert, just make sure you write the UTF-8 directive into your page's meta tag (if your interface is HTML), and enforce that only valid UTF-8 makes it into your database.
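The "validate, otherwise convert" idea can be sketched in Python as follows. The function name to_utf8 and the declared_encoding parameter are hypothetical; in practice the declared encoding would have to come from somewhere you trust, such as the request's charset or the user's locale setting.

```python
def to_utf8(raw: bytes, declared_encoding: str = "utf-8") -> bytes:
    """Return raw as UTF-8 bytes, converting from declared_encoding if needed.

    Raises UnicodeDecodeError when the input is valid in neither encoding,
    so the caller can reject it instead of storing damaged text.
    """
    try:
        raw.decode("utf-8")          # strict check: already valid UTF-8?
        return raw
    except UnicodeDecodeError:
        pass
    # Fall back to the encoding the client claims to use (an assumption).
    return raw.decode(declared_encoding).encode("utf-8")


japanese = "日本語"
incoming = japanese.encode("shift_jis")       # what a Shift_JIS client might send
stored = to_utf8(incoming, "shift_jis")
print(stored.decode("utf-8") == japanese)
```

Note the limitation: some byte sequences in legacy encodings also happen to be valid UTF-8, so the strict check cannot prove the input was meant as UTF-8. It only guarantees that nothing invalid gets stored.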

A: 

You may want to use the following conversion functions for UTF-8 handling:

utf8_decode
utf8_encode
iconv
Sarfraz
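For readers unfamiliar with these PHP functions: utf8_encode converts ISO-8859-1 to UTF-8 and utf8_decode converts UTF-8 to ISO-8859-1, so neither is a general-purpose fixer. The Python sketch below mimics that behaviour to show the pitfall; the function names are hypothetical stand-ins, not the PHP implementations.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Rough analogue of utf8_encode: treat input as ISO-8859-1."""
    return data.decode("latin-1").encode("utf-8")


def utf8_to_latin1(data: bytes) -> bytes:
    """Rough analogue of utf8_decode: unrepresentable characters become '?'."""
    return data.decode("utf-8").encode("latin-1", errors="replace")


already_utf8 = "é".encode("utf-8")              # bytes C3 A9
double_encoded = latin1_to_utf8(already_utf8)   # bytes C3 83 C2 A9
print(double_encoded.hex())

greek = "α".encode("utf-8")
print(utf8_to_latin1(greek))                    # alpha has no Latin-1 equivalent
```

Running the ISO-8859-1 to UTF-8 conversion on text that is already UTF-8 double-encodes it, and converting non-Latin text down to ISO-8859-1 discards it. For text like the sample in the question, a conversion with an explicit source and target encoding (iconv in PHP) is the safer tool.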