Hello guys.

I need help with a character encoding problem that I want to sort out once and for all. Here is an example of some content which I pull from an XML feed, insert into my database and then pull out again: http://pastebin.com/d78d24f33 As you can see, a lot of special HTML characters get corrupted/broken.

How can I once and for all stop this? How am I able to support all types of characters, etc?

I've tried literally every piece of code I can find. Sometimes it fixes most of the characters, but others are still corrupted.

Thanks guys.

+2  A: 

My favorite article about encodings from JoelOnSoftware: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

Paul G.
Quite useful but still kinda not helping me so far :( I need some PHP specific advice
James
That Joel article is/was helpful, but it sort of sidesteps the fact that Unicode support is "broken/wonky" in so many products that it's really hard to define what being a good citizen is, much less implement it.
Alan Storm
A: 

First off, make sure your database's character encoding is set to support UTF-8. Secondly, PHP's iconv extension is going to be your friend. Finally, ensure that your response headers are sending the proper character encoding (again, UTF-8).
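
As a rough sketch of the iconv step: the helper below converts a string to UTF-8 only when it is not already valid UTF-8. The function name `to_utf8` and the default source encoding of ISO-8859-1 are assumptions made for illustration, and the check relies on the mbstring extension being available.

```php
<?php
// Hypothetical helper (not from the answer): convert text to UTF-8 with iconv.
// The default source encoding, ISO-8859-1, is an assumption about the feed.
function to_utf8(string $text, string $fromEncoding = 'ISO-8859-1'): string
{
    // Leave the text alone if it is already valid UTF-8,
    // so that it does not get encoded a second time.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    $converted = iconv($fromEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $fromEncoding to UTF-8");
    }

    return $converted;
}

// Usage: "caf\xE9" is "café" in ISO-8859-1.
$clean = to_utf8("caf\xE9");
```

The response header part is a single call made before any output is sent: header('Content-Type: text/html; charset=UTF-8');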

Jordan S. Jones
+1  A: 

It seems that UTF-8 encoded text is being interpreted as ISO 8859-1.

If you’re processing XML documents, you have to use the encoding given either in the charset parameter of the HTTP header field Content-Type or in the encoding attribute of the XML declaration. If neither is given, the XML specification declares UTF-8 or UTF-16 as the default character encoding and you have to use some detection to tell them apart.
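
A minimal sketch of that lookup, assuming the HTTP charset should win over the XML declaration when both are present (that ordering is my assumption). The function name `detect_xml_encoding` is made up for this example, and the UTF-16 check only looks for a byte order mark.

```php
<?php
// Hypothetical helper: decide which encoding to use for an XML document.
// Order used here (an assumption): HTTP charset, XML declaration, then default.
function detect_xml_encoding(string $xml, ?string $contentType = null): string
{
    // 1. charset parameter of the HTTP Content-Type header, if one was sent.
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([A-Za-z0-9._-]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }

    // 2. encoding attribute of the XML declaration (an optional UTF-8 BOM is skipped).
    $declaration = '/^(?:\xEF\xBB\xBF)?\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/i';
    if (preg_match($declaration, $xml, $m)) {
        return strtoupper($m[1]);
    }

    // 3. Nothing declared: a UTF-16 byte order mark means UTF-16, otherwise UTF-8.
    if (strncmp($xml, "\xFF\xFE", 2) === 0 || strncmp($xml, "\xFE\xFF", 2) === 0) {
        return 'UTF-16';
    }

    return 'UTF-8';
}
```

The returned name can then be passed as the source encoding to iconv() or mb_convert_encoding() when converting the feed to UTF-8.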

Gumbo
A: 

Did you try utf8_encode() and utf8_decode()?

Which one you use will depend entirely on how your data is encoded, which you don't specify, but they are quite useful for this kind of case.
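
To illustrate why the direction matters, here is a small sketch of what happens when the conversion is applied to data that is already UTF-8. It uses mb_convert_encoding() between ISO-8859-1 and UTF-8, which is the same conversion utf8_encode() and utf8_decode() perform; those two functions are deprecated as of PHP 8.2. The sample string is made up.

```php
<?php
// "café" as ISO-8859-1 bytes: the é is the single byte 0xE9.
$latin1 = "caf\xE9";

// Same conversion as utf8_encode(): ISO-8859-1 -> UTF-8.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Applying it again to data that is already UTF-8 double-encodes it.
// The result shows up in a browser as "cafÃ©".
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Same conversion as utf8_decode(): UTF-8 -> ISO-8859-1.
// Applied to the double-encoded string, it undoes one level of encoding.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
```

If the broken output looks like the "Ã©" pattern, the data was most likely encoded one time too many somewhere between the feed, the database and the page.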

Seb
A: 
header('Content-type: text/html; charset=UTF-8') ;

/**
 * Encodes HTML safely for UTF-8. Use instead of htmlentities. 
 *
 * @param string $var 
 * @return string 
 */
function html_encode($var)
{
    return htmlentities($var, ENT_QUOTES, 'UTF-8');
}

Those two rescued me and I think it is now working. I'll come back if I continue to encounter problems. Should I store it in the DB, e.g. as "&amp;" or as "&"?

James
Why do you use character references when UTF-8 can encode every character? Using `htmlspecialchars` to replace just the HTML special characters will suffice, if you really have to replace them.
Gumbo
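
For comparison, a short sketch of the difference the comment above points at. The sample string is made up; both calls use the same flags as the `html_encode` function in the answer.

```php
<?php
// "Café & <b>crème</b>" written with explicit UTF-8 byte escapes,
// so the example does not depend on how this file itself is saved.
$text = "Caf\xC3\xA9 & <b>cr\xC3\xA8me</b>";

// Escapes only the characters that are special in HTML (&, <, >, quotes).
// The accented letters stay as UTF-8 bytes.
$minimal = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Additionally replaces every character that has a named entity,
// e.g. é becomes &eacute;.
$verbose = htmlentities($text, ENT_QUOTES, 'UTF-8');
```

When the page is served as UTF-8, both versions render the same in the browser; the htmlspecialchars() output is simply shorter and easier to read in the source.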
+1  A: 

It looks like the link you gave has data that is encoded in utf-8. (Follow that link, then change the encoding of your browser to utf-8).

It sounds like you are having problems with inserting into and retrieving from your database. Make sure your database table has UTF-8 set as its encoding.

John
+2  A: 

To absolutely once and for all make sure you will never have problems with encoding again:

Use UTF-8 everywhere and on everything!

That is (if you use mysql and php):

  • Set all the tables in your database to collation "utf8_general_ci" for example.
  • Once you establish the database connection, run the following SQL query: "SET NAMES 'utf8'"
  • Always make sure the settings of your editor are set to UTF-8 encoding.
  • Have the following meta tag in the <head> section of your HTML documents:

    <meta http-equiv="content-type" content="text/html; charset=utf-8">
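
If you connect through PDO instead of running "SET NAMES" by hand, the connection charset can be requested in the DSN. The sketch below only builds the DSN string; the helper name `mysql_utf8_dsn`, the host, the database name and the credentials are placeholders. The default of 'utf8' follows the list above; MySQL also offers 'utf8mb4', which covers characters outside the Basic Multilingual Plane.

```php
<?php
// Hypothetical helper: build a PDO DSN for MySQL that asks for a UTF-8 connection.
// The charset part of the DSN is honoured by PDO since PHP 5.3.6.
function mysql_utf8_dsn(string $host, string $dbname, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Usage (placeholder credentials, left commented out because it needs a real server):
// $pdo = new PDO(mysql_utf8_dsn('localhost', 'blog'), 'user', 'secret');
```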

And a couple of bonus tips:

Petrunov
A: 

After you connect to the database, but before you do any transactions, execute the following line which makes sure all database communication is in UTF-8:

mysql_query("SET character_set_results = 'utf8', character_set_client = 'utf8', character_set_connection = 'utf8', character_set_database = 'utf8', character_set_server = 'utf8'", $dbconn);
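
Note that the mysql_* functions were removed in PHP 7. A rough equivalent with the mysqli extension might look like the sketch below; the helper name `use_utf8_connection` and the connection details are placeholders, and the charset 'utf8' is taken from the query above.

```php
<?php
// Hypothetical helper: switch an open mysqli connection to the given charset.
function use_utf8_connection(mysqli $db, string $charset = 'utf8'): void
{
    // set_charset() changes the client, connection and results character sets
    // and also tells the client library which encoding is in use.
    if (!$db->set_charset($charset)) {
        throw new RuntimeException('Could not set connection charset: ' . $db->error);
    }
}

// Usage (placeholder credentials, left commented out because it needs a real server):
// $db = new mysqli('localhost', 'user', 'secret', 'blog');
// use_utf8_connection($db);
```

As with the original query, this should run right after connecting and before any other statement.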

Christian