Hi,

Another utf-8 related problem I believe...

I am using PHP to update data in a MySQL db and then display that data elsewhere in the site. I have run into UTF-8 problems before, where special characters are displayed as question marks when viewed in a browser, but this one seems slightly different.

I have a number of records to enter that contain the è character. If I enter this directly in the db then it appears correctly on the page so I take this to mean that utf-8 content is being output correctly.

However, when I try to update the values in the db through PHP, the è character is replaced. What appears instead is & Atilde ; & uml ; (without the spaces), which appears in the browser as Ã¨

I have the tables in the database set to use UTF-8. I believe this is correct because, as mentioned, if I update the db through phpMyAdmin, it's all OK. Similarly, I have set the character encoding for the page, which seems to be correct. I am also running the SQL statement "SET NAMES 'utf8';" before trying to update the db.

Anyone have any other ideas as to where the problem may lie?

Many thanks

A: 

I suppose you're taking the results of some form submission and inserting the results in the database. If so, you must ensure that you instruct the browser to send UTF-8 data and you should validate the user input for a valid UTF-8 stream.

Change your form element to include accept-charset:

<form accept-charset="utf-8" method="post" ... >
    <input type="text" name="field" />
    ...
</form>

Validate the data with:

$valid = array_key_exists("field", $_POST) && !is_array($_POST['field']) &&
    preg_match('//u', $_POST['field']) && ...; //check length with mb_strlen etc.
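
For illustration, the same check could be wrapped in a small helper. This is only a sketch: the function name valid_utf8_field and the 255-character limit are assumptions made for the example, not part of the original code, and it needs the mbstring extension.

```php
<?php
// Hypothetical helper: the name and the default length limit are assumptions.
function valid_utf8_field(array $input, string $key, int $maxChars = 255): bool
{
    // Reject a missing key, or an array value such as field[]=x.
    if (!array_key_exists($key, $input) || !is_string($input[$key])) {
        return false;
    }
    $value = $input[$key];
    // With the /u modifier, preg_match() returns false when the subject
    // is not well-formed UTF-8, and 1 when the empty pattern matches.
    if (preg_match('//u', $value) !== 1) {
        return false;
    }
    // Count characters rather than bytes.
    return mb_strlen($value, 'UTF-8') <= $maxChars;
}
```

It would be called as valid_utf8_field($_POST, 'field') before the value is passed on to the database layer.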
Artefacto
+2  A: 

Yup.

The character you have is LATIN SMALL LETTER E WITH GRAVE. As you can see, in UTF-8 that character is encoded into two bytes 0xC3 and 0xA8.

But in many default western encodings (such as ISO-8859-1), which are single-byte only, this multi-byte character is decoded as two separate characters, LATIN CAPITAL LETTER A WITH TILDE and DIAERESIS. Notice that in ISO-8859-1 they are encoded as C3 and A8 respectively, the same two bytes.
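
A quick way to see this for yourself is the short sketch below. The variable names are made up for the example and it needs the mbstring extension. It takes the two UTF-8 bytes of è and deliberately misreads them as ISO-8859-1:

```php
<?php
$e = "\xC3\xA8";        // è encoded as UTF-8: two bytes
$hex = bin2hex($e);     // "c3a8"

// Pretend the same two bytes are ISO-8859-1 and convert them to UTF-8,
// so the misreading can be shown on a UTF-8 page or terminal.
$misread = mb_convert_encoding($e, 'UTF-8', 'ISO-8859-1');

echo $hex, "\n";                          // c3a8
echo $misread, "\n";                      // Ã¨
echo mb_strlen($misread, 'UTF-8'), "\n";  // 2 (two characters instead of one)
```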

Furthermore, it looks like PHP is processing these characters through htmlentities(), which results in the &Atilde; and &uml; respectively.

So, where exactly is the problem in your code? Well, htmlentities() could be doing it all by itself, since its 3rd argument is an encoding name - which you may not have properly set to 'UTF-8'. But it could be some other string processing function as well. (Note: As a general rule, it's a bad idea to store HTML entities in the database - this step should be reserved for the time of display.)
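
To make the effect of that 3rd argument concrete, here is a minimal sketch. The variable names are just for the example, and the encoding is passed explicitly both times instead of relying on whatever default your PHP version uses:

```php
<?php
$e = "\xC3\xA8"; // è encoded as UTF-8

// Wrong encoding: each byte is treated as its own ISO-8859-1 character.
$wrong = htmlentities($e, ENT_QUOTES, 'ISO-8859-1');

// Right encoding: the two bytes are recognised as one character.
$right = htmlentities($e, ENT_QUOTES, 'UTF-8');

echo $wrong, "\n"; // &Atilde;&uml;
echo $right, "\n"; // &egrave;
```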

There are a bunch of other ways to trip yourself up with UTF-8 in PHP - I suggest hitting up the cheatsheet and making sure you're in good shape.

Peter Bailey
Yup. A rather lengthy way to say "get rid of htmlentities".
Col. Shrapnel
I always like to explain exactly what's going on when encodings are involved. Anything I can do to elevate understanding is a win in my book.
Peter Bailey
Cheers for that. Much appreciated
Addsy
+1  A: 

Well, it is your own code that converts characters into entities.
To make it right:

  1. Ban htmlentities function from your scripts forever.
  2. Use htmlspecialchars, not on insert but when displaying data.
  3. Repair existing data in the database using html_entity_decode.
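
Steps 2 and 3 might look roughly like the sketch below. It assumes the damaged values were produced by an ISO-8859-1 misreading of UTF-8 text, as in the question, which is why the entities are decoded back to ISO-8859-1 bytes: those bytes are the original UTF-8. The variable names are made up, and you should try it on a copy of the data first.

```php
<?php
// Step 3: repair a stored value such as the one from the question.
$stored = '&Atilde;&uml;';
// Decoding to ISO-8859-1 gives the single bytes C3 and A8,
// which together are the UTF-8 encoding of è.
$repaired = html_entity_decode($stored, ENT_QUOTES, 'ISO-8859-1');

// Step 2: escape only when printing, and name the encoding explicitly.
$safe = htmlspecialchars('<b>' . $repaired . '</b>', ENT_QUOTES, 'UTF-8');

echo bin2hex($repaired), "\n"; // c3a8
echo $safe, "\n";              // &lt;b&gt;è&lt;/b&gt;
```

Note that decoding with 'UTF-8' as the target would instead give the two characters Ã¨, so it would not undo the damage.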
Col. Shrapnel
A: 

I think you are missing the Content-Type declaration on the HTML page:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

If you don't have it, the browser will guess the encoding, and convert any characters outside of that encoding to entities when posting a form.

Marek