Is there a one-stop solution to all character encoding issues? I always seem to have issues somewhere along the line between user input, database storage and data retrieval (HTML forms). I want all my data and web pages to be encoded as UTF-8, but it seems I always end up with an invalid UTF-8 character somewhere.

I don't really understand character encoding too well, but since I started to work with French characters I am forever getting problems. One of the other developers urlencodes everything before sending it to the database and then urldecodes everything again, which makes me shudder.

As I understand it, an HTML form will accept any characters depending on the user's environment, and it's up to the server side to try to convert them to UTF-8 or whatever is preferred?

Any further info will be greatly appreciated!

+1  A: 

Using UTF-8 throughout is the one-stop solution. Unfortunately, making that work still requires understanding the problems that occur in practice. If you have a specific problem, post a specific question on SO.

Wrt. HTML forms: no, it's not really up to the user's environment. The browser will (or should; most actually do) send data in the same encoding as the page that contained the form. Make sure that every HTML page you send to the user has a charset= field in the HTTP Content-Type header; for good measure, also put an http-equiv meta tag into the HTML file itself (which helps in case the user cached or saved the HTML page). So when the HTML page is in UTF-8, the data sent by the browser are also in UTF-8.
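
To make this concrete, here is a rough sketch in Python (not tied to any particular framework). The names `PAGE`, `build_response` and `decode_form`, and the form field `name`, are made up for illustration. It declares UTF-8 both in the Content-Type header and in an http-equiv meta tag, and then decodes the submitted form body as UTF-8, rejecting anything that is not valid UTF-8.

```python
from urllib.parse import parse_qs

# Hypothetical form page; the meta tag repeats the charset from the header.
PAGE = """<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Form</title>
</head>
<body>
<form method="post" action="/submit">
<input type="text" name="name">
<input type="submit">
</form>
</body>
</html>
"""


def build_response():
    """Return (headers, body) for the form page, both declaring UTF-8."""
    body = PAGE.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    return headers, body


def decode_form(raw_body: bytes) -> dict:
    """Decode an application/x-www-form-urlencoded POST body as UTF-8.

    errors="strict" makes an invalid byte sequence raise
    UnicodeDecodeError instead of slipping through to the database.
    """
    fields = parse_qs(raw_body.decode("ascii"), encoding="utf-8", errors="strict")
    # Keep only the first value of each field for simplicity.
    return {key: values[0] for key, values in fields.items()}
```

For example, `decode_form(b"name=Fran%C3%A7ois")` gives `{"name": "François"}`, while a Latin-1 encoded `b"name=Fran%E7ois"` raises `UnicodeDecodeError`, which is usually the sign that some page was served without the UTF-8 charset.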

Martin v. Löwis
Thanks for clearing up the gray area regarding user form input
bananarepub
A: 

In my projects the first query which is sent to my database is

SET NAMES 'utf8';

It is sent right after establishing a MySQL connection.

The same goes for data dumps. When I dump a database to a .sql file, I insert the above query at the beginning.

It has worked for me for a few years without problems, across many hosting companies and dedicated servers.
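
A minimal sketch of that routine in Python, assuming a DB-API 2.0 driver such as PyMySQL. The helper names `open_utf8_session` and `prepend_set_names` are hypothetical, and the connect function is passed in so the sketch does not depend on one specific driver.

```python
def open_utf8_session(connect, **connect_args):
    """Open a connection and make SET NAMES the first statement sent.

    `connect` is any DB-API 2.0 connect function (for example
    pymysql.connect); `connect_args` are passed to it unchanged.
    """
    conn = connect(**connect_args)
    cursor = conn.cursor()
    try:
        # First query on the new connection, as described above.
        cursor.execute("SET NAMES 'utf8'")
    finally:
        cursor.close()
    return conn


def prepend_set_names(dump_sql: str) -> str:
    """Put the same statement at the top of a .sql dump, only once."""
    header = "SET NAMES 'utf8';\n"
    if dump_sql.startswith(header):
        return dump_sql
    return header + dump_sql
```

With PyMySQL this could be called as `open_utf8_session(pymysql.connect, host=..., user=..., password=..., database=...)`, with the real connection details filled in.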

astropanic
I assume the collation for my tables will be "utf8_general_ci"?
bananarepub
That depends on what native language you're using. For example, I'm developing my apps in Poland, so I use utf8_polish_ci, because the Polish alphabet contains accented letters (ą, ę, ć, ł, ó, etc.) and MySQL needs to know, when it sorts text data, that Ł comes after L and so on.
astropanic