views:

219

answers:

3

I have a web-form that is getting customer information. The form is processed using PHP 5.1.6 so using the filter_input() will not work.

The form has a text area along with some other standard fields. I am getting lots of strange formatting codes that are being stored in our MySQL database. How can I filter these results where they look take the formatting from the string?

Before
Where is the babyâ??s room? I think itâ??s in the north side of the house. Do you know where it is?

After (this is what I would like to see)
Where is the baby's room? I think it's in the north side of the house. Do you know where it is?

+1  A: 

They are sending utf8 strings to your database and your database is probably expecting latin. First make sure that your table is set to utf8, if it isn't set it as utf8. Then try running this query before anything else:

set names utf8

This will set the connection's charset to utf8. Then you can run your other queries.

And read this:

http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html

Galen
A: 

After reading your post the only thing i can tell you is to check your form tag and add the "accept-charset" attribute to the value you want.

Since this is a problem you have with the data posted from the form then you might want to check it.

If you have this problem after you store data in the database table then you might want to check with the database table string collations/charsets.

andreas
Don't use `accept-charset`. It doesn't work (and is actively harmful) on IE. Instead set the charset of the page containing the form to the one you want for the submission.
bobince
+1  A: 

This is almost certainly due to somebody cut-and-pasting from an incompatable character set.

Best guess, they're doing it from Microsoft Word which uses non-ASCII unicode symbols for quotes and speech marks, rather than the normal friendly ASCII ones. UTF-8 encodes as a series of two or more bytes from 80-FF, but often webpages are set to "latin-1" which treats those bytes as individual characters, usually a-z with accents. If you see 2 or 3 characters where you'd expect one, and some of them are weird latin symbols with accents, it almost always means your data is utf-8 but it's being rendered by something as latin-1.

The exception is when it was typed in as utf-8 and the system expected latin-1 in the first place. Make sure your HTML form sets a character type explicitly.

In general, I would recommend you make all your web-based tools use "utf-8" without a damn good reason not to.

Update: This is a little icky, but I suggest you don't worry about the database charset too much. Really we should just default to utf-8, but if you put utf-8 data into a latin-1 table, what comes back out will still be utf-8, so long as nothing asks the database what it thinks its sending you. What I'm saying is that it's unlikely to by MySQL causing the problem. It's either the cut-and-paste, the browser, the web form, your code to read from the form and write to the db, or also quite possible, your code to read from the db and render HTML.

Try checking the character encoding of the page with the iffy chars in. Try setting it to utf-8 or forcing utf-8 in the browser. View->Character Encoding in firefox.

Christopher Gutteridge
+1 for UTF-8. It's nearly 2010, we shouldn't still be writing applications that can't Unicode.
bobince