I'm looking into how characters outside a page's character set are handled.

In this case the page is set to ISO-8859-1, and the previous programmer decided to escape input using htmlentities($string, ENT_COMPAT). The result is then stored in Latin1 tables in MySQL.

As the table is set to the same character set as the page, I am wondering if that htmlentities step is needed. I did some experiments on http://floris.workingweb.nl/experiments/characters.php and it seems that for stuff inside Latin1 some characters are escaped, but for example with a Czech name they are not.

Is this because those characters are outside of Latin1? If so, then the htmlentities can be removed, as it doesn't help for stuff outside of Latin1 anyway, and for within Latin1 it is not needed as far as I can see now...

A: 

htmlentities only translates characters it knows about (get_html_translation_table(HTML_ENTITIES) returns the whole list) and leaves the rest as is. So you're right: using it for non-Latin data makes no sense. Moreover, both HTML-encoding database entries and using latin1 are bad ideas, and I'd suggest getting rid of both.
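A small sketch of that behaviour, assuming PHP 5.3.4 or later (needed for the encoding argument of get_html_translation_table). The byte strings and variable names are made up for illustration, and the second call switches to UTF-8 only to get a character (č) that has no HTML 4.01 named entity; that is not the setup from the question.

```php
<?php
// Byte escapes keep the example independent of this file's own encoding.
$table = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'ISO-8859-1');

// "é" is byte 0xE9 in ISO-8859-1 and has a named entity, so it is translated.
$latin = htmlentities("caf\xE9", ENT_COMPAT, 'ISO-8859-1');
$known = isset($table["\xE9"]);

// "č" (U+010D, bytes C4 8D in UTF-8) has no HTML 4.01 named entity,
// so htmlentities() passes it through unchanged.
$czech = htmlentities("\xC4\x8D", ENT_COMPAT, 'UTF-8');

var_dump($latin, $known, $czech === "\xC4\x8D");
```

Passing the flags and charset explicitly avoids depending on defaults, which have changed between PHP versions.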

A word of warning: after removing htmlentities(), remember that you still need to escape quotes in the data you're going to insert into the DB (mysql_real_escape_string or similar; the older mysql_escape_string ignores the connection's character set).
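As one possible alternative to calling an escaping function by hand, here is a sketch using a PDO prepared statement. This is a swapped-in technique, not what the original code does. It assumes the pdo_sqlite extension is available; the in-memory SQLite database and the `people` table are hypothetical stand-ins for the MySQL table from the question.

```php
<?php
// Assumption: an in-memory SQLite database stands in for the MySQL table.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE people (name TEXT)'); // hypothetical table

// Raw input containing a quote; no htmlentities() involved.
$name = "O'Brien";

// The placeholder keeps the value out of the SQL text,
// so the quote needs no manual escaping.
$insert = $pdo->prepare('INSERT INTO people (name) VALUES (?)');
$insert->execute(array($name));

// Read the value back to confirm it was stored unmodified.
$stored = $pdo->query('SELECT name FROM people')->fetchColumn();
var_dump($stored === $name);
```

The value goes into the database exactly as entered, and HTML escaping is left for the moment the text is printed into a page.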

stereofrog
Thanks, this is what I was looking for. As for the other comments: I know about UTF-8, but that's for later. For now I need to fix the problem at hand, which is getting rid of the escaped stuff in the database, and I needed to know if I was on the right track.
Maarten
Yes, HTML-encoded data in the database is a huge code smell. `htmlspecialchars` should be called on putting text into an HTML page, not anything to do with the data layer. Get rid!
bobince
@Maarten: don't forget that your data still needs escaping (see answer update).
stereofrog
A: 

He could have used it as a basic safety precaution, i.e. to prevent users from inserting HTML/JavaScript into the input (because < and > will be escaped as well).
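To make that precaution concrete, a short sketch assuming the ISO-8859-1 page charset from the question; `$input` is a made-up value.

```php
<?php
// Hypothetical user input containing markup.
$input = '<script>alert("x")</script>';

// htmlspecialchars() only touches &, <, > and (with ENT_COMPAT) double quotes.
$special = htmlspecialchars($input, ENT_COMPAT, 'ISO-8859-1');

// htmlentities() gives the same result here, because nothing else
// in this input has a named entity.
$entities = htmlentities($input, ENT_COMPAT, 'ISO-8859-1');

echo $special, "\n";
var_dump($special === $entities);
```

Either function neutralises the tags; the difference only shows up for characters such as é, which htmlentities() also rewrites.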

BTW, if you want to support Eastern and Western European languages, I would suggest using UTF-8 as the default character encoding.

wimvds
For security reasons htmlspecialchars should be used instead, and not at insert time but at display time.
Col. Shrapnel
Agreed: don't make a mess of the input if you can avoid it; only filter for SQL injection.
Maarten
"only filter on sql injection" Err, you have heard of XSS attacks, right? There's more to security than checking for SQL injection. BTW, it's just a guess at what the coder's motives might have been for using htmlentities, not my own view on how to implement security...
wimvds
A: 

Yes, though not because Czech characters are outside of Latin1 but because they share the same places in the table. So the database takes them as the corresponding Latin1 characters.

Using htmlentities is always bad. The only proper solution for storing different languages is to use the UTF-8 charset.

Col. Shrapnel
bobince
Thanks a lot, my bad, I meant entities.
Col. Shrapnel
A: 

Take note that htmlentities / htmlspecialchars have a 3rd parameter (since PHP 4.1.0) for the charset. ISO-8859-1 was the default before PHP 5.4 (UTF-8 since then), so if you apply htmlentities without a 3rd parameter to a UTF-8 string on those older versions, for example, the output will be corrupted.
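A minimal sketch of that corruption, passing the charset explicitly so the result does not depend on the PHP version's default; the variable names are illustrative.

```php
<?php
// "é" encoded as UTF-8 is the two bytes C3 A9.
$utf8 = "\xC3\xA9";

// Wrong charset: each byte is treated as its own Latin1 character (Ã and ©).
$wrong = htmlentities($utf8, ENT_COMPAT, 'ISO-8859-1');

// Matching charset: the two bytes are read as a single character.
$right = htmlentities($utf8, ENT_COMPAT, 'UTF-8');

var_dump($wrong, $right);
```

The first call yields two entities for what was one character, which is the garbage that ends up on the page or in the database.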

You can detect and convert the input string with mb_detect_encoding and mb_convert_encoding to make sure the input string matches the desired charset.
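One way this might look, as a rough sketch. It assumes the mbstring extension is loaded and that ISO-8859-1 is the target charset. It uses mb_check_encoding (a validity check) rather than mb_detect_encoding, which is a substitution on my part, and `to_latin1` is a hypothetical helper name. The check is a heuristic: a Latin1 string whose bytes happen to form valid UTF-8 would be misread.

```php
<?php
// Hypothetical helper; assumes mbstring is available.
function to_latin1($input)
{
    // Input that is valid UTF-8 is treated as UTF-8 and converted;
    // anything else is assumed to be Latin1 already and returned untouched.
    if (mb_check_encoding($input, 'UTF-8')) {
        return mb_convert_encoding($input, 'ISO-8859-1', 'UTF-8');
    }
    return $input;
}

// UTF-8 "é" (C3 A9) becomes the single Latin1 byte E9.
var_dump(bin2hex(to_latin1("\xC3\xA9")));
// A lone E9 byte is not valid UTF-8, so it is left alone.
var_dump(bin2hex(to_latin1("\xE9")));
```

Plain ASCII passes through unchanged either way, since it is identical in both encodings.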

AlexV
mb_detect_encoding can never be trusted, and it is useless too. The Content-Type of the page is enough.
Col. Shrapnel
Content-Type is usually enough, but if the input is user-defined, the string can be in a different charset than the Content-Type specifies.
AlexV