I changed from latin1 to utf8. Although all sorts of text displayed fine, I noticed that non-English characters were stored in the database as weird symbols. I spent a day trying to fix that, and now non-English characters are stored as themselves in the database and display the same in the browser. However, I noticed that apostrophes are stored as &#39; and exclamation marks as &#33;. Is this normal, or should they appear as ' and ! in the database instead? If so, what would I need to do to fix that?

A: 

It really depends on what you intend to do with the contents of the database. If your invariant is that "contents of the database are sanitized and may be placed directly in a web page without further validation/sanitization", then having &amp; and other HTML entities in your database makes perfect sense. If, on the other hand, your database is meant to store only the raw original data, and you intend to process and sanitize it before displaying it in HTML, then you should probably replace these entities with the original characters, encoded as UTF-8. So it comes down to how you interpret your database content.
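To make the second option concrete, here is a minimal sketch in Python (the forum script in question is presumably PHP, so treat this as an illustration of the idea rather than its actual code). It assumes the database holds raw text and that escaping happens only when the text is written into a page; `render_post` is a hypothetical helper name.

```python
import html

def render_post(raw_text: str) -> str:
    """Escape raw stored text at output time (hypothetical helper)."""
    # html.escape converts &, <, > and, with quote=True, both quote characters.
    return "<p>" + html.escape(raw_text, quote=True) + "</p>"

# What the database would hold under the "raw data" invariant.
stored = "Don't panic! <b>café</b> & co."

rendered = render_post(stored)
print(rendered)
```

With this approach the stored value stays readable and searchable, and non-HTML consumers (email, exports) never see entities.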

Michael Aaron Safyan
The application is a paid forum software script. In that case, do you think it's necessary to store the entities?
rein
A: 

The &#XX; forms are HTML character entities, which implies the values were passed through a function such as PHP's htmlspecialchars or htmlentities before being stored. If the values are processed within an HTML document (or by any HTML processor, regardless of what they're a part of), they should display fine. Outside of that, they won't.

This means you probably don't want to keep them encoded as HTML entities. You can convert the values back using the counterpart to the function that encoded them (e.g. html_entity_decode), which takes an argument specifying the encoding to convert to. Once you've done that, check some of the previously problematic entries, making sure you're using the correct encoding to view them.
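As a rough illustration of that decoding step, the sketch below uses Python's `html.unescape`, which plays a role similar to PHP's html_entity_decode. The sample string and the `decode_entities` name are made up for the example; in practice you would run something like this over each affected row and write the result back.

```python
import html

def decode_entities(value: str) -> str:
    """Turn HTML character entities back into the characters they stand for."""
    return html.unescape(value)

# A value as it might look in the database after being entity-encoded.
stored = "Don&#39;t panic&#33; Caf&eacute; &amp; bar"

decoded = decode_entities(stored)
print(decoded)  # Don't panic! Café & bar
```

Decode only once per value: text that legitimately contains something like `&amp;` after decoding would be damaged by a second pass.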

If you're still having problems, there's a mismatch between the encoding the stored values are supposed to use and the one they actually use. You'll have to figure out what they actually use, and then fix them by pulling them from the DB and either converting them to the target encoding before re-inserting them, or re-inserting them with the encoding that they actually use. A similar alternative to the latter is to convert the columns to BLOBs, change the column character set, change the column type back to a text type, and then convert the column directly to the desired character encoding. The reason for this unwieldy sequence is that text types are converted when the character encoding changes, but binary types aren't.
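The kind of mismatch described here can be reproduced outside the database. The following sketch assumes the common case from the question: UTF-8 bytes that were interpreted as latin1. `repair_mojibake` is a hypothetical helper, and the encoding names are assumptions you would replace once you know what the data actually uses.

```python
def repair_mojibake(garbled: str, wrong: str = "latin-1", actual: str = "utf-8") -> str:
    """Undo a wrong decode: recover the original bytes, then decode them correctly."""
    return garbled.encode(wrong).decode(actual)

original = "café"

# Simulate the fault: UTF-8 bytes read back as if they were latin1.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)                   # cafÃ©

print(repair_mojibake(garbled))  # café
```

The BLOB round trip works on the same principle: the bytes are left untouched while the label saying how to interpret them is changed. Note that MySQL's latin1 is closer to Windows-1252 than to ISO 8859-1, so `cp1252` may be the better value for `wrong` on some data.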

Read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" for more on character encodings in general, and § 9.1.4. of the MySQL manual, "Connection Character Sets and Collations", for how encodings are used in MySQL.

outis
The application I'm using is a forum script and I'm not too familiar with the code. Does that mean that they are doing something incorrectly? In other words, is it a problem with the application, or with how I set up my db? Is there an easy way to check whether it's the app, db, server, etc. that is the problem?
rein