I have started debugging my RSS feed because it has some strange characters in it (i.e. the missing-character glyph). I started with two excellent beginner resources:
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets: http://www.joelonsoftware.com/articles/Unicode.html
- Character Sets / Character Encoding Issues: http://www.phpwact.org/php/i18n/charsets
I believe our RSS feed is having problems because users are copying and pasting MS Word documents into a textarea on the site, and our PHP pages use the "iso-8859-1" charset, which is incompatible with the special "Windows-1252" encodings for things like the bullet points and smart quotes used by MS Word.
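To make the suspected mismatch concrete, here is a minimal sketch in Python (chosen purely for illustration; the site itself is PHP). The `samples` dictionary and the `describe` helper are hypothetical names, and the three bytes are just a few of the Windows-1252 positions that Word-style punctuation occupies. It shows that the same byte decodes to a printable character under Windows-1252 but to an invisible C1 control character under ISO-8859-1, which would be consistent with the missing-character glyph in the feed.

```python
# Bytes that Word-style "smart" punctuation occupies in Windows-1252.
samples = {
    "left double quote": b"\x93",
    "right double quote": b"\x94",
    "bullet": b"\x95",
}


def describe(raw: bytes) -> dict:
    """Decode one byte under both charsets and report the resulting code points."""
    as_cp1252 = raw.decode("cp1252")   # what the user actually typed
    as_latin1 = raw.decode("latin-1")  # what an ISO-8859-1 reader assumes
    return {
        "cp1252": f"U+{ord(as_cp1252):04X}",
        "latin1": f"U+{ord(as_latin1):04X}",
        # The same character once it is correctly re-encoded as UTF-8.
        "utf8_bytes": as_cp1252.encode("utf-8"),
    }


results = {name: describe(raw) for name, raw in samples.items()}
for name, info in results.items():
    print(name, info)
```

Under ISO-8859-1 the bytes 0x80 to 0x9F are control characters, so a feed reader that trusts the declared charset has nothing printable to draw for them.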
So I'm hoping that all I need to do to fix the issue is start using "utf-8" in the pages that take or display user input, i.e. set the following in the HEAD section:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
The real reason I'm raising this question, though, is that the DB fields that store my user input use the "latin1_swedish_ci" collation, and I want to know whether I NEED to convert them to "utf8_general_ci". MySQL doesn't really care about the charset, does it? It just sees a bunch of bytes, and if I put Unicode into a field collated as Latin, it'll still come back out as Unicode, right? Changing the field will be tiresome because it is part of a FULLTEXT index whose other fields would also need their collation changed, which means dropping the index and rebuilding it (no small task when there are large amounts of TEXT involved).