Total encoding confusion! :-)
The table character set
The MySQL table character set only determines what encoding MySQL should use internally, and thus the range of characters permitted.
- If you set it to Latin-1 (aka ISO 8859-1), you will not be able to store international characters in your table.
- Importantly, the character set does not affect the encoding MySQL uses when communicating with your PHP script.
- The table collation specifies rules for sorting.
The connection character set
The MySQL connection character set determines the encoding you receive table data in (and should send data to MySQL in).
- The encoding is set using SET NAMES, e.g.
SET NAMES "utf8"
.
- If this does not match the table encoding, MySQL automatically converts data on the fly.
- If this does not match your page character set, you'll have to manually perform character set conversion in PHP, using e.g. utf8_encode or mb_convert_encoding.
Page character set
The page character set, specified using the Content-Type header, tells the browser how to interpret the PHP script output.
- As an HTTP header, it is not saved when you save the file from within your browser. The information is thus not available to OpenOffice or other programs.
Recommendations
Ideally, you should use the same encoding in all three places, and ideally, that encoding should be UTF-8.
However, CSV will cause problems, since the file format does not include encoding information. It is thus up to the application to guess the encoding, and as you've seen, the guess will be wrong.
- I don't know about OpenOffice, but Microsoft Office will assume the Windows "ANSI" encoding, which usually means Latin-1 (or CP1252 to be specific).
- Microsoft Office will also cause problems in countries that use "," as a decimal separator, since Office then switches to using ";" as a field separator for CSV-files.
Your best bet is to use Latin-1 for the CSV-file. I'd still use UTF-8 for the table and connection character sets though, and also UTF-8 for HTML pages.
If you use UTF-8 for the connection character set (by executing SET NAMES "utf8"
after connecting), you'll need to run the text through utf8_decode to convert to Latin-1.
That entity problem
I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."
This sounds like you're passing HTML code in an XML context, and is unrelated to character sets. Try running the text through html_entity_decode.