MySQL confuses the issue by having collations named after character encodings. They're separate concepts.
A collation determines how the relational operators (<, >, etc.) and ORDER BY clauses sort strings. Issues considered by collations are:
- Are uppercase and lowercase letters considered equivalent?
- Is whitespace significant?
- Do accented letters sort equal to the unaccented versions, after the unaccented versions, or at the end?
- Are digraphs like "ch" and "ll" sorted like separate letters?
- Are Unicode compatibility equivalents like AᴬⒶA treated the same?
Some of these depend on the language.
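To make those questions concrete, here is a rough Python sketch that imitates a few collation choices with sort keys. It is only an approximation for illustration, not how MySQL implements its collations, and the function names (case_insensitive_key, accent_insensitive_key, compatibility_key) are made up for this example.

```python
import unicodedata

def case_insensitive_key(s):
    # casefold() is a stronger lower() meant for caseless comparison
    return s.casefold()

def accent_insensitive_key(s):
    # Decompose accented letters (NFD), then drop the combining marks
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def compatibility_key(s):
    # NFKC maps compatibility variants onto their plain equivalents
    return unicodedata.normalize("NFKC", s)

# "\u00e9clair" is "eclair" with an acute accent on the first letter
words = ["banana", "Cherry", "apple", "\u00e9clair", "eclair"]

binary_order = sorted(words)                              # raw code point order
case_order = sorted(words, key=case_insensitive_key)      # ignores case
accent_order = sorted(words, key=accent_insensitive_key)  # ignores case and accents

# Plain A, modifier letter A, circled A, fullwidth A
variants = ["A", "\u1d2c", "\u24b6", "\uff21"]
folded_variants = {compatibility_key(v) for v in variants}

print(binary_order)
print(case_order)
print(accent_order)
print(folded_variants)
```

The same five strings come out in three different orders depending on the key, which is exactly the kind of decision a collation encodes.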
A character encoding determines how text values get converted to and from byte sequences. For a good introduction, see Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
There are hundreds of different character encodings, most of them specific to a certain combination of operating system and locale. Most of them are supersets of US-ASCII, so if you're damn sure your data will be ASCII-only, it doesn't matter what encoding you use.
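A quick way to convince yourself of the ASCII point is to encode the same ASCII-only string several ways in Python and compare the bytes. This is a small sketch; the particular encodings listed are just ones I picked as examples, with UTF-16 thrown in as one of the exceptions that is not an ASCII superset.

```python
text = "plain ASCII text"

# Encodings that are supersets of US-ASCII (example selection)
ascii_supersets = ["ascii", "latin-1", "cp1252", "utf-8"]
encoded = {name: text.encode(name) for name in ascii_supersets}

# Every superset produces exactly the same bytes for ASCII-only text
all_identical = len(set(encoded.values())) == 1

# Decoding with the matching encoding gives the original text back
round_trip_ok = all(data.decode(name) == text for name, data in encoded.items())

# UTF-16 uses two bytes per character here, so its bytes differ
utf16_differs = text.encode("utf-16-le") != encoded["ascii"]

print(all_identical, round_trip_ok, utf16_differs)
```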
But if you need other characters, you need an encoding that can handle them. For Western languages, your choices are generally:
- ISO-8859-1 (Latin-1)
- UTF-8
The difference between the two is:
- For Western European accented characters, UTF-8 requires 2 bytes while Latin-1 requires only 1 byte.
- But other characters can't be represented in Latin-1 at all. UTF-8 can represent every possible Unicode character.
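You can check both points from Python, which ships codecs for both encodings. The helper name encoded_length is made up for this sketch, and the sample characters are just examples I chose.

```python
def encoded_length(text, encoding):
    """Return the number of bytes needed, or None if the text can't be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None

# e, e with acute accent, euro sign, and a CJK character
samples = ["e", "\u00e9", "\u20ac", "\u65e5"]

# Map each character to (bytes in Latin-1, bytes in UTF-8)
sizes = {
    ch: (encoded_length(ch, "latin-1"), encoded_length(ch, "utf-8"))
    for ch in samples
}

for ch, (latin1, utf8) in sizes.items():
    print(f"U+{ord(ch):04X}: latin-1={latin1}, utf-8={utf8}")
```

A None in the Latin-1 column means the character simply has no representation in that encoding.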