tags:
views: 47
answers: 2

Hi,

I am going to be building an application that will be used by people all over Europe. I need to know which collation and character set are best suited for user-entered data, or whether I should make a separate table for each language. A link to an article explaining this would be great.

Thanks :)

+4  A: 

Character set, without doubt, UTF-8. Collation, I am not sure there is a good answer to that, but you might want to read this report.

Amadan
It's big, but I will read it :) Thanks for that. I believe Unicode is really good for performance, but that is not the priority when you are faced with different characters.
Oliver Bayes-Shelton
What I was going to say. Also useful: http://forums.mysql.com/read.php?103,187048,188748#msg-188748
Pekka
Collations control how sorting and searching deals with special characters, e.g. whether they are "normalized" (`È` > `E`) or treated as separate entities.
Pekka
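To make Pekka's point concrete, here is a rough Python sketch (my own illustration, not anything MySQL does internally) of what an accent-normalizing collation amounts to when comparing and sorting:

```python
import unicodedata

def fold_accents(s: str) -> str:
    # Decompose each character (e.g. "È" -> "E" + combining grave accent),
    # then drop the combining marks -- roughly what an accent-insensitive
    # collation does before comparing strings.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(fold_accents("È"))                           # "E"
print(fold_accents("Ève") == fold_accents("Eve"))  # True
```

Real collations also handle case folding and language-specific rules (German `ß`, Swedish `å`/`ä`/`ö` sorting after `z`, etc.), which this sketch ignores.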
@Pekka: I know what collation is, but there is no "best" - no one collation will work on all European languages. That's what I meant.
Amadan
@Amadan yup. My comments were directed at the OP for clarification, not you, sorry I didn't point that out.
Pekka
Ah, sorry :) I misread you. Have an upvote. :)
Amadan
+1  A: 

Unicode is a very large character set including nearly all characters from nearly all languages.

There are a number of ways to store Unicode text as a sequence of bytes - these ways are called encodings. All Unicode encodings (well, all complete Unicode encodings) can store all Unicode text as a sequence of bytes, in some format - but the number of bytes that any given piece of text takes will depend on the encoding used.
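As a quick illustration of that last point (a Python sketch of my own, not part of the original answer), the same string comes out to different byte lengths under different Unicode encodings:

```python
text = "Müller"  # 6 characters, one of them outside ASCII

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    # Every complete Unicode encoding can represent this text;
    # only the resulting byte count differs.
    print(encoding, len(text.encode(encoding)))
    # utf-8 -> 7 bytes, utf-16-le -> 12 bytes, utf-32-le -> 24 bytes
```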

UTF-8 is a Unicode encoding that is optimized for English and other languages which use very few characters outside the Latin alphabet. UTF-16 is a Unicode encoding which is possibly more appropriate for text in a variety of European languages. Java and .NET store all text in-memory (the String class) as UTF-16 encoded Unicode.

Justice
Perfect, thank you very much.
Oliver Bayes-Shelton
If you're limited to Europe, UTF-8 is better than UTF-16, space-wise. Only the Cyrillic countries will use many multibyte characters, and they are in the minority in Europe. In all other countries, base ASCII (<128) characters significantly outnumber the "weird" characters. (Speed-wise, UTF-16 always makes more sense.) Source: I'm a linguist from one of the non-English European countries.
Amadan
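Amadan's space argument can be checked with a small Python sketch (the sample strings are my own, purely illustrative):

```python
samples = {
    "English": "character set",
    "German": "Zeichensätze für Müller",
    "Russian": "наборы символов",
}

for language, text in samples.items():
    utf8_len = len(text.encode("utf-8"))
    utf16_len = len(text.encode("utf-16-le"))
    # Latin-script text is much smaller in UTF-8 (mostly 1 byte per
    # character); Cyrillic text ends up roughly the same size in both
    # encodings (2 bytes per letter either way).
    print(f"{language}: utf-8={utf8_len} bytes, utf-16={utf16_len} bytes")
```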
Also, you're splitting hairs here: for most practical purposes (including, I strongly suspect, the OP's), the distinction between character set and character encoding is trivial. The question might be rephrased as: what to put for the `CHARACTER SET` option in MySQL. If you put `UTF8`, MySQL will correctly assume you mean the Unicode set with the UTF-8 encoding.
Amadan
-1 Your explanation of what an encoding is is good, but your UTF-8 claim is incorrect. UTF-8 is not limited to European characters - you probably mean ISO-8859-1. UTF-8 is a variable-length encoding that is, to my knowledge, able to map all or most character sets in existence. UTF-8 is the accepted standard encoding for web sites and e-mail because it has backwards compatibility with ASCII. MySQL doesn't support UTF-16 yet. Also, the answer does not deal with the more complex issue of *collation*.
Pekka
@Pekka: Justice did not say UTF-8 would be *limited* to European characters, he said it is *optimized*. However, only 7-bit ASCII characters are stored as one byte. Other Latin (and Cyrillic) characters are encoded with 2 bytes. Many Asian characters require 3 bytes.
PauliL
@PauliL Yes, but "optimized" isn't true either. UTF-8 has no European bias except that the basic ASCII set can be displayed using one byte - which is an *English* bias but will not work for any European language - `é` or `ü` or `ä` are already outside the 127 characters.
Pekka
Forgot about the Greeks. Cyrillic (Serbia, Montenegro, Macedonia, Bulgaria, Bosnia, Moldova, Belarus, Russia, Ukraine, that I can remember) and Greek (Greece).
Amadan