tags:
views: 47
answers: 2

Hi,

I am going to be building an application that will be used by people all over Europe. I need to know which collation and character set are best suited for user-entered data, or whether I should make a separate table for each language. A link to an article explaining this would be great.

Thanks :)

+4  A: 

Character set, without doubt, UTF-8. Collation, I am not sure there is a good answer to that, but you might want to read this report.

Amadan
It's big, but I will read it :) Thanks for that. I believe Unicode is really good for performance, but that is not the priority when you are faced with different characters.
Oliver Bayes-Shelton
What I was going to say. Also useful: http://forums.mysql.com/read.php?103,187048,188748#msg-188748
Pekka
Collations control how sorting and searching deals with special characters, e.g. whether they are "normalized" (`È` > `E`) or treated as separate entities.
Pekka
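To make Pekka's point concrete, here is a rough Python sketch (my own illustration, not anything MySQL does internally) of what an accent-normalizing collation amounts to when comparing and sorting:

```python
import unicodedata

def fold_accents(s: str) -> str:
    # Decompose each character (e.g. "È" -> "E" + combining grave accent),
    # then drop the combining marks -- roughly what an accent-insensitive
    # collation does before comparing strings.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(fold_accents("È"))                           # "E"
print(fold_accents("Ève") == fold_accents("Eve"))  # True
```

Real collations also handle case folding and language-specific rules (German `ß`, Swedish `å`/`ä`/`ö` sorting after `z`, etc.), which this sketch ignores.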
@Pekka: I know what collation is, but there is no "best" - no one collation will work on all European languages. That's what I meant.
Amadan
@Amadan yup. My comments were directed at the OP for clarification, not you, sorry I didn't point that out.
Pekka
Ah, sorry :) I misread you. Have an upvote. :)
Amadan
+1  A: 

Unicode is a very large character set including nearly all characters from nearly all languages.

There are a number of ways to store Unicode text as a sequence of bytes - these ways are called encodings. All Unicode encodings (well, all complete Unicode encodings) can store all Unicode text as a sequence of bytes, in some format - but the number of bytes that any given piece of text takes will depend on the encoding used.
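As a quick illustration of that last point (a Python sketch of my own, not part of the original answer), the same string comes out to different byte lengths under different Unicode encodings:

```python
text = "Müller"  # 6 characters, one of them outside ASCII

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    # Every complete Unicode encoding can represent this text;
    # only the resulting byte count differs.
    print(encoding, len(text.encode(encoding)))
    # utf-8 -> 7 bytes, utf-16-le -> 12 bytes, utf-32-le -> 24 bytes
```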

UTF-8 is a Unicode encoding that is optimized for English and other languages which use very few characters outside the Latin alphabet. UTF-16 is a Unicode encoding which is possibly more appropriate for text in a variety of European languages. Java and .NET store all text in-memory (the String class) as UTF-16 encoded Unicode.

Justice
Perfect, thank you very much.
Oliver Bayes-Shelton
If you're limited to Europe, UTF-8 is better than UTF-16, space-wise. Only the Cyrillic countries will use many multibyte characters, and they are in the minority in Europe. In all other countries, base ASCII (<128) characters significantly outnumber the "weird" characters. (Speed-wise, UTF-16 always makes more sense.) Source: I'm a linguist from one of the non-English European countries.
Amadan
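Amadan's space argument can be checked with a small Python sketch (the sample strings are my own, purely illustrative):

```python
samples = {
    "English": "character set",
    "German": "Zeichensätze für Müller",
    "Russian": "наборы символов",
}

for language, text in samples.items():
    utf8_len = len(text.encode("utf-8"))
    utf16_len = len(text.encode("utf-16-le"))
    # Latin-script text is much smaller in UTF-8 (mostly 1 byte per
    # character); Cyrillic text ends up roughly the same size in both
    # encodings (2 bytes per letter either way).
    print(f"{language}: utf-8={utf8_len} bytes, utf-16={utf16_len} bytes")
```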
Also, you're splitting hairs here: for most practical purposes (including, I strongly suspect, the OP's), the distinction between character set and character encoding is trivial. The question might be rephrased as: what to put for the `CHARACTER SET` option in MySQL. If you put `UTF8`, MySQL will correctly assume you mean the Unicode set with the UTF-8 encoding.
Amadan
-1 Your explanation of what an encoding is is good, but your UTF-8 claim is incorrect. UTF-8 is not limited to European characters - you probably mean ISO-8859-1. UTF-8 is a variable-length encoding that is, to my knowledge, able to map all or most character sets in existence. UTF-8 is the accepted standard encoding for web sites and e-mail because it has backwards compatibility with ASCII. MySQL doesn't support UTF-16 yet. Also, the answer does not deal with the more complex issue of *collation*.
Pekka
@Pekka: Justice did not say UTF-8 would be *limited* to European characters, he said it is *optimized*. However, only 7-bit ASCII characters are stored as one byte. Other Latin (and Cyrillic) characters are encoded with 2 bytes. Many Asian characters require 3 bytes.
PauliL
@PauliL Yes, but "optimized" isn't true either. UTF-8 has no European bias except that the basic ASCII set can be displayed using one byte - which is an *English* bias but will not work for any European language - `é` or `ü` or `ä` are already outside the 127 characters.
Pekka
Forgot about the Greeks. Cyrillic (Serbia, Montenegro, Macedonia, Bulgaria, Bosnia, Moldova, Belarus, Russia, Ukraine, that I can remember) and Greek (Greece).
Amadan