views:

67

answers:

5

what is collation usage for database ? well for html utf-8 i know a bit, like for displaying other language type. but whats for in database ? im useing latin-1 ( default ), my friends told me that to use utf insted, when i ask why, they just dont know and said, others use it. so im questioning what does collation really does ? does it effect speed or something like that ?

sorry for the newbie question. btw best-practices tag is gone. :|

Thanks

Adam ramadhan

+1  A: 

See this on Answers about collations. About character sets and codepages see wikipedia. FYI latin1 is suitable for english and west european countries. UTF-8 is suitable for a universal application that may contain for example Chinese, or Arabic text.

renick
does it effect speed or anything else ? thanks btw.
Adam Ramadhan
certain languages need more than one byte to represent a character. So UTF-8 is a variable-length character encoding. This has performance implications (both space/memory and speed). If you really need international support though there is no alternative.
renick
if im a startup what will you choose ? btw if im not supporting internaional (english) latin1 is just fine ? thanks
Adam Ramadhan
If you only handle English then any codepage (including latin1 and utf-8) will do.
renick
I'll bet you €20 that someday you'll encounter non-ISO-8859-1 characters in English text.
dan04
@dan04 hahaha, ok i think its better useing utf_ then :|
Adam Ramadhan
@dan04 I agree unicode is mandatory in this day and age. 8bit charsets are a relic but .. you know relics have a way of hanging on indefinitely in our business.
renick
I know what you mean. Even EBCDIC isn't dead yet, as I know from personal experience :(
dan04
+1  A: 

Database collation controls how strings are compared / sorted - and at a lower level how the stored character data is interpreted. Latin-1 will allow strings in the latin-1 code page to be stored / compared. If you need to handle e.g. Japanese text then UTF-8 is a much better choice.

Will A
so it will effect the storage ?
Adam Ramadhan
Yes - it will have an effect on the storage and the speed of data access - storing a typical Unicode character will require > 1 byte of storage storing most 'usual' Western characters will require just a single byte.
Will A
whats the smallest and fastest coll ? latin1 ? i bet the most support chars + heavy is utf ( its mili sec difference i bet? )
Adam Ramadhan
ups i think im wrong http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.nls.doc/doc/c0053263.html >The collation you choose can significantly impact the performance
Adam Ramadhan
A: 

Strictly speaking, the database collation will effect sorting of character data. The reason to go with UTF-8, would be because you're storing UTF-8 data.

George Marian
+1  A: 

I think you have confused the term charset and collation

http://dev.mysql.com/doc/refman/5.0/en/charset-general.html

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.

latin1 and utf8 are valid charsets in MySQL, while latin1_swedish_ci and utf8_general_ci are example collations.

tszming
+3  A: 

MySQL confuses the issue by having collations named after character encodings. They're separate concepts.

A collation determines how the relational operators (<, >, etc.) and ORDER BY clauses sort strings. Issues considered by collations are:

  • Are uppercase and lowercase letters considered equivalent?
  • Is whitespace significant?
  • Do accented letters sort equal to the unaccented versions, after the unaccented versions, or at the end?
  • Are digraphs like "ch" and "ll" sorted like separate letters?
  • Are Unicode compatibility equivalents like AᴬⒶA treated the same?

Some of these depend on the language.

A character encoding determines how text values get converted to and from byte sequences. For a good introduction, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

There are hundreds of different character encodings, most of the specific to a certain combination of operating system and locale. Most of them are supersets of US-ASCII, so if you're damn sure your data will be ASCII-only, it doesn't matter what encoding you use.

But if you need other characters, you need an encoding that can handle them. For Western languages, your choices are generally:

The difference between the two is:

  • For Western European accented characters, UTF-8 requires 2 bytes while Latin-1 requires only 1 byte.
  • But other characters can't be represented in Latin-1 at all. UTF-8 can represent every possible Unicode character.
dan04