database Collation diffrence utf ? latin ? etc ?

views:

answers:

+1 Q:

database Collation diffrence utf ? latin ? etc ?

what is collation usage for database ? well for html utf-8 i know a bit, like for displaying other language type. but whats for in database ? im useing latin-1 ( default ), my friends told me that to use utf insted, when i ask why, they just dont know and said, others use it. so im questioning what does collation really does ? does it effect speed or something like that ?

sorry for the newbie question. btw best-practices tag is gone. :|

Thanks

Adam ramadhan

+1 A:

See this on Answers about collations. About character sets and codepages see wikipedia. FYI latin1 is suitable for english and west european countries. UTF-8 is suitable for a universal application that may contain for example Chinese, or Arabic text.

renick 2010-08-07 06:46:39

does it effect speed or anything else ? thanks btw.

Adam Ramadhan 2010-08-07 07:14:18

certain languages need more than one byte to represent a character. So UTF-8 is a variable-length character encoding. This has performance implications (both space/memory and speed). If you really need international support though there is no alternative.

renick 2010-08-07 07:21:48

if im a startup what will you choose ? btw if im not supporting internaional (english) latin1 is just fine ? thanks

Adam Ramadhan 2010-08-07 07:26:19

If you only handle English then any codepage (including latin1 and utf-8) will do.

renick 2010-08-07 07:34:35

I'll bet you €20 that someday you'll encounter non-ISO-8859-1 characters in English text.

dan04 2010-08-07 07:45:20

@dan04 hahaha, ok i think its better useing utf_ then :|

Adam Ramadhan 2010-08-07 07:47:07

@dan04 I agree unicode is mandatory in this day and age. 8bit charsets are a relic but .. you know relics have a way of hanging on indefinitely in our business.

renick 2010-08-07 08:00:50

I know what you mean. Even EBCDIC isn't dead yet, as I know from personal experience :(

dan04 2010-08-07 08:29:23

+1 A:

Database collation controls how strings are compared / sorted - and at a lower level how the stored character data is interpreted. Latin-1 will allow strings in the latin-1 code page to be stored / compared. If you need to handle e.g. Japanese text then UTF-8 is a much better choice.

Will A 2010-08-07 06:47:09

so it will effect the storage ?

Adam Ramadhan 2010-08-07 07:11:34

Yes - it will have an effect on the storage and the speed of data access - storing a typical Unicode character will require > 1 byte of storage storing most 'usual' Western characters will require just a single byte.

Will A 2010-08-07 07:21:28

whats the smallest and fastest coll ? latin1 ? i bet the most support chars + heavy is utf ( its mili sec difference i bet? )

Adam Ramadhan 2010-08-07 07:28:01

ups i think im wrong http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.nls.doc/doc/c0053263.html >The collation you choose can significantly impact the performance

Adam Ramadhan 2010-08-07 07:34:30

Strictly speaking, the database collation will effect sorting of character data. The reason to go with UTF-8, would be because you're storing UTF-8 data.

George Marian 2010-08-07 06:47:36

+1 A:

I think you have confused the term charset and collation

http://dev.mysql.com/doc/refman/5.0/en/charset-general.html

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.

latin1 and utf8 are valid charsets in MySQL, while latin1_swedish_ci and utf8_general_ci are example collations.

tszming 2010-08-07 07:35:10

+3 A:

MySQL confuses the issue by having collations named after character encodings. They're separate concepts.

A collation determines how the relational operators (<, >, etc.) and ORDER BY clauses sort strings. Issues considered by collations are:

Are uppercase and lowercase letters considered equivalent?
Is whitespace significant?
Do accented letters sort equal to the unaccented versions, after the unaccented versions, or at the end?
Are digraphs like "ch" and "ll" sorted like separate letters?
Are Unicode compatibility equivalents like AᴬⒶＡ treated the same?

Some of these depend on the language.

A character encoding determines how text values get converted to and from byte sequences. For a good introduction, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

There are hundreds of different character encodings, most of the specific to a certain combination of operating system and locale. Most of them are supersets of US-ASCII, so if you're damn sure your data will be ASCII-only, it doesn't matter what encoding you use.

But if you need other characters, you need an encoding that can handle them. For Western languages, your choices are generally:

Single-byte encodings, of which the most common is ISO-8859-1. I think MySQL's Latin1 encoding is actually windows-1252, which is similar.
UTF-8, which is very popular these days.

The difference between the two is:

For Western European accented characters, UTF-8 requires 2 bytes while Latin-1 requires only 1 byte.
But other characters can't be represented in Latin-1 at all. UTF-8 can represent every possible Unicode character.

dan04 2010-08-07 08:26:28

ansaurus

tags:

views:

answers:

database Collation diffrence utf ? latin ? etc ?

related questions