views: 720
answers: 5
What should be used, and when? Is it always better to use UTF-8, or does ISO-8859-1 still have importance in specific conditions?

Is the character set related to geographic region?


Edit:

Is there any benefit to putting @charset "utf-8"; at the top of a CSS file,

or to declaring the charset on the link element, like <link type="text/css; charset=utf-8" rel="stylesheet" href=".." />?

I found this on the subject:

If DreamWeaver adds the tag when you add embedded style to the document, that is a bug in DreamWeaver. From the W3C FAQ:

"For style declarations embedded in a document, @charset rules are not needed and must not be used."

The charset specification has been part of CSS since version 2.0 (May 1998), so if you have a charset specification in a CSS file and Safari can't handle it, that's a bug in Safari.

And should I add accept-charset to the form, like this?

<form action="/action" method="post" accept-charset="utf-8">

And which should be used if I use an XHTML doctype:

<?xml version="1.0" encoding="UTF-8"?>

or

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+12  A: 

G'day,

I'd highly recommend having a read of Joel's excellent article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

It'll help you understand what's going on.

HTH

cheers,

Rob Wells
Thanks, I will read this article.
metal-gear-solid
+1  A: 

UTF-8 is supported everywhere on the web; only specific legacy applications lack support for it. You should always use UTF-8 if you can.

The downside is that for languages such as Chinese, UTF-8 takes more space than, say, UTF-16. But if you don't plan on supporting Chinese, or even if you do, UTF-8 is fine.

The only real con of UTF-8 is that it can take more space than other encodings, but for Western languages the overhead is negligible: only characters outside ASCII cost extra bytes, and those you can live with. We are in 2009, after all. ;)
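
The trade-off is easy to measure. A minimal Python 3 sketch (the sample strings are arbitrary picks, not from the answer):

```python
# Compare byte counts for the same text in UTF-8 and UTF-16 (Python 3).
english = "Hello, world"
chinese = "你好世界"  # four CJK characters

for label, text in (("english", english), ("chinese", chinese)):
    print(label,
          "utf-8:", len(text.encode("utf-8")),       # 12 vs 12 bytes
          "utf-16:", len(text.encode("utf-16-le")))  # 24 vs 8 bytes
```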

Tor Valamo
Strictly speaking that's not the only con. Another con is that it's a variable-length encoding and some old code still stumbles across that fact.
Joachim Sauer
Yes, but as I said, I'm speaking about utf-8 on the web, and not in programming. ;)
Tor Valamo
+1  A: 

If you want world domination, use UTF-8 all the way, because it covers every human character available in the world, including Asian, Cyrillic, Hebrew, Arabic, Greek and so on, while ISO-8859-1 is restricted to Latin characters. You don't want Mojibake.
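
For illustration, Mojibake is exactly what you get when UTF-8 bytes are decoded with the wrong charset. A minimal Python 3 sketch:

```python
# Mojibake demo (Python 3): UTF-8 bytes misread as ISO-8859-1.
text = "café"
utf8_bytes = text.encode("utf-8")        # b'caf\xc3\xa9'
print(utf8_bytes.decode("iso-8859-1"))   # prints 'cafÃ©' -- Mojibake
```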

BalusC
But if some character isn't showing in UTF-8 on the website, should I change the charset from UTF-8 to ISO-8859 just for that character, or is there some other solution?
metal-gear-solid
@BalusC, actually you have to go to UTF-16 to be able to cover "every human character available in the world."
Rob Wells
@Rob Wells - So should we use UTF-16?
metal-gear-solid
@Rob: No, UTF-8 has every human character. UTF-16 merely encodes some languages, such as Chinese, more compactly; the code points are the same. UTF-16 is also a very fragile encoding, because it doesn't know when there is an error in the byte stream.
Tor Valamo
This is a very rare case as UTF-8 covers the same codepoints as the characters of ISO-8859-1 (but NOT all of the other ISO-8859-x sets!). Just use UTF-8 all the way and convert the "bad" characters if necessary. In terms of web development you need to ensure at least the following: 1) save source code files in UTF-8; 2) set the HTTP response header to UTF-8; 3) set the HTTP request header to UTF-8 (if not set by the client yet); 4) set database tables to UTF-8.
BalusC
Oh, and 5) read/write local text files using UTF-8 (see the sketch below). I am not sure what your target language is, but if it is Java, you can find more background information, practical examples and detailed solutions here: http://balusc.blogspot.com/2009/05/unicode-how-to-get-characters-right.html
BalusC
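
As a hedged illustration of points 2) and 5) from this checklist, a minimal Python 3 sketch (the WSGI handler and file name are hypothetical, not from the thread):

```python
# Hypothetical WSGI app illustrating points 2) and 5) above.

def app(environ, start_response):
    # 2) declare the charset explicitly in the HTTP response header
    start_response("200 OK",
                   [("Content-Type", "text/html; charset=utf-8")])
    return ["Prix: 10 €".encode("utf-8")]

# 5) read/write local text files with an explicit encoding,
# never the platform default
with open("page.html", "w", encoding="utf-8") as f:
    f.write("Prix: 10 €")
```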
@BalusC: No, ALL ISO-8859-x characters, for any value of x, are also Unicode characters. All Unicode characters have a number/codepoint, and UTF-8 is just a variable-length encoding of that number. Therefore it follows that all of the +/- 800 characters in the different ISO-8859-x encodings have a UTF-8 encoding.
MSalters
@MSalters: Uh, that wasn't the point. I was talking about the **character** which is represented by the codepoint.
BalusC
In that case UTF-8 is irrelevant; it's merely an encoding. It encodes all Unicode characters. Each ISO-8859-x character set is a 256-character subset of Unicode; therefore each character from any ISO-8859-x has a Unicode codepoint, and therefore a UTF-8 encoding. This directly contradicts your "UTF-8 covers the same codepoints as the characters of ISO-8859-1 (but NOT all of the other ISO-8859-x sets!)" statement. If you still doubt it, please name one character from any ISO-8859 set that is "not covered by UTF-8".
MSalters
@MSalters: This is a misunderstanding. The characters which are represented by the codepoints in ISO-8859-1 are exactly the same as in UTF-8. In ISO-8859-15, however, eight codepoints got a different character; e.g. codepoint `0xA4` became the euro sign `€` instead of the generic currency sign `¤`.
BalusC
Ohoh. You're sorely mistaken then about UTF-8. To use your same example, 0xA4 is NOT a valid UTF-8 character. It can only be the second, third or fourth byte of a UTF-8 sequence. For instance U+20A4 `₤` is the three-byte UTF-8 sequence 0xE2, 0x82, 0xA4, and the currency sign U+00A4 `¤` is the two-byte UTF-8 sequence 0xC2, 0xA4. (It's a coincidence that the 0xA4 repeats; U+00E4 is NOT 0xC2, 0xE4, for instance.)
MSalters
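
Both byte sequences, and the ISO-8859-15 difference mentioned earlier, are easy to verify. A minimal Python 3 sketch:

```python
# Verify the byte sequences above (Python 3).
print("¤".encode("utf-8"))            # b'\xc2\xa4'      -- U+00A4, two bytes
print("₤".encode("utf-8"))            # b'\xe2\x82\xa4'  -- U+20A4, three bytes

# The single byte 0xA4 decodes differently per ISO-8859 variant:
print(b"\xa4".decode("iso-8859-1"))   # '¤' (generic currency sign)
print(b"\xa4".decode("iso-8859-15"))  # '€' (euro sign)
```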
Sigh. Yes, I know that UTF-8 is multibyte, but I wasn't talking about that at all.
BalusC
Well, that's the defining characteristic of the UTF-8 encoding. Similarly, UTF-16 is a multi-word encoding of Unicode. And UTF-8 being **multi** byte is precisely why it's possible that it covers all ISO-8859-x single-byte characters sets, not just -1 - see your comment on Dec 12th.
MSalters
+9  A: 

Unicode is taking over and has already surpassed all others. I suggest you hop on the train right now.

Note that there are several flavors of Unicode. Joel Spolsky gives an overview.

Unicode is winning

nes1983
The majority of the Web is UTF-8 now: http://w3techs.com/technologies/overview/character_encoding/all
dan04
A: 
  • ISO-8859-1 is a great encoding to use when space is at a premium and you are only ever going to encode characters from the basic Latin languages it supports. And you are never, ever going to have to contemplate upgrading your application to support non-Latin languages.

  • UTF-8 is a fantastic way to (a) reuse the large body of existing code libraries that assume 8 bits per character, or (b) be a euro snob. UTF-8 encodes standard ASCII in 1 byte per character, Latin-1 supplement characters in 2 bytes, and most Eastern European and Asian scripts in 3 bytes per character. It goes up to 4 bytes per character only for characters outside the Basic Multilingual Plane, such as ancient scripts (see the sketch after this list).

  • UTF-16 is a great way to start a new codebase from scratch. It's completely culture neutral - everyone gets a fair-handed 2 bytes per character. It does need 4 bytes per character for ancient/exotic scripts, which means that, in the worst case, it's as bad as its big brother:

  • UTF-32 is a waste of space.
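
The per-character costs claimed in this list can be checked directly. A minimal Python 3 sketch (the sample characters are arbitrary):

```python
# Bytes per character in each encoding (Python 3; the sample
# characters are arbitrary picks from each range).
samples = [("A", "ASCII"), ("é", "Latin-1"), ("中", "CJK"),
           ("\U000103A0", "outside the BMP")]  # U+103A0, Old Persian

for ch, label in samples:
    print(f"{label:16} utf-8: {len(ch.encode('utf-8'))}  "
          f"utf-16: {len(ch.encode('utf-16-le'))}  "
          f"utf-32: {len(ch.encode('utf-32-le'))}")
```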

Chris Becke
UTF-16 is *culture neutral*? Everyone gets a *fair-handed 2 bytes*? Rather than overlaying cultural value judgments onto the discussion, why not keep it to a concise cost/benefit analysis? To wit: if the characters being encoded are primarily ASCII or Latin, then UTF-16 is a waste of space. If not, then not. Whether it is a "new codebase" is irrelevant.
Cheeso
utf16 has the advantage that you can move a cursor backwards in it. Shouldn't be neglected.
nes1983
UTF-16 is a very bad web encoding, because it is extremely incompatible with every other encoding, and if there is an error in the byte stream, it will not register it and keeps going as if nothing happened, leaving every subsequent character plain wrong. Even one missing byte does this.
Tor Valamo
UTF-16 is "completely culture neutral - everone gets a fair handed 2 bytes per character", except those cultures for which you need 4 bytes per character? Is this a parody of Orwell? :-)
Ken
UTF-32 (and related schemes) takes more space, but less time: random access is O(1), which is why many languages that support full Unicode characters tend to use this internally.
Ken
Niko: advantage over what? Can't you move a cursor backwards in UTF-8 and UTF-32, also?
Ken
@Ken: do they? Both Java and .NET use UTF-16. They don't use UTF-32!
Joachim Sauer
By the way: ISO-8859-1 isn't even enough when you only need latin languages. It doesn't support the Euro sign €, which is pretty darn important. For that you'd need to go to ISO-8859-15 (or better yet: an encoding that can represent all Unicode codepoints such as the UTF-* family)
Joachim Sauer
@Ken: in UTF-32 you can, but in UTF-8 you can't. That's because UTF-8 is a variable-length code: http://en.wikipedia.org/wiki/Variable-length_code. You can only go forward in UTF-8.
nes1983
@Niko: You can go backwards in UTF-8 as well: you just step back over bytes until you reach one that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx). UTF-8 is byte-oriented, so endianness doesn't come into it; moving forward works the same way. It's very solvable - see the sketch below.
Tor Valamo
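
A minimal Python 3 sketch of that backward scan (prev_char_start is a hypothetical helper, not from the thread):

```python
# Move a cursor one character backwards in UTF-8 (Python 3).
# Continuation bytes match the bit pattern 10xxxxxx, so step back
# past them until a lead byte or a plain ASCII byte is found.
def prev_char_start(data: bytes, pos: int) -> int:
    pos -= 1
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

data = "a€b".encode("utf-8")             # b'a\xe2\x82\xacb'
print(prev_char_start(data, len(data)))  # 4 -> start of 'b'
print(prev_char_start(data, 4))          # 1 -> start of '€'
```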
I should probably have said culture agnostic. I just mean that, with a web site, it's very easy for English speakers especially to assume that all users will be happy being restricted to Latin-1.
Chris Becke