views: 720
answers: 5
What should be used, and when? Is it always better to use UTF-8, or does ISO-8859-1 still have importance in specific conditions?

Is the character set related to geographic region?


Edit:

Is there any benefit to putting @charset "utf-8"; at the top of a CSS file,

or to declaring the charset on the link element, like <link type="text/css; charset=utf-8" rel="stylesheet" href=".." />?

I found this on the subject:

If DreamWeaver adds the tag when you add embedded style to the document, that is a bug in DreamWeaver. From the W3C FAQ:

"For style declarations embedded in a document, @charset rules are not needed and must not be used."

The charset specification has been part of CSS since version 2.0 (May 1998), so if you have a charset specification in a CSS file and Safari can't handle it, that's a bug in Safari.

And should I add accept-charset to the form, like this?

<form action="/action" method="post" accept-charset="utf-8">

And which should be used if I use an XHTML doctype:

<?xml version="1.0" encoding="UTF-8"?>

or

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+12  A: 

G'day,

I'd highly recommend having a read of Joel's excellent article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

It'll help you understand what's going on.

HTH

cheers,

Rob Wells
Thanks, I will read this article.
metal-gear-solid
+1  A: 

UTF-8 is supported everywhere on the web; only specific legacy applications lack support for it. You should always use UTF-8 if you can.

The downside is that for languages such as Chinese, UTF-8 takes more space than, say, UTF-16. But if you don't plan on supporting Chinese, or even if you do, UTF-8 is fine.

The only real con of UTF-8 is that it can take more space than other encodings, but for Western languages the overhead is negligible: only characters outside ASCII cost extra bytes, and those you can live with. We are in 2009, after all. ;)
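
The trade-off is easy to measure. A minimal Python 3 sketch (the sample strings are arbitrary picks, not from the answer):

```python
# Compare byte counts for the same text in UTF-8 and UTF-16 (Python 3).
english = "Hello, world"
chinese = "你好世界"  # four CJK characters

for label, text in (("english", english), ("chinese", chinese)):
    print(label,
          "utf-8:", len(text.encode("utf-8")),       # 12 vs 12 bytes
          "utf-16:", len(text.encode("utf-16-le")))  # 24 vs 8 bytes
```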

Tor Valamo
Strictly speaking that's not the only con. Another con is that it's a variable-length encoding and some old code still stumbles across that fact.
Joachim Sauer
Yes, but as I said, I'm speaking about utf-8 on the web, and not in programming. ;)
Tor Valamo
+1  A: 

If you want world domination, use UTF-8 all the way, because it covers every human character available in the world, including Asian, Cyrillic, Hebrew, Arabic, Greek and so on, while ISO-8859-1 is restricted to Latin characters. You don't want Mojibake.
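
For illustration, Mojibake is exactly what you get when UTF-8 bytes are decoded with the wrong charset. A minimal Python 3 sketch:

```python
# Mojibake demo (Python 3): UTF-8 bytes misread as ISO-8859-1.
text = "café"
utf8_bytes = text.encode("utf-8")        # b'caf\xc3\xa9'
print(utf8_bytes.decode("iso-8859-1"))   # prints 'cafÃ©' -- Mojibake
```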

BalusC
But if some character isn't showing in UTF-8 on the website, should I change the charset from UTF-8 to ISO-8859 just for that character, or is there some other solution?
metal-gear-solid
@BalusC, actually you have to go to UTF-16 to be able to cover "every human character available in the world."
Rob Wells
@Rob Wells - So should we use UTF-16?
metal-gear-solid
@Rob: No, UTF-8 has every human character. UTF-16 merely encodes some languages, such as Chinese, more compactly; the code points are the same. UTF-16 is also a very fragile encoding, because it doesn't know when there is an error in the byte stream.
Tor Valamo
This is a very rare case as UTF-8 covers the same codepoints as the characters of ISO-8859-1 (but NOT all of the other ISO-8859-x sets!). Just use UTF-8 all the way and convert the "bad" characters if necessary. In terms of web development you need to ensure at least the following: 1) save source code files in UTF-8; 2) set the HTTP response header to UTF-8; 3) set the HTTP request header to UTF-8 (if not set by the client yet); 4) set database tables to UTF-8.
BalusC
Oh, and 5) read/write local text files using UTF-8 (see the sketch below). I am not sure what your target language is, but if it is Java, you can find more background information, practical examples and detailed solutions here: http://balusc.blogspot.com/2009/05/unicode-how-to-get-characters-right.html
BalusC
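
As a hedged illustration of points 2) and 5) from this checklist, a minimal Python 3 sketch (the WSGI handler and file name are hypothetical, not from the thread):

```python
# Hypothetical WSGI app illustrating points 2) and 5) above.

def app(environ, start_response):
    # 2) declare the charset explicitly in the HTTP response header
    start_response("200 OK",
                   [("Content-Type", "text/html; charset=utf-8")])
    return ["Prix: 10 €".encode("utf-8")]

# 5) read/write local text files with an explicit encoding,
# never the platform default
with open("page.html", "w", encoding="utf-8") as f:
    f.write("Prix: 10 €")
```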
@BalusC: No, ALL ISO-8859-x characters, for any value of x, are also Unicode characters. All Unicode characters have a number/codepoint, and UTF-8 is just a variable-length encoding of that number. Therefore it follows that all of the +/- 800 characters in the different ISO-8859-x encodings have a UTF-8 encoding.
MSalters
@MSalters: Uh, that wasn't the point. I was talking about the **character** which is represented by the codepoint.
BalusC
In that case UTF-8 is irrelevant; it's merely an encoding. It encodes all Unicode characters. Each ISO-8859-x character set is a 256-character subset of Unicode; therefore each character from any ISO-8859-x has a Unicode codepoint, and therefore a UTF-8 encoding. This directly contradicts your "UTF-8 covers the same codepoints as the characters of ISO-8859-1 (but NOT all of the other ISO-8859-x sets!)" statement. If you still doubt it, please name one character from any ISO-8859 set that is "not covered by UTF-8".
MSalters
@MSalters: This is a misunderstanding. The characters which are represented by the codepoints in ISO-8859-1 are exactly the same as in UTF-8. In ISO-8859-15, however, eight codepoints got a different character; e.g. codepoint `0xA4` became the euro sign `€` instead of the generic currency sign `¤`.
BalusC
Ohoh. You're sorely mistaken then about UTF-8. To use your same example, 0xA4 is NOT a valid UTF-8 character. It can only be the second, third or fourth byte of a UTF-8 sequence. For instance U+20A4 `₤` is the three-byte UTF-8 sequence 0xE2, 0x82, 0xA4, and the currency sign U+00A4 `¤` is the two-byte UTF-8 sequence 0xC2, 0xA4. (It's a coincidence that the 0xA4 repeats; U+00E4 is NOT 0xC2, 0xE4, for instance.)
MSalters
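
Both byte sequences, and the ISO-8859-15 difference mentioned earlier, are easy to verify. A minimal Python 3 sketch:

```python
# Verify the byte sequences above (Python 3).
print("¤".encode("utf-8"))            # b'\xc2\xa4'      -- U+00A4, two bytes
print("₤".encode("utf-8"))            # b'\xe2\x82\xa4'  -- U+20A4, three bytes

# The single byte 0xA4 decodes differently per ISO-8859 variant:
print(b"\xa4".decode("iso-8859-1"))   # '¤' (generic currency sign)
print(b"\xa4".decode("iso-8859-15"))  # '€' (euro sign)
```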
Sigh. Yes, I know that UTF-8 is multibyte, but I wasn't talking about that at all.
BalusC
Well, that's the defining characteristic of the UTF-8 encoding. Similarly, UTF-16 is a multi-word encoding of Unicode. And UTF-8 being **multi** byte is precisely why it's possible that it covers all ISO-8859-x single-byte characters sets, not just -1 - see your comment on Dec 12th.
MSalters
+9  A: 

Unicode is taking over and has already surpassed all others. I suggest you hop on the train right now.

Note that there are several flavors of Unicode. Joel Spolsky gives an overview.

Unicode is winning

nes1983
The majority of the Web is UTF-8 now: http://w3techs.com/technologies/overview/character_encoding/all
dan04
A: 
  • ISO-8859-1 is a great encoding to use when space is at a premium and you are only ever going to encode characters from the basic Latin languages it supports. And you are never, ever going to have to contemplate upgrading your application to support non-Latin languages.

  • UTF-8 is a fantastic way to (a) reuse the large body of existing code libraries that assume 8 bits per character, or (b) be a euro snob. UTF-8 encodes standard ASCII in 1 byte per character, Latin-1 supplement characters in 2 bytes, and most Eastern European and Asian scripts in 3 bytes per character. It goes up to 4 bytes per character only for characters outside the Basic Multilingual Plane, such as ancient scripts (see the sketch after this list).

  • UTF-16 is a great way to start a new codebase from scratch. It's completely culture neutral - everyone gets a fair-handed 2 bytes per character. It does need 4 bytes per character for ancient/exotic scripts, which means that, in the worst case, it's as bad as its big brother:

  • UTF-32 is a waste of space.
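
The per-character costs claimed in this list can be checked directly. A minimal Python 3 sketch (the sample characters are arbitrary):

```python
# Bytes per character in each encoding (Python 3; the sample
# characters are arbitrary picks from each range).
samples = [("A", "ASCII"), ("é", "Latin-1"), ("中", "CJK"),
           ("\U000103A0", "outside the BMP")]  # U+103A0, Old Persian

for ch, label in samples:
    print(f"{label:16} utf-8: {len(ch.encode('utf-8'))}  "
          f"utf-16: {len(ch.encode('utf-16-le'))}  "
          f"utf-32: {len(ch.encode('utf-32-le'))}")
```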

Chris Becke
UTF-16 is *culture neutral*? Everyone gets a *fair-handed 2 bytes*? Rather than overlaying cultural value judgments onto the discussion, why not keep it to a concise cost/benefit analysis? To wit: if the characters being encoded are primarily ASCII or Latin, then UTF-16 is a waste of space. If not, then not. Whether it is a "new codebase" is irrelevant.
Cheeso
utf16 has the advantage that you can move a cursor backwards in it. Shouldn't be neglected.
nes1983
UTF-16 is a very bad web encoding, because it is extremely incompatible with every other encoding, and if there is an error in the byte stream, it will not register it and keeps going as if nothing happened, leaving every subsequent character plain wrong. Even one missing byte does this.
Tor Valamo
UTF-16 is "completely culture neutral - everone gets a fair handed 2 bytes per character", except those cultures for which you need 4 bytes per character? Is this a parody of Orwell? :-)
Ken
UTF-32 (and related schemes) takes more space, but less time: random access is O(1), which is why many languages that support full Unicode characters tend to use this internally.
Ken
Niko: advantage over what? Can't you move a cursor backwards in UTF-8 and UTF-32, also?
Ken
@Ken: do they? Both Java and .NET use UTF-16. They don't use UTF-32!
Joachim Sauer
By the way: ISO-8859-1 isn't even enough when you only need latin languages. It doesn't support the Euro sign €, which is pretty darn important. For that you'd need to go to ISO-8859-15 (or better yet: an encoding that can represent all Unicode codepoints such as the UTF-* family)
Joachim Sauer
@Ken: in UTF-32 you can, but in UTF-8 you can't. That's because UTF-8 is a variable-length code: http://en.wikipedia.org/wiki/Variable-length_code. You can only go forward in UTF-8.
nes1983
@Niko: You can go backwards in UTF-8 as well: you just step back over bytes until you reach one that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx). UTF-8 is byte-oriented, so endianness doesn't come into it; moving forward works the same way. It's very solvable - see the sketch below.
Tor Valamo
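
A minimal Python 3 sketch of that backward scan (prev_char_start is a hypothetical helper, not from the thread):

```python
# Move a cursor one character backwards in UTF-8 (Python 3).
# Continuation bytes match the bit pattern 10xxxxxx, so step back
# past them until a lead byte or a plain ASCII byte is found.
def prev_char_start(data: bytes, pos: int) -> int:
    pos -= 1
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

data = "a€b".encode("utf-8")             # b'a\xe2\x82\xacb'
print(prev_char_start(data, len(data)))  # 4 -> start of 'b'
print(prev_char_start(data, 4))          # 1 -> start of '€'
```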
I should probably have said culture agnostic. I just mean that, with a web site, it's very easy for English speakers especially to assume that all users will be happy being restricted to Latin-1.
Chris Becke