What's your opinion on encoding accented and special characters in XHTML and XML?

  • Do you convert each and every non-US-ASCII character to a named entity?
  • Do you use ISO-8859-x or Win-125x and encode anything else as entities?
  • Or do you write everything directly in UTF-8, without bothering about entities?

Please elaborate on which and why.

+7  A: 

I can't tell you exactly why this happens, but in my five years of using UTF-8 for every web page (I mostly use Cyrillic and Baltic characters), I haven't yet seen a single character displayed incorrectly.

Sergej Andrejev
+3  A: 

Don't bother with named entities. They are good for when you need to manually edit HTML files and want to be able to read the characters, and don't have a UTF-8 editor. But otherwise, UTF-8 is the way to go.

Ned Batchelder
A: 

Speaking from an American point of view: where almost all text is US-ASCII, with a few symbols and accented characters, I strongly recommend using numeric or named entities.

The reason is simple: it's one less thing to worry about. You don't need to ensure that your webserver is set to advertise the same encoding as your content. Because sooner or later you'll get someone editing pages on Windows, using Cp1252 encoding, and someone else working on Linux with ISO-8859, and although the two are close they're not the same. And if the webserver is configured as UTF-8, they're both broken.
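A minimal Python sketch of the mismatch described above. The sample string is made up for illustration. It shows that Cp1252 and ISO-8859-1 agree on an accented letter but not on the euro sign, and that the Cp1252 bytes are not valid UTF-8.

```python
# Illustrative sample text; not taken from any real page.
text = "café €5"

# What an editor on Windows saving as Cp1252 would write to disk.
cp1252_bytes = text.encode("cp1252")

# ISO-8859-1 has no euro sign at all, so the same text cannot be saved in it.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# The accented letter is the same single byte (0xE9) in both encodings...
same_e_acute = "é".encode("cp1252") == "é".encode("iso-8859-1")

# ...but a client told to expect UTF-8 cannot decode those bytes.
try:
    cp1252_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Entities sidestep this only because the file then contains nothing but ASCII bytes, which all three encodings share.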

That said, I gave Sergej +1, because you don't want a mass of entities if you're working with text that isn't primarily ASCII.

kdgregory
+1 there is something to it. I've got Linux with everything UTF-8 by default, but the web designers encode everything in ISO-8859-1. But then the 'autodetect encoding' option in editors comes in handy :-)
vartec
The only way this holds up is if you are building static web pages and you have direct contact with everyone involved. Even then, you still have to deal with people who do not convert to entities, which is just as much of a headache to explain as how to save files in UTF-8. For regular web applications this attitude is dangerous, because you may end up with a link in the chain that is not encoding-aware, leaving all user data irreparably corrupted. Regardless of whether you choose to use entities, you need to get your encodings straight or you are in for a world of hurt.
dasil003
Part of making a development team work is communication. However, it's usually easier to communicate within the team than outside it, and in many companies deployment is managed separately from development. As for managing encodings through the web-app stack: if your platform doesn't do this for you, you're in a world of hurt, period. But hey, thanks for the late downvote.
kdgregory
+2  A: 

I always write in UTF-8 directly. The only issue I've ever had was a server that forced an ISO encoding in its headers.
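As a rough illustration of what that kind of header mismatch does (the sample word is an assumption, not from the original post), here is what happens when UTF-8 bytes are decoded as ISO-8859-1:

```python
# UTF-8 bytes as they would be stored on disk and sent by the server.
body = "café".encode("utf-8")

# A client that trusts a "charset=ISO-8859-1" header decodes each byte
# as a separate character, so the two-byte "é" turns into two characters.
misread = body.decode("iso-8859-1")

# Decoding with the encoding the file was actually written in recovers the text.
correct = body.decode("utf-8")

print(misread)  # cafÃ©
print(correct)  # café
```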

Alekc
+5  A: 

UTF-8.

It was designed exactly with the purpose of solving the problems kdgregory mentions that occur with UTF-16 and it does it fantastically. Pretty much every editor today (including Notepad) has support for UTF-8, and it is also a default encoding for XML.
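A small sketch with Python's standard `xml.etree.ElementTree`, using a made-up `<name>` element, of how an XML parser treats the three options from the question when the document carries no encoding declaration:

```python
import xml.etree.ElementTree as ET

# No XML declaration: the parser assumes UTF-8 for these bytes.
raw = ET.fromstring("<name>café</name>".encode("utf-8")).text

# A numeric character reference yields the same character.
numeric = ET.fromstring(b"<name>caf&#233;</name>").text

# Named entities other than the five predefined ones (lt, gt, amp, quot, apos)
# are not declared in plain XML, so this is rejected without a DTD.
try:
    ET.fromstring(b"<name>caf&eacute;</name>")
    named_ok = True
except ET.ParseError:
    named_ok = False
```

XHTML only accepts `&eacute;` because its DTD declares it; generic XML does not.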

Nemanja Trifunovic
+1  A: 

Always use UTF-8 for your site.

  1. Modern frameworks and database servers have no problems supporting UTF-8.

  2. You will avoid the situation where someone enters text in a different language than expected and you get ?????? instead of the Unicode characters, or, even worse, the page template fails to render at all.

  3. Even if your site targets a single language with no multilingual interface (now or in the future), someone may want to publish material on your site and get comments from friends in their own language.
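The ?????? effect from point 2 can be reproduced in a few lines of Python. This is only a sketch: the comment text is invented, and a real stack may lose the characters at a different layer (database, driver, template).

```python
# A comment written in a language the site did not expect.
comment = "Привет"

# A layer limited to ISO-8859-1 that substitutes whatever it cannot represent.
stored = comment.encode("iso-8859-1", errors="replace")

# The original characters are gone; only question marks come back.
recovered = stored.decode("iso-8859-1")

# With UTF-8 end to end, the text survives the round trip.
round_trip = comment.encode("utf-8").decode("utf-8")

print(recovered)   # ??????
print(round_trip)  # Привет
```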

Regards, Pavel

se_pavel
A: 

I personally always use UTF-8. It is well supported: every language, OS, and browser supports it somehow. Entities are nice to display, but they are a pain in the neck to edit. Named entities can refer to a lot of characters, but they only cover occidental character sets. For Asian languages you will have to fall back to hex entities, and that is not pretty. Hexadecimal entities also have to be decoded or encoded using the Unicode tables anyway, so you might as well use a Unicode encoding for your text in the first place.
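To make the point about hex entities concrete, here is a short Python sketch (the sample text is an assumption). Each hex entity is just the character's Unicode code point, so producing or reading one means going through Unicode anyway.

```python
import html

text = "日本語"

# A hex entity is the Unicode code point of the character written in hexadecimal.
hex_entities = "".join(f"&#x{ord(ch):X};" for ch in text)

# Decoding maps each code point back to its character.
decoded = html.unescape(hex_entities)

print(hex_entities)     # three entities, one per character
print(decoded == text)  # True
```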

If your main audience is English-speaking, you might think you can get away with ISO-8859-1 or Cp1252, but that would be a mistake. Sooner or later somebody is going to write accented or other foreign characters, and when that happens it is too late to fix your encoding: some text is already screwed up.

Here is some further reading that has saved me a lot of headaches when playing around with charsets:

Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is a detailed introduction to character sets, their usage, and their differences, from joelonsoftware.com. The information there is quite general, but it is useful for figuring out which encoding to choose.

Character sets from Browser to Database is a very practical and pragmatic article from Sun that covers the various places where you have to verify that your encoding is not being converted to something else.

What Is UTF-8 And Why Is It Important? is another article by Sun that goes deep into the nitty-gritty of UTF-8. It should answer any question you still have about the details of UTF-8 after reading the first two articles.

LordOfThePigs
A: 

If I am working on a web site primarily in the ASCII space (English, most Romance languages), I convert everything non-ASCII to named or numbered entities. This makes it possible for me or other people without appropriate fonts to work on it. It might seem unlikely, but one day you'll end up using some godforsaken terminal over SSH that doesn't do UTF-8 and even if it does the host system won't have the right fonts installed.
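One way this conversion could be scripted is sketched below in Python. The helper name `to_entities` is hypothetical, it uses the standard library's HTML 4 entity table, and it does not escape markup characters such as `&` or `<`.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with a named or numeric entity."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                        # plain ASCII stays as-is
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # named entity, e.g. &eacute;
        else:
            out.append(f"&#{cp};")                # numeric fallback
    return "".join(out)


print(to_entities("Crème brûlée, 5 €"))
# Cr&egrave;me br&ucirc;l&eacute;e, 5 &euro;
```

The result is pure ASCII, so it survives any terminal or editor, at the cost of readability once most of the text is non-ASCII.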

If I'm writing text that's mostly not in ASCII, I'll use UTF-8. If the text is all entities that's just as unreadable as Unicode replacement boxes anyway.

Joe
A: 

The first 128 characters of Unicode are compatible with ASCII. A text written with only those 128 characters is both a valid ASCII document and a valid UTF-8 document. Unicode is a standard and should be used by everyone. English speakers will not see a difference, but non-English speakers will. Personally, I am quite disappointed with software and its creators when it cannot store and display even my last name correctly.
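The ASCII compatibility is easy to check. The following Python sketch, with made-up sample strings, compares the byte sequences and also shows where pure ASCII stops being enough:

```python
ascii_text = "plain English text"

# For characters below 128, the ASCII and UTF-8 byte sequences are identical.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Common symbols in English text fall outside that range.
symbols = "©¢½"
utf8_len = len(symbols.encode("utf-8"))  # each of these takes two bytes in UTF-8

try:
    symbols.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```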

I must also note that character encoding is only the first of a whole series of problems concerning internationalization. This is especially noticeable in smaller pieces of software, which are usually not designed to handle non-English grammar issues at all.

Zyx
Of course 7-bit ASCII is the basis of UTF-8. But that doesn't help even with English-only text. You'll have ©, ¢, ½ …
vartec