views: 104

answers: 4
It seems to me that if UTF-8 were the only encoding ever used anywhere, there would be a lot fewer issues with code:

  • Don't even need to think about encoding issues.
  • No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.
  • Browsers don't need to wait for the <meta> tag specifying encoding before they can do anything. StackOverflow doesn't even have the meta tag, making browsers download the full page first, slowing page rendering.
  • You would never see ? and other random symbols on old web pages (e.g. in place of Microsoft Word's special [read: horrible] quotes).
  • More characters can be represented in UTF-8.
  • Other things I can't think of right now.

So why haven't the inferior encodings been nuked from space?

A: 

I don't think UTF-8 uses "2 bits"; it's variable-length. Also, a lot of OS-level code is UTF-16 and UTF-32 respectively, which means the choice for single-byte Latin encodings is between ASCII and ISO-8859-1.

Novikov
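
To make the variable-length point concrete, here is a minimal Python sketch (illustrative only; any language with a UTF-8 codec would show the same thing) in which different code points take 1, 2, 3, and 4 bytes:

    # Each character below occupies a different number of bytes in UTF-8.
    samples = [
        ("A", "U+0041, ASCII letter"),                  # 1 byte
        ("é", "U+00E9, Latin-1 range"),                 # 2 bytes
        ("€", "U+20AC, Basic Multilingual Plane"),      # 3 bytes
        ("\U0001F600", "U+1F600, outside the BMP"),     # 4 bytes
    ]

    for char, description in samples:
        encoded = char.encode("utf-8")
        print(f"{description}: {len(encoded)} byte(s) -> {encoded!r}")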
2 bits was meant to be 2 bytes. Edited the question.
Coronatus
Yes, but it still stands that UTF-8 is anywhere from 1 to 4 bytes per character.
Novikov
@Coronatus but the point is, UTF-8 is *NOT* a 2-byte encoding. It's a variable-length encoding that uses 1 to 4 bytes per character. That's one of its disadvantages compared to single-byte encodings: you have to worry about splitting a string in the middle of a character, you can't tell how long a string is (in characters) without parsing each byte, and so forth.
David Gelhar
It's common to need to know how many *bytes* are in a string for memory allocation purposes. Or, less commonly, to know how many *terminal columns* a string takes for text-wrapping purposes. But how often do you need to know the number of *characters*?
dan04
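
The two comments above can be illustrated with a short Python sketch (the sample string is made up): character count and byte count differ, and cutting the byte stream at an arbitrary offset can land in the middle of a character:

    text = "naïve café"              # 10 characters
    data = text.encode("utf-8")      # 12 bytes: "ï" and "é" take 2 bytes each

    print(len(text))                 # 10 -> characters
    print(len(data))                 # 12 -> bytes, what you allocate memory for

    # Slicing the bytes at an arbitrary position can split a character in two:
    truncated = data[:3]             # cuts "ï" (0xC3 0xAF) after its first byte
    try:
        truncated.decode("utf-8")
    except UnicodeDecodeError as err:
        print("split mid-character:", err)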
+6  A: 

Why are EBCDIC, Baudot, and Morse still not nuked from orbit? Why did the buggy-whip manufacturers not close their doors the day after Gottlieb Daimler shipped his first automobile?

Relegating a technology to history takes non-zero time.

msw
True, but Unicode has been around for almost 20 years.
dan04
But Baudot has been around for more than 100 years and occupies only 70% of the space of wasteful ASCII!
msw
+8  A: 
  • Don't even need to think about encoding issues.

True. Except for all the data that's still in the old ASCII format.

  • No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.

Incorrect. UTF-8 is variable-length: 1 to 4 bytes per character (the original design allowed sequences of up to 6 bytes).

  • Browsers don't need to wait for the <meta> tag specifying encoding before they can do anything. StackOverflow doesn't even have the meta tag, making browsers download the full page first, slowing page rendering.

Browsers don't generally wait for the full page; they make a guess based on the first part of the page data.

  • You would never see ? and other random symbols on old web pages (e.g. in place of Microsoft Word's special [read: horrible] quotes).

Except for all those other old web pages that use other non-UTF-8 encodings (the non-English speaking world is pretty big).

  • More characters can be represented in UTF-8.

True. Your problems of data validation just got harder, too.

Greg Hewgill
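
As a rough illustration of that last point (the sample strings here are made up), validation rules written with only ASCII in mind become too narrow once the full Unicode repertoire is in play:

    import re

    names = ["Smith", "Łódź", "名前"]

    # A validation rule written for the ASCII-only world:
    ascii_only = re.compile(r"^[A-Za-z]+$")

    for name in names:
        # The regex rejects legitimate names that str.isalpha() accepts.
        print(name, bool(ascii_only.fullmatch(name)), name.isalpha())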
Good answer, except for the first point: existing ASCII text is already perfectly valid UTF-8, which is not true of ISO-8859-1 text.
Avi
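
To illustrate Avi's point with a quick Python sketch: every ASCII byte sequence is already valid UTF-8 and decodes to the same text, whereas ISO-8859-1 data that uses the high byte range is not valid UTF-8:

    ascii_text = "plain ASCII text".encode("ascii")
    latin1_text = "café".encode("latin-1")     # 'é' becomes the single byte 0xE9

    # ASCII bytes are, byte for byte, valid UTF-8:
    print(ascii_text.decode("utf-8"))          # identical text back

    # The lone Latin-1 byte 0xE9 is not a valid UTF-8 sequence:
    try:
        latin1_text.decode("utf-8")
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err)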
+1  A: 
dan04