views:

407

answers:

5

What are the implications of a change from UTF-8 to UTF-16 for HTML encoding? I would like to know your thoughts on the issue. Are there things I need to think of before making such a change?

Note: I'm interested because I need to handle enormous amounts of Japanese and Chinese text.

+4  A: 
  • Your bandwidth consumption is likely to nearly double, assuming most of your HTML is ASCII
  • Clients which incorrectly assume UTF-8 (or ASCII) will be confused
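
To put rough numbers on the first point, here is a minimal sketch that compares encoded sizes. The sample strings and the helper name `encoded_sizes` are made up for illustration. It assumes the CJK characters are in the Basic Multilingual Plane, where UTF-8 needs 3 bytes per character and UTF-16 needs 2.

```python
# Hypothetical sample strings, chosen only to illustrate the size difference.
markup = '<div class="article"><p></p></div>'  # ASCII-only markup
cjk_text = "日本語のテキスト"                     # 8 characters, all in the BMP


def encoded_sizes(s: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count without a BOM)."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))


# ASCII: 1 byte per character in UTF-8, 2 bytes in UTF-16.
print("markup:", encoded_sizes(markup))
# BMP CJK: 3 bytes per character in UTF-8, 2 bytes in UTF-16.
print("cjk:   ", encoded_sizes(cjk_text))
```

So ASCII-only markup doubles in size, while CJK text shrinks by about a third. Whether a whole page gets bigger or smaller depends on the mix of the two.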

Why do you want to change to UTF-16?

Jon Skeet
Or bandwidth consumption might drop, by up to a third for CJK text (3 bytes per character in UTF-8 versus 2 in UTF-16).
JacquesB
Yes, if most of your HTML is non-ASCII. Of course, given that the HTML tag and attribute names themselves are ASCII, the page would need a high content-to-markup ratio.
Jon Skeet
The OP mentions large amounts of Chinese and Japanese text, but good point about the markup.
JacquesB
Ah - the Chinese and Japanese text bit was added after I'd answered :)
Jon Skeet
A: 

I suspect most browsers won't even show your pages.

Martin Cote
+2  A: 

There is also byte order, which becomes an issue with any encoding whose code units are wider than 8 bits. UTF-16 files usually begin with a byte order mark (BOM, U+FEFF), which is used to determine the byte order, or endianness, of that file.

Wikipedia has a quite good explanation of this.
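
A minimal sketch of the byte order issue, using only Python's standard `codecs` module. The sample text is arbitrary.

```python
import codecs

text = "日本"  # U+65E5 U+672C

le = text.encode("utf-16-le")  # little-endian, no BOM written
be = text.encode("utf-16-be")  # big-endian, no BOM written

# Same characters, but each 16-bit unit has its two bytes swapped.
print(le.hex(" "))
print(be.hex(" "))

# With a BOM in front, the generic "utf-16" decoder picks the right order.
for bom, payload in ((codecs.BOM_UTF16_LE, le), (codecs.BOM_UTF16_BE, be)):
    assert (bom + payload).decode("utf-16") == text

# Without a BOM, guessing the wrong order yields different characters
# instead of an error.
wrong = le.decode("utf-16-be")
print(wrong == text)
```

The last case is the one to worry about: nothing fails, the text is just wrong.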

FeatureCreep
+5  A: 

I can think of a few things that will go wrong:

  1. You MUST specify that it's UTF-16 in the HTTP header. Unlike UTF-8, UTF-16 is not ASCII compatible, which means that everything needs to be in UTF-16 from the start.
  2. Older clients don't support UTF-16: for example, anything on Windows 9x, and possibly Mac OS 9 as well.
  3. Oh, wait, I almost forgot: North American and European copies of Windows XP don't have Asian fonts installed by default.
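
For point 1, a rough sketch of declaring the charset in the `Content-Type` header. The function `build_response` is a hypothetical helper, not part of any framework. It relies on Python's `"utf-16"` codec, which prepends a BOM to the encoded bytes.

```python
def build_response(html: str, charset: str = "utf-16") -> tuple:
    """Return (headers, body) with the charset declared in Content-Type.

    Hypothetical helper for illustration only.
    """
    body = html.encode(charset)  # the "utf-16" codec writes a BOM first
    headers = {
        "Content-Type": f"text/html; charset={charset}",
        "Content-Length": str(len(body)),  # length in bytes, not characters
    }
    return headers, body


headers, body = build_response("<p>日本語</p>")
print(headers["Content-Type"])
```

Note that `Content-Length` has to be computed from the encoded bytes, so it changes along with the encoding.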
R. Bemrose
Re 3: That issue is independent of whether the characters are encoded in UTF-8 or UTF-16.
JacquesB
True, but I thought I'd throw it in as long as I was listing problems.
R. Bemrose
+1  A: 

As far as I know, all modern browsers support UTF-16 encoding. But as others have pointed out, you should declare the encoding explicitly. Not all browsers and platforms will support all Unicode characters, but I think that is true regardless of which encoding you use.

However, if bandwidth is a big issue, you should probably consider gzipping the HTML. This will save much more bandwidth than switching encoding.
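
A quick way to check this against your own pages is sketched below, using the standard `gzip` module. The page here is made up and very repetitive, so it compresses far better than real content would. Treat the numbers as an illustration and measure with real pages.

```python
import gzip

# Hypothetical page: repetitive markup wrapped around CJK text.
row = '<li class="item"><a href="/news">最新のニュース記事</a></li>\n'
page = "<ul>\n" + row * 200 + "</ul>\n"


def raw_and_gzipped(html: str, charset: str) -> tuple:
    """Return (encoded size, gzipped size) in bytes for the given charset."""
    raw = html.encode(charset)
    return len(raw), len(gzip.compress(raw))


for charset in ("utf-8", "utf-16-le"):
    print(charset, raw_and_gzipped(page, charset))
```

For this sample, the gzipped UTF-8 output is much smaller than the uncompressed UTF-16 output.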

JacquesB