Some UTF-8 characters, such as the byte sequence C2 96 (hyphen), do not display correctly. The browser shows a box containing 00 96 instead of '-' (hyphen). What causes this behavior, and how do we correct it?

http://stuffofinterest.com/misc/utf8.php?s=128 (Refer this URL for the codes)

I found that this can be handled with HTML entities. Is there any way to display the character without converting it to an HTML entity?

+1  A: 

Two reasons come to mind:

  1. Are you sure that you have output the correct character code to the browser? Better check in some hex viewer.
  2. The font you are using doesn't have a glyph defined at this code point.
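
A quick way to do the check from point 1 without a separate hex viewer is to dump the bytes yourself. This is only a sketch: `hex_dump` and `sample` are hypothetical names, and the sample stands in for whatever your program actually sends to the browser.

```python
def hex_dump(data: bytes) -> str:
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


# Hypothetical sample: the letters a and b around U+0096, encoded as UTF-8.
sample = "a\u0096b".encode("utf-8")
print(hex_dump(sample))  # 61 c2 96 62
```

If the dump shows `c2 96`, the output really does contain U+0096 rather than a dash character.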
Vilx-
Yes, the character codes are correct. I have checked them in a hex viewer.
Krishna
+1  A: 

I suspect this is because the characters between U+0080 and U+009F inclusive are control characters. I'm still slightly surprised that they show differently when encoded directly in the HTML than using entities, but basically you shouldn't be using them to start with. U+0096 isn't really "hyphen", it's "start of guarded area".

See the U+0080-U+00FF code chart for more information. Basically, try to avoid control characters...
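
To see the difference for yourself, you can ask Python's `unicodedata` module for the general category of each character. A rough sketch, where `is_c1_control` is a helper name made up for this example:

```python
import unicodedata


def is_c1_control(ch: str) -> bool:
    """True if ch lies in the U+0080..U+009F control range."""
    return 0x80 <= ord(ch) <= 0x9F


# U+0096 (the character in question), U+2013 (en-dash), U+002D (hyphen-minus)
for ch in ("\u0096", "\u2013", "-"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_c1_control(ch))
```

Category `Cc` means "control"; the two dash characters report `Pd` (dash punctuation) instead.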

Jon Skeet
Thanks a lot. If a program encounters this character, how should we handle it? I tried this in Gmail: it does not display the box, and instead shows the "start of guarded area" character as '–'. Any ideas?
Krishna
How you want to handle this will depend on the application. You may want to strip the characters, or replace them with another Unicode character with similar display characteristics (e.g. use the proper hyphen character).
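
As one possible sketch of the "replace" option: if you assume the stray control characters came from windows-1252 text that was mislabeled along the way (an assumption, it may not hold for your data), you can map each one back through that code page and strip the few that have no windows-1252 meaning. The function name `repair_c1_controls` and its `strip_unmapped` flag are made up for this example.

```python
def repair_c1_controls(text: str, strip_unmapped: bool = True) -> str:
    """Replace U+0080..U+009F with the character windows-1252 assigns
    to the same byte value. Positions that windows-1252 leaves undefined
    are dropped, or kept as-is when strip_unmapped is False."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            try:
                out.append(bytes([cp]).decode("cp1252"))
            except UnicodeDecodeError:
                # Byte value has no windows-1252 character assigned.
                if not strip_unmapped:
                    out.append(ch)
        else:
            out.append(ch)
    return "".join(out)


print(repair_c1_controls("pages 10\u009620") == "pages 10\u201320")  # True
```

Whether to strip or keep the unmapped ones depends on the application, as noted above.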
Jon Skeet
A: 

The character you're talking about is an en-dash, not a hyphen. Its Unicode code point is U+2013, and its UTF-8 encoding is E2 80 93, not C2 96. That table you linked to is incorrect. The first two columns have nothing to do with UCS-2 or Unicode; they actually contain the windows-1252 encodings for the characters in question. The columns labeled "UTF-8 Hex" and "UTF-8 Native" are just plain wrong, at least for the rows labeled 128 to 159. The entities `&ndash;` and `&#150;` represent an en-dash, but the UTF-8 sequence C2 96 represents a non-displayable control character.
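
The encodings can be compared directly in Python. The last step is only a guess at how such a table could end up with C2 96: it treats the windows-1252 byte value as if it were a Unicode code point. The variable names are illustrative.

```python
en_dash = "\u2013"

utf8_en_dash = en_dash.encode("utf-8")       # the real UTF-8 encoding
cp1252_en_dash = en_dash.encode("cp1252")    # the windows-1252 encoding
decoded_c2_96 = b"\xc2\x96".decode("utf-8")  # what C2 96 actually means

# Assumed mistake: use the windows-1252 byte value as a code point.
mistaken = chr(cp1252_en_dash[0]).encode("utf-8")

print(utf8_en_dash.hex())               # e28093
print(cp1252_en_dash.hex())             # 96
print(f"U+{ord(decoded_c2_96):04X}")    # U+0096
print(mistaken.hex())                   # c296
```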

You shouldn't need to encode those characters manually anyway. Just tell your text editor (or whatever you use to create the content) to save the file as UTF-8.

Alan Moore
I acknowledge that it is not a hyphen. But it is definitely a UTF-8 character. As suggested, http://unicode.org/charts/PDF/U0080.pdf indicates that the character is "Start of Guarded Area". It displays as a hyphen when used with HTML entities (`&#150;`).
Krishna
No, the entity `&#150;` does represent an en-dash. It's based on windows-1252 and is therefore technically incorrect, but browsers support it for historical reasons. The correct numerical entity for en-dash, based on its Unicode code point, is `&#8211;` or `&#x2013;` hex.
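
Python's `html.unescape` follows the same HTML5 parsing rules as browsers, so it can illustrate the point. A small sketch (the `entities` and `decoded` names are just for this example):

```python
import html

entities = ("&ndash;", "&#150;", "&#8211;", "&#x2013;")

# Map each entity to the character it decodes to.
decoded = {entity: html.unescape(entity) for entity in entities}

for entity, ch in decoded.items():
    print(entity, f"U+{ord(ch):04X}")
```

All four come out as U+2013, including the windows-1252-based `&#150;`.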
Alan Moore