When including HTML entities in an HTML document, do the entities need to be from the same character encoding set that the document is specified to be using?

For example, if I am going to use the copyright sign in an HTML document that is specified as UTF-8, is it necessary to use the Unicode HTML entity (&#xA9;) or is it okay to use other entities, such as the ASCII HTML entity (&#169;)?

Please explain your answer. I am aware that it will "work", but is there a case where it will not work?

Thanks!

+2  A: 

The beauty of the UTF-8 encoding is that you can actually just include the binary character. You don't need to encode it as an entity at all. Thusly: ©

Oh, you just want to know the difference between the two entities? There is none. One gives the code point in hexadecimal and the other in decimal.

RibaldEddie
By the "binary character", do you mean that I could just copy the symbol from your answer and paste it into my HTML document (meaning it will look like the symbol in the code) and, as long as it's UTF-8, it's okay??
letseatfood
This. What's the point of using UTF-8 if you're going to encode everything in entities? This is what Unicode is for!
You
Correct. Some characters still need to be entity encoded, but only those that have semantic meaning in HTML, like less than and greater than. But the copyright sign just works (tm).
RibaldEddie
@You My intention is not to "encode in entities", but to create an HTML document that is "correct". If UTF-8 allows for not encoding, then I won't encode. Also, what do you mean by "This."?
letseatfood
@RibaldEddie - Thanks!
letseatfood
Sorry, that should be Just Works™.
RibaldEddie
@RibaldEddie - While your answer is helpful, it does not specifically answer my question. I should have been more specific. Could you comment on whether the actual entity used is important if using an encoding set other than UTF-8? I am interested in a more general manner.
letseatfood
The copyright symbol appears in ISO-8859-1 too. If you correctly declare the encoding used in your HTML document so that browsers can properly display the text, you don't need entities for ISO-8859-1 either. The first 256 Unicode code points match ISO-8859-1, and the copyright symbol is among them, so it's a fairly safe symbol to write in HTML without an entity (the bytes on disk differ, though: A9 in ISO-8859-1, C2 A9 in UTF-8). Characters outside ISO-8859-1 must be written as entities if the HTML document is advertised to the browser as an ISO-8859-1 document. If no entity exists for a multibyte char...
RibaldEddie
... then you must use some other encoding, most reasonably that would be UTF-8.
RibaldEddie
+3  A: 

&#169; and &#xA9; specify the same character: 169 is equivalent to hexadecimal A9. Both specify a copyright symbol. Character entities in HTML always refer to Unicode code points; this is covered in the HTML 4 Standard. Thus, even if your character set changes, your entities still refer to the same characters.
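
One quick way to check this outside a browser is to decode the references with Python's standard `html` module. This is only a sketch; the variable names are mine, and Python is just a convenient stand-in for what an HTML parser does:

```python
import html

# Three spellings of the copyright sign as character references.
decimal_ref = "&#169;"   # decimal code point
hex_ref = "&#xA9;"       # hexadecimal code point (0xA9 == 169)
named_ref = "&copy;"     # named entity

decoded = [html.unescape(ref) for ref in (decimal_ref, hex_ref, named_ref)]

# All three resolve to U+00A9, regardless of any byte encoding.
all_same = all(ch == "\u00a9" for ch in decoded)
print(decoded, all_same)
```

No document encoding appears anywhere in that snippet, which is the point: the references name a code point, not a byte.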

This also means that you can encode characters that don't actually appear within your character set of choice. I just created a document in the ISO-8859-1 character set, but it includes a Greek lambda. Also, ASCII is not able to directly encode a copyright symbol, but it can through character entities.
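
As a rough illustration of that fallback (not how I built my document, just a sketch with made-up sample text), Python's `xmlcharrefreplace` error handler does the same thing automatically: any character the target encoding lacks is written as a numeric reference.

```python
text = "Greek lambda: \u03bb, copyright: \u00a9"

# "xmlcharrefreplace" swaps any character the target encoding lacks
# for a decimal numeric character reference.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

print(latin1_bytes)  # lambda becomes a reference, (c) stays a raw byte
print(ascii_bytes)   # both characters become references
```

ISO-8859-1 keeps the copyright sign as the single byte A9 and only escapes the lambda; ASCII has to escape both.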

Edit: Reading the comments on the other answer, I want to clarify this a bit. If you are using UTF-8 as the character encoding for your document, you can, within the raw HTML source, write a copyright symbol just as-is. (You need to find some way to input it, of course: copy-paste being the usual.) UTF-8 will allow you to directly encode any symbol you want. ISO-8859-1 is much more limited, and ASCII even more so. For example, within my HTML, if my document is a UTF-8 document, I can do:

<p>Hi there. This document is ©2010. Good day!</p>

or:

<p>Hi there. This document is &#xA9;2010. Good day!</p>

or:

<p>Hi there. This document is &copy;2010. Good day!</p>

The first is only valid if the character set supports "©". The other two are always valid, but less readable. Whatever text editor you're using, if it is worth its salt, should be able to tell you what character set it is encoding the document in.

If you do this, you need to make sure your web server informs the client of the correct character set, or that your document declares it with something like:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

I've used UTF-8 there as an example. XHTML should have the character set in the opening <?xml ... ?> declaration.
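
To see why the declared character set has to match the bytes actually saved, here is a small sketch in Python (again just a stand-in for what a browser does when it decodes the page; the names are illustrative):

```python
copyright_sign = "\u00a9"

utf8_bytes = copyright_sign.encode("utf-8")         # two bytes: C2 A9
latin1_bytes = copyright_sign.encode("iso-8859-1")  # one byte: A9

# Reading UTF-8 bytes as ISO-8859-1 gives an extra stray character.
misread = utf8_bytes.decode("iso-8859-1")

# Reading the lone ISO-8859-1 byte as UTF-8 is not valid UTF-8 at all.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(utf8_bytes, latin1_bytes, repr(misread), decodes_as_utf8)
```

The entity forms avoid this problem entirely, since they are plain ASCII in the source and name the code point directly.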

Thanatos
Correct, so if for some reason you have to use neither Latin 1 nor Unicode as your document character set, you can still display the copyright symbol using those entities.
RibaldEddie