I am proposing to convert my windows-1252 xhtml web pages to utf-8.

I have the following character entities in my coding (all preceded by &#):
39; - apostrophe
9658; - a right pointer
9668; - a left pointer

If I change the chartset and save the pages as utf-8 using my editor:
- the apostrophe remains in as a character entity;
- the pointers are converted to symbols within the code (presumably because the entities are not supported in utf-8?).

Questions:
1) If I understand utf-8 correctly, you don't need to use the entities and can type characters directly into the code. In which case is it safe for me to replace #39 with a typed in apostrophe?

2) Is it correct that the editor has placed the pointer symbols directly into my code, and will these be displayed reliably in modern browsers? It seems to be OK. Presumably I can't revert to the entities anyway if I use utf-8?

Thanks.

A: 

It's charset, not chartset.

1) It depends on where the apostrophe is used. It's a valid ASCII character as well, so depending on the character's intention (whether it's for display only, inside a DOMText node, or used in code) you may or may not be able to use a literal apostrophe.

2) If your editor is a modern editor, it will be working with UTF-8 byte sequences instead of one byte per char. Most of the characters used in code are just plain ASCII (and ASCII is a subset of UTF-8), so those characters take up one byte. Other characters may take up two, three or even four bytes. They will still be displayed to you as one character, but the relation between character and byte is no longer one to one.
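To make the character-versus-byte point concrete, here is a minimal Python sketch that counts the UTF-8 bytes behind a few characters. The two pointers are the ones from the question; the accented letter and the emoji are extra illustrations that are not part of the original pages.

```python
# Count how many bytes UTF-8 needs for a few single characters.
samples = {
    "apostrophe": "'",                  # U+0027, plain ASCII
    "e acute": "\u00e9",                # illustration only, not from the question
    "right pointer": "\u25ba",          # the character behind &#9658;
    "left pointer": "\u25c4",           # the character behind &#9668;
    "grinning face emoji": "\U0001F600",  # illustration only, not from the question
}

# Each value is one character; only the encoded length differs.
byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, n in byte_lengths.items():
    print(f"{name}: 1 character, {n} byte(s) in UTF-8")
```

The apostrophe needs one byte, the two pointers need three each, and the emoji needs four, yet each is still a single character on screen.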

Anyway, since all valid ASCII characters are exactly the same in ASCII, UTF-8 and even windows-1252, you should not see any problems using UTF-8. And you can still use numeric and named entities, because they are written in those valid characters. You just don't have to.

P.S. All modern browsers can do UTF-8 just fine, but our definitions of "modern" may vary.
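A small Python check of that claim, assuming a made-up snippet of markup that uses only ASCII characters plus the entities from the question (`cp1252` is Python's codec name for windows-1252):

```python
# Markup that contains only ASCII characters, including the entities themselves.
snippet = "<p>It&#39;s this way &#9658; and that way &#9668;</p>"

# Encode the same text three ways and compare the resulting bytes.
encoded = {enc: snippet.encode(enc) for enc in ("ascii", "utf-8", "cp1252")}
all_same = len(set(encoded.values())) == 1

print(all_same)
```

Because the entities are spelled with ASCII characters, a file containing them is byte-for-byte identical in all three encodings.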

Kris
A: 

Entities have three purposes: Encoding characters it isn't possible to encode in the character encoding used (not relevant with UTF-8), encoding characters it is not convenient to type on a given keyboard, and encoding characters that are illegal unescaped.

&#9658; should always produce ► no matter what the encoding. If it doesn't, it's a bug elsewhere.

► directly in the source is fine in UTF-8. You can do either that or the entity, and it makes no difference.
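As a rough check of that equivalence, the sketch below uses Python's standard `html.unescape` to decode the numeric reference and compares it with the literal character. It also shows why the entity was needed under windows-1252: the character has no slot in that encoding, while UTF-8 stores it directly.

```python
import html

literal = "\u25ba"                          # the right pointer typed directly
from_decimal = html.unescape("&#9658;")     # decimal numeric reference
from_hex = html.unescape("&#x25BA;")        # the same reference in hexadecimal

same_character = from_decimal == literal == from_hex
print(same_character)

# windows-1252 cannot represent this character at all.
try:
    literal.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False

# UTF-8 can, as a three-byte sequence.
utf8_bytes = literal.encode("utf-8")
print(fits_cp1252, utf8_bytes)
```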

' is fine in most contexts, but not some. The following are both allowed:

<span title="Jon's example">This is Jon's example</span>

But the one in the attribute value would have to be encoded in:

<span title='Jon&#x27;s example'>This is Jon's example</span>

because otherwise it would be taken as the ' that ends the attribute value.
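If the markup is generated rather than hand-written, the same rule can be applied automatically. A minimal sketch using Python's standard `html.escape`; the variable names are only for illustration:

```python
import html

title = "Jon's example"
text = "This is Jon's example"

# quote=True also escapes " and ', which is what an attribute value needs.
attr_value = html.escape(title, quote=True)
# quote=False leaves quotes alone, which is fine for ordinary element text.
body_text = html.escape(text, quote=False)

single_quoted = f"<span title='{attr_value}'>{body_text}</span>"
print(single_quoted)
```

Escaping the attribute value unconditionally means it no longer matters whether the attribute ends up delimited by single or double quotes.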

Jon Hanna
Thanks Jon. Some of my keywords include apostrophes; do you know how search engines interpret the entities? For example, do they see widget&#39;s the same as widget's? I have been wondering if they stop at the entity and just see widget. This would be a good reason for me not to use the entity in this circumstance.
cranfan
A search engine that couldn't follow the basic rules of HTML to the extent that it knows `&#39;` in source is the same as `'` (or even that `&#74;` is the same as `J`; there's just never much point doing that) isn't going to be worth worrying about. As it is, they'll not only understand that it's an apostrophe, they'll even be quite sophisticated in working out whether or not to include the apostrophe in matching it to search terms, etc.
Jon Hanna
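This says nothing about how any particular search engine is built, but as an illustration of what any standards-following HTML parser sees, here is a small sketch with Python's standard `html.parser`. The `TextExtractor` class and `visible_text` helper are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of a document, with references decoded."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities for us.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


with_entity = visible_text("<p>widget&#39;s</p>")
with_literal = visible_text("<p>widget's</p>")
print(with_entity == with_literal)
```

Both inputs yield the same text, apostrophe included, so a parser does not stop at the entity.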