What is the difference between UTF-8 and HTML entities?

+3  A: 

UTF-8 is a character encoding: it maps Unicode code points to sequences of bytes.

HTML entities provide a way to express many characters using only a limited (usually ASCII) character set. They also keep those characters human-readable when UTF-8 is not available.

The main purpose of HTML entities today is to make sure text that looks like HTML renders as text. For example, the less-than and greater-than signs (written &lt; and &gt; as entities), when placed in a certain order (e.g. <text>), can accidentally be parsed as HTML when the intent was for them to render as text.
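As a rough illustration of that escaping step, here is a minimal Python sketch using the standard library's html module. The sample string is made up for the example; the answer itself does not name any language or tool.

```python
import html

# Text that merely looks like markup (hypothetical sample input).
raw = "if a <text> b"

# html.escape replaces &, < and > (and quotes by default) with entities.
escaped = html.escape(raw)
print(escaped)  # if a &lt;text&gt; b

# Decoding the entities gives the original text back.
print(html.unescape(escaped) == raw)  # True
```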

Jweede
I'm sorry. Readable? I'd argue you should never use HTML entities and always use UTF-8 encoding for your pages.
Artefacto
Second that. Showing the actual character is a lot more readable than some ASCII combination that represents the same character. Of course you will need a font that supports those characters.
poke
@Artefacto: They're more *human* readable: it's a lot easier for a human to infer that &trade; means ™ compared to â„¢, which is the same character, only UTF-8 encoded (and parsed as Latin-1). It is true that UTF-8 is the better choice, for a long list of reasons, but if someone ever has to look at the raw bytes, &trade; will be more recognizable.
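The mismatch described in this comment can be reproduced in a few lines of Python. One assumption to flag: the sketch decodes with Windows-1252, which is what browsers typically use for pages labelled Latin-1. Strict ISO-8859-1 would turn the middle byte into an invisible control character instead of „.

```python
import html

trademark = "\u2122"  # the trademark sign

# The three bytes UTF-8 uses for this character.
raw_bytes = trademark.encode("utf-8")
print(raw_bytes)  # b'\xe2\x84\xa2'

# Reading those same bytes with the wrong encoding yields three characters.
misread = raw_bytes.decode("cp1252")
print(misread)  # â„¢

# The named entity is plain ASCII, so it reads the same either way.
print(html.unescape("&trade;") == trademark)  # True
```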
Michael Madsen
Yes, but that's not really a problem unless you're reading a page in one encoding as if it had another encoding, which is what you're doing. So unless you're reading HTML in circumstances that assume an ASCII encoding (e.g. tcpdump), HTML entities are not more readable.
Artefacto
I agree. UTF-8 is a much, much better choice. My answer originally didn't reflect this.
Jweede
+2  A: 

A ton. HTML entities are primarily intended to escape HTML markup so that it can be displayed as text in an HTML page instead of being interpreted as markup. For instance, &gt; displays a >, while a literal > closes a tag. While you can produce all of Unicode with HTML entities, it is very inefficient and downright ugly.

UTF-8 is a multi-byte encoding for Unicode, which covers how to represent characters outside of the classic US-ASCII code page without resorting to switching or mixing code pages. A single code point (think of it as a character, though that is not truly correct) takes between 1 and 4 bytes of data; the original design allowed up to 6, but the current definition (RFC 3629) stops at 4. It can represent any character inside and outside of the Basic Multilingual Plane (BMP), such as accented characters, East Asian characters, and Celtic tree writing (Ogham), among other scripts.
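To make the variable width concrete, here is a small Python sketch that encodes a few sample characters and counts the bytes. The samples are picked for illustration only. Under the current definition of UTF-8 (RFC 3629), a code point takes between 1 and 4 bytes.

```python
# Sample characters chosen for illustration; the labels are descriptive only.
samples = {
    "A": "US-ASCII letter",
    "\u00e9": "accented Latin letter",
    "\u5024": "East Asian character",
    "\u1681": "Ogham letter",
    "\U0001F600": "emoji outside the BMP",
}

lengths = []
for ch, label in samples.items():
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} {label}: {len(encoded)} byte(s), hex {encoded.hex(' ')}")

print(lengths)  # [1, 2, 3, 3, 4]
```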

Yann Ramin
A: 

UTF-8 is an encoding; htmlentities is a PHP function for making user input safe to display on the page, so that HTML tags are not added directly to the markup. See the manual.
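For readers not using PHP, a rough Python counterpart is sketched below. Note the assumptions: render_comment is a hypothetical helper invented for this example, and html.escape only touches &, <, > and quotes, so it behaves more like PHP's htmlspecialchars than htmlentities, which converts every character that has a named entity.

```python
import html

def render_comment(user_input: str) -> str:
    """Hypothetical helper: wrap untrusted text in a paragraph element."""
    # quote=True also escapes " and ', which matters inside attribute values.
    return f"<p>{html.escape(user_input, quote=True)}</p>"

print(render_comment('<script>alert("hi")</script>'))
# <p>&lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;</p>
```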

Lotus Notes
+1  A: 

Think of UTF-8 as a lossless, self-synchronizing way to map a list of natural numbers to a bytestream: you can get the natural numbers back (lossless), and if you land 'in the middle' of the stream, that's not a big problem (self-synchronizing).

Each natural number just happens to represent a 'character'.

HTML entities are a way to represent those same natural numbers in a form like &#127;, which stands for the natural number 127; in Unicode that is the DEL character.

In UTF-8 that's the bytestream: 0111 1111

Once you go above 127 it takes more than one octet; 128, for example, becomes: 1100 0010 1000 0000.

Two DEL chars in a row become 0111 1111 0111 1111. UTF-8 is designed in such a way that it's always possible to retrieve the original list of 'Unicode scalar values' from the bytestream, even though a bytestream of, for instance, 4 octets can map back to anywhere between 1 and 4 such scalar values. UTF-8 is thus 'variable length', as they call it.
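The bit patterns and the two properties named in this answer can be checked with a short Python sketch. The bits helper and the sample strings are made up for the example, and the resynchronization loop is a simplified illustration rather than a full decoder.

```python
def bits(data: bytes) -> str:
    """Render each octet as eight binary digits."""
    return " ".join(f"{octet:08b}" for octet in data)

print(bits(chr(127).encode("utf-8")))        # 01111111
print(bits(chr(128).encode("utf-8")))        # 11000010 10000000
print(bits((chr(127) * 2).encode("utf-8")))  # 01111111 01111111

# Variable length: each of these strings encodes to exactly 4 octets,
# yet they hold 1, 2, 3 and 4 scalar values respectively.
four_octet_texts = ["\U0001F600", "\u00e9\u00e9", "\u00e9AA", "AAAA"]
scalar_counts = [len(text) for text in four_octet_texts]
print(scalar_counts)  # [1, 2, 3, 4]

# Self-synchronizing: start reading in the middle of a 3-octet character.
stream = "a\u20acb".encode("utf-8")  # octets: 61 e2 82 ac 62
tail = stream[2:]                    # begins inside the euro sign
start = 0
# Continuation octets always look like 10xxxxxx, so skip them.
while start < len(tail) and (tail[start] & 0xC0) == 0x80:
    start += 1
resynced = tail[start:].decode("utf-8")
print(resynced)  # b
```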

Lajla
A: 

The "A" you see here on screen is not actually stored as "A" in the computer, it's rather a sequence of 1's and 0's. A character set or encoding specifies a way to encode characters in such a way. The ASCII character set only includes a handful of characters it can encode, almost exclusively limited to characters of the English language. But for historical reasons and technical limitations of the time, it used to be the character set of the internet (very early on).

Both UTF-8 and HTML entities can be used to encode characters that are not part of ASCII. HTML entities achieve this by giving a special meaning to special sequences of characters. Using them you can encode characters not covered by ASCII using only ASCII characters. Unicode does the same by simply extending the character set to include more characters, and UTF-8 encodes that larger set as bytes. HTML entities are only "valid" in an environment where you bother to decode them, which is usually a browser. UTF-8 characters are universal in any application that supports the character set.

Text containing only characters covered by ASCII:

Price: $20 (UTF-8)
Price: $20 (ASCII with HTML entities)

Text containing European characters not covered by ASCII:

Beträge: 20€ (UTF-8)
Betr&auml;ge: 20&euro; (ASCII with HTML entities)

Text containing Asian characters, most certainly not covered by ASCII:

値段：二千円 (UTF-8)
&#x5024;&#x6BB5;&#xFF1A;&#x4E8C;&#x5343;&#x5186; (ASCII with HTML entities)
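The pairs above can be converted into each other mechanically. Here is a small Python sketch using the standard html module and the xmlcharrefreplace error handler. One caveat: that handler emits decimal references rather than the hexadecimal ones shown above, but both decode to the same characters.

```python
import html

# Entity form to real characters.
german = html.unescape("Betr&auml;ge: 20&euro;")
japanese = html.unescape("&#x5024;&#x6BB5;&#xFF1A;&#x4E8C;&#x5343;&#x5186;")
print(german)
print(japanese)

# Real characters to ASCII-only text with numeric character references.
ascii_only = japanese.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_only.isascii())                   # True
print(html.unescape(ascii_only) == japanese)  # True
```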

The problem with UTF-8 is that the client needs to understand UTF-8. For the last decade or so this has been of no concern though, as all modern computers and browsers have no problem understanding UTF-8. UTF-8 (Unicode) can encode virtually all characters in use today on this planet (with minor exceptions). Using it you can work with text "as-is". It should absolutely be the preferred encoding to save text in.

The problem with HTML entities is that normal characters take on a special meaning. When writing &auml;, it takes on the special meaning of "ä". If you actually intend to write "&auml;", you need to double encode the sequence as &amp;auml;.
HTML entities are also notoriously unreadable. You do not want to use them to encode "special" characters in normal text. In this capacity they're a kludge bolted onto an inadequate character set. Use Unicode instead.
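The double encoding can be seen directly with Python's html module. This is only a sketch of the idea; a browser performs the decoding step when it parses the page.

```python
import html

literal = "&auml;"  # the six characters the author actually wants to show

# Decoded once, the sequence takes on its special meaning.
decoded = html.unescape(literal)
print(decoded == "\u00e4")  # True

# Escaping the ampersand first protects the sequence.
protected = html.escape(literal)
print(protected)                 # &amp;auml;
print(html.unescape(protected))  # &auml;
```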

The important use of HTML entities that is independent of the character set used is to separate HTML markup from text. HTML also gives special meaning to special character sequences. <b>text</b> is a normal sequence of characters, but it has a special meaning for HTML parsers. If you intended to just write "<b>text</b>", you will need to encode it as &lt;b&gt;text&lt;/b&gt;, so the HTML parser doesn't mistake it for HTML tags.
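To see the difference from the parser's side, here is a sketch using Python's built-in html.parser. EventLogger and parse are names invented for this example, and a real browser parser is far more involved.

```python
from html import escape
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Hypothetical helper that records what the parser sees."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Merge adjacent text chunks so the result does not depend on chunking.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + data)
        else:
            self.events.append(("text", data))

def parse(markup: str) -> list:
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events

print(parse("<b>text</b>"))
# [('start', 'b'), ('text', 'text'), ('end', 'b')]

print(parse(escape("<b>text</b>")))
# [('text', '<b>text</b>')]
```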

deceze