views:

890

answers:

8

Why indeed? Wouldn't something like &br; be more appropriate?

+18  A: 

A tag and a character entity reference exist for different reasons: character entities are stand-ins for certain characters (sometimes required as escape sequences, for example &amp; for an ampersand &), while tags are there for structure.
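As a rough illustration of that distinction, here is a minimal sketch using Python's standard `html` module. The choice of Python is an assumption of this sketch (nothing in this thread uses it), and `html.unescape` follows HTML5 rules rather than the HTML 4 spec quoted in this answer. A character entity reference decodes to a character, while the undefined `&br;` is simply left alone:

```python
import html


def decode(text):
    """Decode character references using the standard library (HTML5 rules)."""
    return html.unescape(text)


# A defined character entity reference becomes the character it stands for.
ampersand = decode("&amp;")

# &nbsp; becomes the single character U+00A0 NO-BREAK SPACE.
nbsp = decode("&nbsp;")

# &br; is not a defined entity, so the text is passed through unchanged.
unknown = decode("&br;")

print(repr(ampersand), repr(nbsp), repr(unknown))
```

The `decode` wrapper is a hypothetical helper added only for readability; it does nothing beyond calling `html.unescape`.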

The reason the <br> tag exists is that HTML collapses whitespace. There needs to be a way to specify a hard line break - a place that has to have a line break. This is the function of the <br> tag.

There is no single character that has this meaning. U+2028 LINE SEPARATOR has a similar meaning, but even if it were used it would not help, as it is considered to be whitespace and HTML would collapse it.

See the answers from @John Kugelman and @Jon Hanna for more detail on this aspect.


Not entirely related, there is another reason why a &br; character entity reference does not exist: a line break is defined in such a way that it could have more than one character, see the HTML 4 spec:

A line break is defined to be a carriage return (&#x000D;), a line feed (&#x000A;), or a carriage return/line feed pair.

Character entities are single character escapes, so cannot represent this, again in the HTML 4 spec:

A character entity reference is an SGML construct that references a character of the document character set.

You will see that all the defined character entities map to a single character. A line break/new line cannot be cleanly mapped this way, thus an entity is required instead of a character entity reference.

This is why a line break cannot be represented by a character entity reference.

Regardless, it is not needed, as simply using the Enter key inserts a line break.

Oded
Hmmm... interesting. I've always thought it was there for historical reasons and grouped it together with the `b`s and `s`s, in the deprecated bin.
Yi Jiang
But `&br;` is an entity reference and not just a character reference. It sure can represent more than just a single character.
Gumbo
-1 I don't see how this is relevant at all. The reason a hard line break indicator is needed is because whitespace in HTML is collapsed and newlines are ignored. It doesn't have anything to do with Windows using `\r\n` for line endings.
John Kugelman
@John Kugelman - The question isn't "Why is it needed?"; the question is akin to "What is the difference between an HTML element and an HTML entity?" Oded has demonstrated that an HTML entity has to be representative of a single character, while a new line sometimes needs two characters; therefore a single HTML entity wouldn't cut it according to the spec.
LeguRi
That really has nothing to do with it. Different line ending encoding standards is a total red herring. The problem isn't that there's no way to represent a line ending in one character, it's that **HTML doesn't differentiate between spaces, tabs, and newlines**: they're all whitespace, and newlines don't get special treatment.
John Kugelman
Indeed, in cases where whitespace is significant (in a `<pre>` element) the different line endings are all normalised and not an issue at all. This answer is completely misleading.
Jon Hanna
That white space immediately before start tags or immediately after close tags is ignored does not explain why there is a `BR` element.
Gumbo
@John Kugelman: Well, it **does** differentiate between a space and an `&nbsp;`.
slacker
@Gumbo - fair enough, removed reference to whitespace before and after a tag.
Oded
@slacker, of course it differentiates between a space and a `&nbsp;`. It also differentiates between a space and a 'T' or an 's' or a '!'. Space is the same as U+0020 and is collapsible whitespace; `&nbsp;` is the same as U+00A0 and is not collapsible whitespace. They are completely different characters.
Jon Hanna
@Jon Hanna: That's right. But `&nbsp;` is whitespace too! The comment I referred to alleged that all whitespace is equivalent in HTML. It is not.
slacker
@slacker - Now that I think about it, I think Jon's right! If you put `&#32;` (ASCII 32 being 'space') it still gets collapsed into a single space; I think of it as though the entity doesn't render in the document, but in the source... like putting a unicode escape sequence in Java source.
LeguRi
@slacker: U+00A0 is not white space in the terms of HTML (see http://www.w3.org/TR/html4/struct/text.html#whitespace).
Gumbo
@slacker, it's not collapsible whitespace, nor whitespace convertible in HTML, nor whitespace convertible in XML, nor whitespace advised as convertible by Unicode (who only advise and give very detailed information on the semantics to aid taking or abandoning that advice). It is permissible by the semantics of U+00A0 to collapse them into a single non-breaking space (not done by any browser, which is also permissible) but not to consider it as marking a break between one word and the next. To not consider it a break between one *ML token and the next therefore makes sense.
Jon Hanna
@Richard, unicode escapes are a good analogy for the numeric (whether hex or dec) character entities. The entity references (like `&mdash;`) are a bit more like sticking `" + mdash + "` into the string, where mdash is a constant defined elsewhere as one of those unicode escapes. Like all analogies, this is only accurate up to a point and overanalysis will make it less rather than more useful.
Jon Hanna
I don’t understand why this answer is still getting up votes. From your initial wrong answer on you just seem to copy parts of the other answers to keep floating on top. But, apart from being inaccurate, your answer still does not answer the question why it’s not an entity reference that is used to mark HTML line breaks.
Gumbo
A: 

Yes. An HTML entity would be more appropriate, as a break tag cannot contain text and behaves much like a newline.

That's just not the way things are, though. Too late. I can't tell you the number of non-XML-compatible HTML documents I've had to deal with because of unclosed break tags...

Borealid
At least that's an easy one to deal with; unlike unclosed nested lists and tables.
Rex M
A break tag does not behave like a newline, as it does not get ignored in the rendering. A break tag indicates a line break **in the rendered document**, which would be impossible to indicate with an entity.
Jon Hanna
+2  A: 

Entities are content, tags are structure or layout (very roughly speaking). It seems whoever made the <br> a tag decided that breaking a line has more to do with structure and layout than with content. Not being able to actually "see" a <br> I'd tend to agree. Oh and I'm making this up as I go so feel free to disagree ;)

Nicolas78
+1  A: 

br elements can be styled, though. How would you style an HTML entity? Because they're elements it makes them more flexible.

Gregory Baker
I disagree; styling a `<br />` element is a hack. The system isn't built to accommodate hacks; hacks are built to go around the system.
LeguRi
... I would even say that this would have been a reason in favor of it being an entity as opposed to an element. Who was at that meeting saying "But what if they need a red border around the new line?" :P
LeguRi
Good point actually.
Gregory Baker
@Richard Actually the main (and almost only) use of style on the br is `<br style="clear: both" />`. It isn't really a hack.
HoLyVieR
@Gregory Baker: To my mind, the fact that BR tags can take styles like "clear:both" is the most compelling reason for a hard line break to be represented using a tag rather than an entity. Specifying that it must be mapped to some character which an implementation should render as a newline *after* white space was eliminated would also work, though special handling would be required to deal with leading blanks on new lines (if I had my druthers, the only swallowing of white space would be of newlines which are preceded or followed by whitespace; others would become blanks).
supercat
@HoLyVieR - I've never seen a `<br style="clear: both" />` which couldn't have been done with better CSS and more effective HTML element identifiers. Personally, I consider it a hack.
LeguRi
+7  A: 
John Kugelman
`&nbsp;` was invented because spaces were ignored but people still needed to force spaces into their texts in HTML (without using pre). So I think it's more than a valid question why the same hasn't happened for newline. Now there is a special U+00A0 Unicode character for `&nbsp;`, and I think it wouldn't be a bad idea to have a similar one for newline so that something like `&br;` could be implemented, for the exact same reason we have `&nbsp;`.
manixrock
@manixrock, you have the details of `&nbsp;` entirely backwards. `&nbsp;` is an entity reference and as such takes something defined elsewhere and inserts it into the source before it is processed at a higher level. If the non-breaking space character didn't already exist, then this would never have been possible. `&nbsp;` is useful because many people do not have a quick binding on their keyboards for non-breaking space, and because it's indistinguishable in source from space. The reason we don't have `&br;` is the question of what that entity should be replaced with.
Jon Hanna
@manixrock ... indeed it has never been defined in any standard that `&nbsp;` can't be collapsed into a single space (that would be a valid rendering behaviour), only that it can't be treated as a word-break when deciding where to wrap text. That `&nbsp;` forces extra space is valid, and the choice made by all browsers, but not required. You can't say a standard did something to allow X when it doesn't even promise that X will happen.
Jon Hanna
+11  A: 
Jon Hanna
HTML entities may be multi-character entities; the standard just doesn't define any by default. But you're right to say that `<br>` is an indication of a *semantic* line break. (Now, if you could just rail on a bit at idiot people who think that `<br><br>` is a replacement for `<p>`, my day would be complete… ;-))
Donal Fellows
@Donal That's precisely what I meant when I said they were technically different, but as there are no multi-character entities defined the distinction has no impact. As for people thinking double line breaks are the same as a paragraph, there are too many different ways in which such thinking is wrong to be able to fit complaining about that into the allowed comment space.
Jon Hanna
+1 - Now I get it :D
LeguRi
+3  A: 

In HTML all line breaks are treated as white space:

A line break is defined to be a carriage return (&#x000D;), a line feed (&#x000A;), or a carriage return/line feed pair. All line breaks constitute white space.

White space only separates words, and sequences of white space are collapsed:

For all HTML elements except PRE, sequences of white space separate "words" (we use the term "word" here to mean "sequences of non-white space characters"). […]

[…]

Note that a sequence of white spaces between words in the source document may result in an entirely different rendered inter-word spacing (except in the case of the PRE element). In particular, user agents should collapse input white space sequences when producing output inter-word space. […]

This means that line breaks cannot be expressed by plain characters. And although there are certain special characters in Unicode to unambiguously separate lines and paragraphs, they are not specified to do this in HTML too:

Note that although &#x2028; and &#x2029; are defined in [ISO10646] to unambiguously separate lines and paragraphs, respectively, these do not constitute line breaks in HTML […]

That means there is no plain character or sequence of plain characters that is to mark a line break in HTML. And that’s why there is the BR element.
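The collapsing rule quoted above can be sketched roughly in Python. This is a simplified model, not what any browser actually does: the function names are hypothetical, only space, tab, form feed, CR and LF are treated as white space, and a `<br>` is recognised with a naive regular expression instead of a real parser.

```python
import re

# Simplified HTML white space: space, tab, LF, CR, form feed.
_WHITESPACE = re.compile(r"[ \t\n\r\f]+")
# Naive match for <br>, <BR>, <br/> and <br />.
_BR = re.compile(r"<br\s*/?>", re.IGNORECASE)


def collapse_whitespace(text):
    """Replace every run of white space with a single space."""
    return _WHITESPACE.sub(" ", text)


def render_lines(source):
    """Split on <br>, then collapse white space inside each segment."""
    segments = _BR.split(source)
    return [collapse_whitespace(segment).strip() for segment in segments]


# Source newlines alone never produce a second rendered line.
print(render_lines("one\ntwo\r\nthree"))

# Only the BR element forces a break.
print(render_lines("Hello\n<br>\n  world"))
```

In this model a newline in the source is just one more white space character, which is why only the element produces a second entry in the returned list. U+00A0 is not in the character class, so it passes through untouched.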

Now if you want to use &br; instead of <br>, you just need to declare the entity br to represent the value <br>:

<!ENTITY br "<br>">

Having this additional entity named br declared, a general-purpose XML or SGML processor will replace every occurrence of the entity reference &br; with the value it represents (<br>). An example document:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd" [
   <!ENTITY br "<br>">
]>
<HTML>
   <HEAD>
      <TITLE>My first HTML document</TITLE>
   </HEAD>
   <BODY>
      <P>Hello &br;world!
   </BODY>
</HTML>
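To make the replacement step concrete, here is a toy sketch in Python of what such a processor does with the declaration. It is an assumption-laden illustration, not a real SGML or XML parser: `expand_entities` is a hypothetical helper, it only understands declarations of the exact form `<!ENTITY name "value">`, and it leaves any undeclared reference as it found it.

```python
import re

# Matches declarations of the form <!ENTITY name "value"> only.
_ENTITY_DECL = re.compile(r'<!ENTITY\s+(\w+)\s+"([^"]*)">')
# Matches named references such as &br;
_ENTITY_REF = re.compile(r"&(\w+);")


def expand_entities(source):
    """Replace each declared &name; with its declared replacement text."""
    declared = dict(_ENTITY_DECL.findall(source))

    def substitute(match):
        # Undeclared references are left unchanged.
        return declared.get(match.group(1), match.group(0))

    return _ENTITY_REF.sub(substitute, source)


document = '<!ENTITY br "<br>">\n<P>Hello &br;world!'
print(expand_entities(document))
```

After expansion the paragraph contains an ordinary `<br>`, so the entity is only shorthand for the element and does not replace it.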
Gumbo
They want to stop using `<br>` entirely, so they'd have to define it as `<pre>a;</pre>`
Jon Hanna
+1  A: 

HTML is a mark-up language - it represents the structure of a document, not how that document should appear visually. Take the <EM> tag as an example - it tells user-agents that they should give emphasis to any text that is placed between the opening and closing <EM> tags. However, it does not state how that emphasis should be represented. Yes, most visual web-browsers will place the text in italics, but this is only convention. Other browsers, such as monochrome text-only browsers may display the text in inverse. A screen reader might read the text in a louder voice, or change the pronunciation. A search-engine spider might decide the text is more important than other elements.

The same goes for the <BR> tag - it isn't just another character entity, it actually represents a break in the document structure. A <BR> is not just a replacement for a newline character, but is a "semantic" part of the document and how it is structured. This is similar to the way an <H1> is not just a way of making text bigger and bolder, but is an integral part of the way the document is structured.

Dan Diplo