I'm looking for this definition to make my HTML renderer conform a bit better. Currently it's guessing which whitespace to keep, which to collapse and which to throw away. The SGML standard is hard to find, and the HTML standard doesn't seem to treat the subject in the depth I need.

Currently my renderer parses the HTML into a tree and then does a recursive layout pass to position all the elements and their content. I'm experimenting with throwing some whitespace out in the parse stage, i.e. not emitting whitespace-only text chunks in certain circumstances. This kinda works for the majority of cases, but there are a fair few edge cases that are getting hard to deal with.

(I'm also working on an editor subclass of the HTML control, and layout-time solutions are proving to be a bit of a problem in the editor, hence me working on getting them into the parse stage. The layout information isn't available until reflow time, which is some time after you have edited the document.)

Fire away with linkage/flames.

+4  A: 

I think the section 9.1 White space in the HTML 4 specification is what you’re looking for.

Gumbo
I read that and it didn't have the detail I needed. Currently looking through the HTML 5 parser documentation to see how it covers whitespace around elements.
fret
+2  A: 

If you're writing your own HTML parser, then I strongly recommend you use the parsing algorithm in the HTML 5 spec (http://www.whatwg.org/html5). It covers a large number of edge and corner cases, and general browser weirdness. Browsers don't follow SGML rules, but they are all homing in on either doing what the HTML 5 spec says, or the functional equivalent of it. There are several open source parsers available that implement the algorithm, so it should have everything you need.

Alohci
Now that I've had some time to read through the HTML5 spec (well, the parts that deal with parsing), I'm no closer to working out which whitespace characters end up being rendered and which disappear.
fret
Right. Which white space is rendered is a different question to the one you asked, which was which white space can be thrown away at the parse stage. Remember that CSS like white-space:pre can be applied by JavaScript long after the parse stage, so the parse stage cannot throw away any white space that might later be subject to such a rule.
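To make that point concrete, here is a rough Python sketch of deferring the decision to layout time: the text node keeps its raw characters, and what gets collapsed depends on the white-space value in force when the node is laid out. The function name `render_text` is made up for illustration, and the rules are a deliberate simplification (it ignores stripping at line starts and ends and collapsing across element boundaries), not the full CSS algorithm.

```python
import re

# Assumption: treat space, tab, CR and LF as the collapsible characters.
_ANY_WHITESPACE = re.compile(r"[ \t\r\n]+")
_SPACES_AND_TABS = re.compile(r"[ \t]+")
_AROUND_NEWLINE = re.compile(r"[ \t]*\n[ \t]*")


def render_text(text: str, white_space: str = "normal") -> str:
    """Return the characters a text node would contribute to layout.

    Hypothetical helper: a simplified model of the CSS white-space modes.
    """
    if white_space in ("pre", "pre-wrap"):
        # Everything is preserved, so the parser must not have dropped any.
        return text
    if white_space == "pre-line":
        # Newlines survive; spaces and tabs around them are removed,
        # and remaining runs of spaces/tabs collapse to one space.
        text = _AROUND_NEWLINE.sub("\n", text)
        return _SPACES_AND_TABS.sub(" ", text)
    if white_space in ("normal", "nowrap"):
        # Any run of whitespace, including newlines, becomes one space.
        return _ANY_WHITESPACE.sub(" ", text)
    raise ValueError(f"unsupported white-space value: {white_space!r}")


if __name__ == "__main__":
    raw = "a  \n  b"
    for mode in ("normal", "pre", "pre-line"):
        print(mode, repr(render_text(raw, mode)))
```

The same raw string produces different output per mode, which is why discarding whitespace-only text in the parser loses information that a later style change may need.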
Alohci
Incidentally, a good tool for seeing how browsers actually do it is Hixie's Live DOM Viewer (http://software.hixie.ch/utilities/js/live-dom-viewer/). You'll see that as you create white space in the "mark up to test" box, "#text:" nodes get created in the DOM, showing that the white space is not thrown away at the parse stage. Note that current/recent browsers don't behave exactly the same way, but the effect should be clear enough if you're using a Gecko, WebKit or Presto based browser.
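If a browser isn't handy, the same idea can be sketched offline with Python's standard `html.parser` module. Note the assumption: this is not an HTML5-spec parser and does not build a DOM the way browsers do, so it only illustrates that a parser can hand whitespace-only text through untouched. The class and function names (`TextNodeCollector`, `tokenize`) are made up for this example.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Record tags and text runs in document order, whitespace included."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append(("start", tag))

    def handle_endtag(self, tag):
        self.nodes.append(("end", tag))

    def handle_data(self, data):
        # Called for every text run, even one made only of whitespace.
        self.nodes.append(("#text", data))


def tokenize(markup: str):
    """Hypothetical helper: parse markup and return the recorded events."""
    parser = TextNodeCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at end of input
    return parser.nodes


if __name__ == "__main__":
    for kind, value in tokenize("<p>a</p>\n  <p>b</p>"):
        print(kind, repr(value))
```

The newline and indentation between the two paragraphs show up as their own "#text" entry, which is the kind of node the Live DOM Viewer displays.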
Alohci