Hi all, I'm checking out the HTML rendering of this page: http://gothamist.com/2010/07/18/wikileaks_founder_no-show_at_nyc_ha.php

If you look at this image, you can see that when I inspect the DOM there are odd character breaks, shown in quotes, around "As a commenter".

http://img153.imageshack.us/f/screenshot20100730at840.png/

Any idea what those are and how I'd strip them out of the DOM to have clean, continuous text?

Thanks!

A: 

Those aren't elements; they're text nodes, which is exactly what they should be. HTML elements contain text nodes.

<p>text</p>

The paragraph element doesn't hold an element, it holds a text node.

One thing I noticed, though, is that you have invalid markup, and because of this the DOM tree in Firefox is inconsistent with the one in Chrome.

The text node for "As a commenter" should be a child of the paragraph, but the markup is invalid: the span (owned by the paragraph) contains a div. In Chrome, this makes the parser close the p early, so the text node becomes a sibling of the p instead of a child. As the HTML parser builds the tree, it reaches the <div>, realizes that it is already inside a p and a span, neither of which can contain a div, so it closes the p and creates the div as a new element after it.

Firefox's DOM tree is lenient and actually allows the nesting to go on. This is the cause of the inconsistency in the placement of the text node you're referring to.

Basically you have this:

<p><span><div>blah</div></span>As a commenter</p>

Chrome turns it into

<p><span></span></p><div>blah</div>As a commenter

Firefox lets it get away with it

<p><span><div>blah</div></span>As a commenter</p>

Solution: validate your HTML and don't let the span contain the div:

http://validator.w3.org/check?uri=http://gothamist.com/2010/07/18/wikileaks_founder_no-show_at_nyc_ha.php&charset=(detect+automatically)&doctype=Inline&group=0

After you properly mark it up, you'll see that the text node should live inside the p.
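The validator is the right tool for this, but as a rough complement, here is a minimal sketch that walks a node tree and reports any div nested inside a span. The helper name findDivsInsideSpans and the two sample trees are made up for illustration; the plain objects only mimic the nodeType, nodeName and childNodes properties of real DOM nodes so the snippet also runs outside a browser. Keep in mind it inspects the tree as the browser built it, so a parser that has already moved the div out of the span will report nothing.

```javascript
// Hypothetical helper: list <div> elements that sit somewhere inside a <span>.
// Works on real DOM elements or on plain objects with the same properties.
function findDivsInsideSpans(node, insideSpan = false, found = []) {
  const ELEMENT_NODE = 1; // nodeType of an element in the DOM standard
  if (node.nodeType !== ELEMENT_NODE) return found;
  const name = node.nodeName.toLowerCase();
  if (name === "div" && insideSpan) found.push(node);
  for (const child of Array.from(node.childNodes || [])) {
    findDivsInsideSpans(child, insideSpan || name === "span", found);
  }
  return found;
}

// Stand-in for the lenient tree: <p><span><div>blah</div></span>As a commenter</p>
const lenientTree = {
  nodeType: 1, nodeName: "P", childNodes: [
    { nodeType: 1, nodeName: "SPAN", childNodes: [
      { nodeType: 1, nodeName: "DIV", childNodes: [
        { nodeType: 3, nodeName: "#text", nodeValue: "blah" },
      ] },
    ] },
    { nodeType: 3, nodeName: "#text", nodeValue: "As a commenter" },
  ],
};

// Stand-in for the repaired tree: <p><span></span></p><div>blah</div>As a commenter
const repairedTree = {
  nodeType: 1, nodeName: "BODY", childNodes: [
    { nodeType: 1, nodeName: "P", childNodes: [
      { nodeType: 1, nodeName: "SPAN", childNodes: [] },
    ] },
    { nodeType: 1, nodeName: "DIV", childNodes: [
      { nodeType: 3, nodeName: "#text", nodeValue: "blah" },
    ] },
    { nodeType: 3, nodeName: "#text", nodeValue: "As a commenter" },
  ],
};

console.log(findDivsInsideSpans(lenientTree).length);  // 1
console.log(findDivsInsideSpans(repairedTree).length); // 0
```

In a browser you could pass an element such as document.body instead of the sample objects.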

meder
A: 

It's probably your browser. There's no such thing in Firebug.

XQYZ
+2  A: 

This is just how the WebKit inspector denotes a text node.

You are seeing more than one text node surrounding the anchor tags.

If you dump childNodes for that div, it looks like this:

0: Text
1: HTMLParagraphElement
2: HTMLDivElement
3: Text
4: Text
5: HTMLAnchorElement
6: Text
7: HTMLAnchorElement
8: Text
9: HTMLParagraphElement
10: Text
11: HTMLParagraphElement
12: Text
13: HTMLParagraphElement
14: Text

Inside the element inspector, those nodes marked as Text will be surrounded with quotes. This is just a feature of the element inspector.
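A dump like the one above can be produced with a few lines of script. This is only a sketch: describeChildNodes is a made-up helper name, it labels nodes by their numeric nodeType instead of by constructor name, and the sample object is a plain stand-in for a DOM fragment so the snippet runs outside a browser too.

```javascript
// Numeric nodeType values defined by the DOM standard.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: label each child of `parent` by its node type.
function describeChildNodes(parent) {
  return Array.from(parent.childNodes, (child, index) => {
    if (child.nodeType === TEXT_NODE) {
      // JSON.stringify makes whitespace-only text nodes visible.
      return index + ": Text " + JSON.stringify(child.nodeValue);
    }
    if (child.nodeType === ELEMENT_NODE) {
      return index + ": Element <" + child.nodeName.toLowerCase() + ">";
    }
    return index + ": Other (nodeType " + child.nodeType + ")";
  });
}

// Plain-object stand-in for a small DOM fragment (illustrative data only).
const sampleParent = {
  childNodes: [
    { nodeType: TEXT_NODE, nodeValue: "\n" },
    { nodeType: ELEMENT_NODE, nodeName: "P", childNodes: [] },
    { nodeType: TEXT_NODE, nodeValue: "As a commenter " },
    { nodeType: ELEMENT_NODE, nodeName: "A", childNodes: [] },
  ],
};

console.log(describeChildNodes(sampleParent).join("\n"));
```

In a browser console, passing a real element (for example the result of a querySelector call) in place of sampleParent should work the same way, since childNodes is array-like.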

"Any idea what those are and how I'd strip them out of the DOM to have clean, continuous text?"

Some browsers support innerText, which returns an element's rendered text as a single string.

For example, run this on that site:

document.querySelector('.asset-body').innerText
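Where innerText is not available, one alternative is to walk the text nodes yourself and join them. The following is a minimal sketch under that assumption: collectText and cleanText are hypothetical helper names, and the sample object with its sentence is invented purely so the snippet can run outside a browser. Unlike innerText, this approach ignores CSS, so it also picks up text from hidden elements and from script or style elements if they are inside the node.

```javascript
// Hypothetical helper: gather the text of every text node under `node`,
// in document order. Works on DOM nodes or look-alike plain objects.
function collectText(node) {
  const TEXT_NODE = 3; // nodeType of a text node in the DOM standard
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue;
  }
  let text = "";
  for (const child of Array.from(node.childNodes || [])) {
    text += collectText(child);
  }
  return text;
}

// Collapse runs of whitespace (including line breaks) into single spaces.
function cleanText(node) {
  return collectText(node).replace(/\s+/g, " ").trim();
}

// Plain-object stand-in for an element whose text is split across nodes.
const sampleBody = {
  nodeType: 1,
  childNodes: [
    { nodeType: 3, nodeValue: "\n  As a commenter " },
    { nodeType: 1, childNodes: [{ nodeType: 3, nodeValue: "points out" }] },
    { nodeType: 3, nodeValue: ",\n  the talk went ahead.\n" },
  ],
};

console.log(cleanText(sampleBody));
// As a commenter points out, the talk went ahead.
```

On the page itself, the equivalent call would be cleanText(document.querySelector('.asset-body')). The standard textContent property gives the same raw concatenation as collectText without the whitespace cleanup.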

Matt