Is there a way to figure out whether a block of HTML contains zero text nodes?

e.g. this:

<p><div><span></span></div></p>

contains zero text nodes whereas this:

<p>Stuff</p><div><span>other stuff</span></div>

contains two.

Additionally, you're guaranteed that the HTML is XHTML-compliant, and the content is probably less than 4k in size. I'm using .NET, so if a server-side suggestion is made, please make it in C#. I suppose I could load the markup into an XmlDocument and traverse the DOM tree looking for non-empty XmlText nodes, but that would be a last resort, as speed is of paramount concern.

+1  A: 

I would do exactly what you suggest: look for non-empty XmlText nodes. Remember that XML does not have a consistent lexical form (e.g. quoting, whitespace and CDATA all cause problems). Until you have tried it, do you actually know that using the DOM will be a performance hit?

UPDATE: You don't have to use XmlDocument (or XDocument). There are many tools that will address this problem. I'd look at things like StAX (http://en.wikipedia.org/wiki/StAX), a streaming XML parser, where you can quit as soon as you hit a non-empty text node. The XML community has put a lot of work into optimising performance. You may find that Saxon (http://saxon.sourceforge.net/) or libxml2 (http://xmlsoft.org/) has what you need. "Programming with libxml2 is like the thrilling embrace of an exotic stranger." (Mark Pilgrim)
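To make the "quit as soon as you hit a non-empty text node" idea concrete, here is a minimal sketch in JavaScript (chosen only to match the other snippet in this thread). It is a hand-rolled character scanner, not StAX or any real XML parser, and `hasNonEmptyText` is a hypothetical name. It assumes well-formed XHTML with no comments, CDATA sections or processing instructions, and it treats entity references such as `&nbsp;` as text. In .NET, the forward-only `XmlReader` would be the natural tool for the same early-exit pattern.

```javascript
// Hypothetical helper: returns true as soon as a non-whitespace character
// is found outside of markup, so the rest of the input is never scanned.
// Assumes well-formed XHTML without comments, CDATA or processing instructions.
function hasNonEmptyText(xhtml) {
  let inTag = false; // currently between '<' and the matching '>'
  let quote = null;  // quote character of the attribute value we are inside, if any
  for (const ch of xhtml) {
    if (inTag) {
      if (quote) {
        if (ch === quote) quote = null; // attribute value ends
      } else if (ch === '"' || ch === "'") {
        quote = ch;                     // attribute value starts
      } else if (ch === ">") {
        inTag = false;                  // tag ends
      }
    } else if (ch === "<") {
      inTag = true;                     // tag starts
    } else if (!/\s/.test(ch)) {
      return true;                      // first real text character: stop early
    }
  }
  return false;
}

console.log(hasNonEmptyText("<p><div><span></span></div></p>"));                 // false
console.log(hasNonEmptyText("<p>Stuff</p><div><span>other stuff</span></div>")); // true
```

Because the scanner tracks quoted attribute values, a `>` inside an attribute does not confuse it, and for input that does contain text it typically returns after reading only the first few characters.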

In any case if you ask your question on XML-DEV (http://www.xml.org/xml-dev/ - feel free to mention I suggested it) then I'd be disappointed if you didn't get clear and useful suggestions.

peter.murray.rust
Unfortunately, I've used the XmlDocument object enough to know that it's not my first choice. However, it may be my only choice.
Robert C. Barth
+1  A: 

Given a block of HTML, you could always strip away everything between < and >, then strip all whitespace, and see whether the remaining string is empty. That approach would work in any language that supports regular expressions, but here's an example in JavaScript:

var isEmpty = someNode.innerHTML.replace(/<[^>]+>/g, "").replace(/\s/g, "") === "";
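Since the question is about a block of markup rather than a live DOM node, the same two replacements can be applied to a plain string. This is only a sketch; `isEmptyMarkup` is a hypothetical name, and the inputs are the two examples from the question.

```javascript
// Hypothetical wrapper: same two replacements as above, applied to a string.
function isEmptyMarkup(html) {
  return html
    .replace(/<[^>]+>/g, "") // drop anything that looks like a tag
    .replace(/\s/g, "")      // drop all whitespace
    === "";
}

console.log(isEmptyMarkup("<p><div><span></span></div></p>"));                 // true
console.log(isEmptyMarkup("<p>Stuff</p><div><span>other stuff</span></div>")); // false
```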
David Hedlund
regex for html is EEEvil! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
RCIX
that *is* true. it's still how i'd go about addressing this issue, tho, if i ever encountered this specific requirement, as it does by no means require that the expression understand anything of what's going on. it's true that it's vulnerable to breakage by unescaped less-than-signs intended for less-than use, followed at some point by greater-than-signs, but then *so is html itself*
David Hedlund
A: 

If I'm not mistaken, you should be able to use the innerText property (that is the Internet Explorer name; there is an equivalent in other browsers, but I can't remember what it's called) and just compare it to an empty string.

On second thought, this property may strip out whitespace, but it's worth a shot.

LorenVS
innerText is an IE-only property; there is no cross-browser equivalent.
Robert C. Barth
A: 

Here's why not to use regexes.

The following HTML passes HTML 4.01 validation.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">
<title>demo</title>
<div><p class=">" ></div>

If someNode is the div, the regex in David Hedlund's answer will fail: it treats the > inside the class attribute as the end of the tag. If the regex cannot cope with even simple valid HTML, what chance does it have with invalid markup?
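A quick way to see the failure is to run the two replacements over the div's inner markup as a plain string. This is a sketch under an assumption: the string below is the raw source of the p element with its implied end tag written out, and a browser's serialised innerHTML may differ. `stripTags` is a hypothetical helper name.

```javascript
// Hypothetical helper applying the same two replacements as the regex answer.
const stripTags = (html) => html.replace(/<[^>]+>/g, "").replace(/\s/g, "");

// The first match ends at the '>' inside the attribute value,
// so part of the tag is left behind and mistaken for text.
const leftover = stripTags('<p class=">" ></p>');

console.log(leftover);        // ">
console.log(leftover === ""); // false, although the element contains no text
```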

Alohci
I understand your point, but the possibility of this happening is near-zero in my circumstance; the XHTML is generated by a tool (TinyMCE) and the user may not edit it.
Robert C. Barth