ansaurus

Question

How To Figure Out if an HTML Block Does Not Contain Any Text Nodes

Answer 1

+1 A:

I would do exactly what you suggest: look for non-empty XmlText nodes. Remember that XML does not have a consistent lexical form (quoting, whitespace and CDATA all cause problems), so working on the raw string is unreliable. And do you actually know that using the DOM will be a performance hit, or is that a guess until you have tried it?
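To make the "look for non-empty text nodes" idea concrete, here is a minimal sketch in JavaScript (the language used elsewhere in this thread, not the .NET API the question is about). It assumes only the standard DOM shape: `nodeType`, `nodeValue` and `childNodes`. The function name `hasNonEmptyText` is made up for illustration.

```javascript
// Standard DOM node type codes.
var TEXT_NODE = 3;
var CDATA_SECTION_NODE = 4;

// Returns true as soon as any text or CDATA node under `node`
// contains a non-whitespace character.
function hasNonEmptyText(node) {
  if (node.nodeType === TEXT_NODE || node.nodeType === CDATA_SECTION_NODE) {
    return /\S/.test(node.nodeValue || "");
  }
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    if (hasNonEmptyText(children[i])) {
      return true; // stop walking at the first hit
    }
  }
  return false;
}
```

The block is "empty" when `hasNonEmptyText(block)` returns false. Because the walk stops at the first hit, a block that does contain text is usually rejected after visiting only a few nodes.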

UPDATE: You don't have to use XmlDocument (or XDocument). There are many tools that address this problem. I'd look at things like StAX (http://en.wikipedia.org/wiki/StAX), a streaming XML parser, where you can quit as soon as you hit a non-empty text node. The XML community has put a lot of work into optimising performance. You may find that Saxon (http://saxon.sourceforge.net/) or libxml2 (http://xmlsoft.org/) has what you need. "Programming with libxml2 is like the thrilling embrace of an exotic stranger." (Mark Pilgrim)

In any case if you ask your question on XML-DEV (http://www.xml.org/xml-dev/ - feel free to mention I suggested it) then I'd be disappointed if you didn't get clear and useful suggestions.

peter.murray.rust 2009-11-25 07:42:23

Unfortunately, I've used the XmlDocument object enough to know that it's not my first choice. However, it may be my only choice.

Robert C. Barth 2009-11-26 04:59:21

Answer 2

+1 A:

Given a certain block of HTML, you could always strip away everything between < and >, along with all whitespace, and see if the remaining string is empty. That approach would work in any language that handles regular expressions, but here's an example in JavaScript:

// True when nothing but tags and whitespace is left in the block.
var isEmpty = someNode.innerHTML.replace(/<[^>]+>/g, "").replace(/\s/g, "") === "";

David Hedlund 2009-11-25 07:45:03

regex for html is EEEvil! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

RCIX 2009-11-25 07:53:28

That *is* true. It's still how I'd go about addressing this issue, though, if I ever encountered this specific requirement, as it by no means requires the expression to understand anything of what's going on. It's true that it's vulnerable to breakage by unescaped less-than signs intended for less-than use, followed at some point by greater-than signs, but then *so is HTML itself*.

David Hedlund 2009-11-25 08:00:27

Answer 3

A:

If I'm not mistaken, you should be able to use the innerText property (that is the Internet Explorer name; there is an equivalent in other browsers, but I can't remember what it is called) and just compare it to an empty string.

On second thought, this property may strip out whitespace, but it's worth a shot.
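A rough sketch of that check, assuming the equivalent in other browsers is the DOM `textContent` property (my assumption, the answer does not name it). The helper name `hasNoText` is hypothetical. It tries `textContent` first, falls back to `innerText`, and strips whitespace itself so it does not matter whether the browser already did.

```javascript
// Returns true when the node's text, read through whichever
// property the browser provides, is empty or whitespace only.
function hasNoText(node) {
  var text = typeof node.textContent === "string"
    ? node.textContent   // assumed non-IE equivalent
    : node.innerText;    // Internet Explorer
  return (text || "").replace(/\s/g, "") === "";
}
```

Usage would be something like `if (hasNoText(someNode)) { ... }`, with `someNode` being the element that holds the block.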

LorenVS 2009-11-25 08:14:45

innerText is an IE-only property; there is no cross-browser equivalent.

Robert C. Barth 2009-11-26 04:59:55

Answer 4

A:

Here's why not to use regexes.

The following HTML passes HTML 4.01 validation.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">
<title>demo</title>
<div><p class=">" ></div>

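If someNode is the div, the regex from Answer 2 will fail: the > inside the class attribute ends the match early, so part of the tag is left behind and counted as text. If the regex cannot cope with even simple valid HTML, what chance does it have with invalid markup?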
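To see the failure without a browser, here is a small sketch that runs the same two replace calls on a raw markup string instead of on someNode.innerHTML (what a given browser serializes into innerHTML may differ, so treat this as an illustration only). The function name `looksEmptyByRegex` is made up.

```javascript
// The tag-stripping test from the regex answer, applied to a string.
function looksEmptyByRegex(markup) {
  return markup.replace(/<[^>]+>/g, "").replace(/\s/g, "") === "";
}

var plain = '<p class="x" ></p>';
var tricky = '<p class=">" >';

console.log(looksEmptyByRegex(plain));  // true
console.log(looksEmptyByRegex(tricky)); // false: the leftover is ">
```

Neither string contains any text, yet the second one is reported as non-empty because the match stops at the > inside the attribute value.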

Alohci 2009-11-25 10:15:56

I understand your point, but the possibility of this happening is near-zero in my circumstance; the XHTML is generated by a tool (TinyMCE) and the user may not edit it.

Robert C. Barth 2009-11-26 05:01:26
