My current project involves gathering text content from an element and all of its descendants, based on a provided selector.

For example, when supplied the selector #content and run against this HTML:

<div id="content">
  <p>This is some text.</p>
  <script type="text/javascript">
    var test = true;
  </script>
  <p>This is some more text.</p>
</div>

my script would return (after a little whitespace cleanup):

This is some text. var test = true; This is some more text.

However, I need to disregard text nodes that occur within <script> elements.

This is an excerpt of my current code (technically, it matches based on one or more provided selectors):

// get text content of all matching elements
for (x = 0; x < selectors.length; x++) { // 'selectors' is an array of CSS selectors from which to gather text content
  matches = Sizzle(selectors[x], document);
  for (y = 0; y < matches.length; y++) {
    match = matches[y];
    if (match.innerText) { // IE
      content += match.innerText + ' ';
    } else if (match.textContent) { // other browsers
      content += match.textContent + ' ';
    }
  }
}

It's a bit overly simplistic in that it just returns all text nodes within the element (and its descendants) that matches the provided selector. The solution I'm looking for would return all text nodes except for those that fall within <script> elements. It doesn't need to be especially high-performance, but I do need it to ultimately be cross-browser compatible.

I'm assuming that I'll need to somehow loop through all children of the element that matches the selector and accumulate all text nodes other than ones within <script> elements; it doesn't look like there's any way to identify JavaScript once it's already rolled into the string accumulated from all of the text nodes.

I can't use jQuery (for performance/bandwidth reasons), although you may have noticed that I do use its Sizzle selector engine, so jQuery's selector logic is available.

Thanks in advance for any help!

+2  A: 

EDIT:

Well, first let me say I'm not too familiar with Sizzle on its own, just within libraries that use it. That said...

If I had to do this, I would do something like:

var selectors = new Array('#main-content', '#side-bar');
function findText(selectors) {
    var rText = '';
    // typeof never yields 'array', so test with instanceof; $ is jQuery here
    var sNodes = selectors instanceof Array ? $(selectors.join(',')) : $(selectors);
    for(var i = 0; i < sNodes.length; i++) {
       // walk direct children so text nodes are seen too (selectors only match elements)
       var nodes = sNodes[i].childNodes;
       for(var j = 0; j < nodes.length; j++) {
         if(nodes[j].nodeType == 3) {
             rText += nodes[j].data; // plain text node
         } else if(nodes[j].nodeType == 1 && nodes[j].nodeName.toLowerCase() != 'script') {
             /* recursion - this would work in jQ not sure if 
              * Sizzle takes a node as a selector you may need 
              * to tweak.
              */
             rText += findText(nodes[j]); 
         }  
       }
    }

    return rText;
}

I didn't test any of that, but it should give you an idea. Hopefully someone else will pipe up with more direction :-)


Can't you just grab the parent node and check the nodeName in your loop? Like:

match = matches[y];
// skip <script> elements and anything directly inside one
if(match.parentNode.nodeName.toLowerCase() != 'script' && match.nodeName.toLowerCase() != 'script' ) {
    if (match.innerText) { // IE
      content += match.innerText + ' ';
    } else if (match.textContent) { // other browsers
      content += match.textContent + ' ';
    }
}

Of course, jQuery supports :not() in selectors, so couldn't you just do $(':not(script)')?

prodigitalson
Thanks prodigitalson - I'm not sure that this would accomplish my goal, though. I might have been a little vague in my code example (just edited it) - what it does is traverse an array of CSS selectors, and for each that matches a DOM node, it simply gets the innerText (IE) or textContent (other) property of that node. It doesn't actually loop through the elements' children. However, I think the latter is likely the best way to do this - loop through all descendants of the matched element, disregarding text nodes in <script>s - I'm just not sure what that code looks like.
Bungle
Thanks again! That looks like a good approach. Curious why you used the `Array` constructor and not bracket notation?
Bungle
@bungle: just a personal preference.
prodigitalson
Cool, thanks. I've heard bracket notation espoused as a best practice, but I'm not aware of any functional difference. Thanks again for your help.
Bungle
I'm not aware of a functional difference either.
prodigitalson
+2  A: 
function getTextContentExceptScript(element) {
    var text= [];
    for (var i= 0, n= element.childNodes.length; i<n; i++) {
        var child= element.childNodes[i];
        if (child.nodeType===1 && child.tagName.toLowerCase()!=='script')
            text.push(getTextContentExceptScript(child));
        else if (child.nodeType===3)
            text.push(child.data);
    }
    return text.join('');
}

Or, if you are allowed to change the DOM to remove the <script> elements (which wouldn't usually have noticeable side effects), quicker:

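// illustrative wrapper name; note this permanently removes the scripts from the page
function getTextContentRemovingScripts(element) {
    var scripts= element.getElementsByTagName('script');
    while (scripts.length!==0)
        scripts[0].parentNode.removeChild(scripts[0]);
    return 'textContent' in element? element.textContent : element.innerText;
}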
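
For the question's selector loop, either approach could be wrapped in a small helper that joins the per-element text and does the whitespace cleanup. A minimal sketch, where gatherText and its extractText parameter are hypothetical names (pass in getTextContentExceptScript or whatever per-element function you settle on):

```javascript
// Hypothetical helper: collects text from each matched element, then tidies whitespace.
// `elements` is any array-like of matches; `extractText` is the per-element function.
function gatherText(elements, extractText) {
    var parts = [];
    for (var i = 0; i < elements.length; i++) {
        parts.push(extractText(elements[i]));
    }
    // collapse runs of whitespace (including newlines), then trim both ends
    return parts.join(' ').replace(/\s+/g, ' ').replace(/^ | $/g, '');
}

// In the question's loop this might be called as (assumes Sizzle is loaded):
//   content += gatherText(Sizzle(selectors[x], document), getTextContentExceptScript) + ' ';
```

Passing the extractor in as an argument keeps the cleanup step independent of which of the two approaches is used.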
bobince
Awesome, thanks, bobince! I went with the first approach - you're probably right that removing `<script>` elements wouldn't typically have side effects, but I'll be using this code in the wild and don't want to risk any. I hadn't seen the use of `.data` before - I read up on it and it sounds robust. Is that cross-browser back to IE 6? Am I correct that it wouldn't pick up any text from, say, nested comment nodes - just the text content of the node itself?
Bungle
It's DOM Level 1 Core (http://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html), supported by all browsers and plain XML DOMs. The code above would only look at the data in Text nodes (`3` is `Node.TEXT_NODE`, but IE fails to provide that symbolic constant). In an XML document you might also want to take the data from a `CDATA_SECTION_NODE` (`4`). A `COMMENT_NODE` is `8` and is ignored.
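
A rough sketch of that XML-friendly variant, with the node types written as plain numbers since IE lacks the symbolic constants. The function name is made up for illustration; the logic is otherwise the same as in the answer:

```javascript
// Numeric node types from DOM Level 1 Core.
var ELEMENT_NODE = 1, TEXT_NODE = 3, CDATA_SECTION_NODE = 4;

// Hypothetical variant of getTextContentExceptScript that also keeps CDATA data.
function getTextAndCDataExceptScript(element) {
    var text = [];
    for (var i = 0, n = element.childNodes.length; i < n; i++) {
        var child = element.childNodes[i];
        if (child.nodeType === ELEMENT_NODE) {
            // recurse into every element except <script>
            if (child.tagName.toLowerCase() !== 'script')
                text.push(getTextAndCDataExceptScript(child));
        } else if (child.nodeType === TEXT_NODE || child.nodeType === CDATA_SECTION_NODE) {
            text.push(child.data);
        }
        // anything else (e.g. COMMENT_NODE, 8) is skipped
    }
    return text.join('');
}
```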
bobince
Ah, good to know. Thanks again for your help!
Bungle