views:

1041

answers:

4

how can i search an html page for a word fast? and how can i get the html tag that the word is in? (so i can work with the entire tag)

+4  A: 

You can iterate through DOM elements, looking for a substring within them. Neither fast nor elegant, but for small HTML might work well enough.

I'd try something recursive, like: (code not tested)

findText(node, text) {
  if(node.childNodes.length==0) {//leaf node
   if(node.textContent.indexOf(text)== -1) return [];
   return [node];
  }
  var matchingNodes = new Array();
  for(child in node.childNodes) {
    matchingNodes.concat(findText(child, text));
  }
  return matchingNodes;
}
vartec
+7  A: 

To find the element that word exists in, you'd have to traverse the entire tree looking in just the text nodes, applying the same test as above. Once you find the word in a text node, return the parent of that node.

var word = "foo",
    queue = [document.body],
    curr
;
while (curr = queue.pop()) {
    if (!curr.textContent.match(word)) continue;
    for (var i = 0; i < curr.childNodes.length; ++i) {
        switch (curr.childNodes[i].nodeType) {
            case Node.TEXT_NODE : // 3
                if (curr.childNodes[i].textContent.match(word)) {
                    console.log("Found!");
                    console.log(curr);
                    // you might want to end your search here.
                }
                break;
            case Node.ELEMENT_NODE : // 1
                queue.push(curr.childNodes[i]);
                break;
        }
    }
}

this works in Firefox, no promises for IE.

What it does is start with the body element and check to see if the word exists inside that element. If it doesn't, then that's it, and the search stops there. If it is in the body element, then it loops through all the immediate children of the body. If it finds a text node, then see if the word is in that text node. If it finds an element, then push that into the queue. Keep on going until you've either found the word or there's no more elements to search.

nickf
Hope the word being searched for isn't 'div' :P
apphacker
can you show me in script how can i get the node?
Chen Kinnrot
the innerText won't include any tag names, just the value of the text nodes, so you'll be safe there.
nickf
Heads up, @nickf: I think you forgot that the innerText property is not supported in FF and some other browsers. You might want to substitute it with 'textContent' in those cases. Still, +1 ;-)
Cerebrus
yeah, i found that out when testing this code. textContent seems to work in the same way
nickf
+1  A: 

You can try using XPath, it's fast and accurate

http://www.w3schools.com/Xpath/xpath_examples.asp

Also if XPath is a bit more complicated, then you can try any javascript library like jQuery that hides the boilerplate code and makes it easier to express about what you want found.

Also, as from IE8 and the next Firefox 3.5 , there is also Selectors API implemented. All you need to do is use CSS to express what to search for.

Azder
The OP is performing a string search in HTML content, not XML.
Cerebrus
XPath should work on the DOM too
nickf
Current browsers all support XPath which is implemented mostly for the DOM
Azder
A: 

You can probably read the body of the document tree and perform simple string tests on it fast enough without having to go far beyond that - it depends a bit on the HTML you are working with, though - how much control do you have over the pages? If you are working within a site you control, you can probably focus your search on the parts of the page likely to be different page from page, if you are working with other people's pages you've got a tougher job on your hands simply because you don't necessarily know what content you need to test against.

Again, if you are going to search the same page multiple times and your data set is large it may be worth creating some kind of index in memory, whereas if you are only going to search for a few words or use smaller documents its probably not worth the time and complexity to build that.

Probably the best thing to do is to get some sample documents that you feel will be representative and just do a whole lot of prototyping based around the approaches people have offered here.

glenatron