ansaurus

Question

Answer 1

+2 A:

I would select all the text nodes in the document, iterate through them (splitting the contents on whitespace), and increment a counter for each word encountered. Use a hash of keyword/count to speed up the keyword lookup for increment.

var keywords = new Hash();  // from prototype or use your own

function traverseNode( node ) {
    if (node.nodeName == '#text') {
       indexNode( node );
    }
    else {
       for (var i = 0, len node.ChildNodes.length; i < len; ++i) {
           traverse( node.childNodes[i] );
       }
    }
}

function indexNode( node ) {
    var words = node.NodeValue.split( /\s/ );
    for (var i = 0, len = words.length; i < len; ++i) {
        if (keywords.hasItem( words[i]) ) {
           keywords.setItem( words[i], keywords.getItem(words[i]) + 1 );
        }
        else {
           keywords.setItem( words[i], 1 );
        }
    }
}

traverseNode( document.body );

tvanfosson 2008-12-07 23:22:38

This is great. I hadn't been able to find a way to identify text nodes. #text does the job.

Mat 2008-12-08 22:19:21

Answer 2

+1 A:

An alternative to traversing the DOM manually is to use textContent instead of innerHTML. The downside is you can't filter out script or other elements you might not want to search.

Either way, I would split the text into words like @tvanfosson's answer, although you may need to split around something besides just whitespace depending on how you define a word.

Matthew Crumley 2008-12-07 23:48:15

Answer 3

+2 A:

I'm not sure if it is the fastest but the following worked pretty quickly for me.

var words = document.body.innerHTML.replace(/<.*?>/g,'').split(/\s+/);
var i = words.length;
var keywordCounts = {'keyword': 0, 'javascript': 0, 'today': 0};
var keywords = [];
var keywordMatcher = '';
var word;
for (word in keywordCounts) {
    keywords[keywords.length] = word ;
    keywordMatcher = keywordMatcher + '(' + word + ')?';
}
var regex = new RegExp(keywordMatcher);
var j = keywords.length;
var matched, keyword;
if (i && j) {
    do {
        i = i - 1;
        matched = words[i].match(regex);
        if (!matched) continue;
        j = keywords.length;
        do {
            j = j - 1;
            if (matched[j + 1]) {
                keyword = keywords[j];
                keywordCounts[keyword] = keywordCounts[keyword] + 1;
            }
        } while (j);
    } while (i);
}

I'll definitely grant that from a Big(O) perspective it isn't the best because as i and j get big it still requires n squared time but I've found regular expression processing to generally be pretty fast.

Basically I'm taking tvanfosson's idea and expanding on it, but rather than traversing the DOM I'm removing the tags with a regex (the first line) and then splitting the page into individual words. The keyword 'hash' is defined on the third line with initial counts (they should all start at zero obviously). From there I a new regular expression is constructed using each keyword as a group so when matched it returns an array of results that has (in my example) [fullMatch,keywordMatch,javascriptMatch,todayMatch]. I'm using decrementing do while loops because they've been shown in lots of places to be the fastest looping structure in JavaScript and since it doesn't matter in what order the words get processed loop speed is really the only consideration.

I hope this is helpful, if not it was at least a fun exercise. :)

Bryan 2008-12-08 00:55:33

If the keyword is "java" and the text is "javascript is not java" you will get a count of 2 instead of the expected 1.

some 2008-12-09 18:08:56

I did some profiling with this solution and the one I provided. With a text of 5700 words and 1 keyword, this took almost twice the time. With 86 keywords it was almost 20 times slower.

some 2008-12-09 18:21:59

Answer 4

+3 A:

I could not find hasItem, setItem or getItem in prototypes Hash like tvanfosson suggested, but used set and get and wrote a hasItem based on get. But profiling showed that it is slower to use prototypes Hash compared to javascripts native object.

If you have an array with keywords, convert it to an hash object with the keywords as the key and a value of 0:

function prepareCount(words) {
    var result = {};
    for (var i=0,len=words.length; i < len; i++) {
     result[words[i]] = 0;
    }
    return result;
}

Instead of splitting the string and go through it with a for statement, you could use a function as a parameter to replace. In the tests I did this was much faster. In the regexp I choosed to match everything but white space. You probably want to add other separators like parentheses, comma, dot and dash and so on, or if you know the text is ASCII only, you can use a-z instead.

function countKeywords(text,wordcount) {
    text.replace(/[^\s]+/g,function(s) {
    if (wordcount[s]!==undefined) { ++wordcount[s];}
        return "";
    });
    return wordcount;
}

To use it:

var wordcount = countKeywords(document.documentElement.textContent.toLowerCase(),prepareCount(["my","key","words"]));

Update:

Use this regexp to exclude all delimiters in ASCII but underscore (allows non ASCII characters):

/[^\s\x00-\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\x7F]+/g

if you know that your text with keywords are ASCII only, you can instead use: /[a-z]+

some 2008-12-08 07:18:51

ansaurus

tags:

views:

answers:

Fastest javascript page search

related questions