ansaurus

Question

Answer 1

+2 A:

In short:

1) take the innerHTML of your body;

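2) replace every \n with a space so you have a single-line string;

3) strip all tags with a .replace() (/<[^>]*>/g), then strip the punctuation; the tags have to go first, because removing punctuation would also remove the < and > that the tag regex relies on;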

4) strip all common words (/\band\b/g, /\bbut\b/g, ...). E.g. if your useless words are those with fewer than 4 chars, then strip /\b\w{1,3}\b/g

Now you should have a one-line string (str) without markup and useless words.

4a) Optional: if you don't care about letter case, just convert everything to lowercase (str.toLowerCase())

5) split on the blank space (str.split(' ')) to obtain an array (arr);

6) count the occurrences of each word:

    var words = {},
        i = arr.length;

    // i-- rather than --i, so that arr[0] is counted too
    while (i--) {
        var extWord = arr[i];
        words[extWord] = words[extWord] ? words[extWord] + 1 : 1;
    }

7) loop over the words object with for...in to obtain each key (a single word) and its value (the number of occurrences of that word)
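The steps above can be put together roughly as follows. This is only a sketch: countWords and its minLength parameter are hypothetical names, the "useless words" rule is assumed to be a simple length cutoff, and the function takes an HTML string so that it also runs outside a browser (in a page you would pass document.body.innerHTML).

```javascript
// Hypothetical helper: counts word occurrences in an HTML string.
// minLength is an assumed stand-in for the "useless words" rule.
function countWords(html, minLength) {
    var str = html
        .replace(/<[^>]*>/g, ' ')   // strip tags first, while < and > still exist
        .replace(/[^\w\s]/g, ' ')   // strip punctuation (keeps ASCII word characters only)
        .toLowerCase();

    var arr = str.split(/\s+/);     // splitting on whitespace runs also absorbs \n
    var words = Object.create(null); // no prototype, so a word like "constructor" is safe

    for (var i = 0; i < arr.length; i++) {
        var word = arr[i];
        if (word.length < minLength) continue; // drops short words and empty strings
        words[word] = (words[word] || 0) + 1;
    }
    return words;
}

var counts = countWords('<p>Keywords matter. Pick <b>keywords</b> from the page!</p>', 4);
for (var key in counts) {
    console.log(key + ': ' + counts[key]);
}
```

The regexes here only treat ASCII letters, digits and underscores as word characters, so text with accented letters would need a different punctuation rule.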

Hope this helps

Fabrizio Calderan 2010-10-11 17:00:41

Answer 2

A:

A slight modification of the approach outlined by Fabrizio, using jQuery.

// grab all text from the page
var myDocumentText = $("body").text();

myParseText(myDocumentText);

function myParseText(myText) {
    // ... do the processing of the text in here, with your own logic to not count "and", "or", etc.
}
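One possible way to fill in myParseText is sketched below. The STOP_WORDS list and the extra limit parameter are assumptions added for illustration and are not part of the answer above; the function is called with a plain string here so the example runs without jQuery.

```javascript
// Hypothetical stop-word list; extend it to suit your content.
var STOP_WORDS = { and: true, or: true, but: true, the: true, a: true, of: true };

// Returns up to `limit` of the most frequent words in myText, ignoring stop words.
// `limit` is an assumed extra parameter, not present in the original stub.
function myParseText(myText, limit) {
    // match() returns null when nothing matches, hence the fallback to []
    var tokens = myText.toLowerCase().match(/[a-z0-9]+/g) || [];
    var counts = Object.create(null);

    tokens.forEach(function (token) {
        if (Object.prototype.hasOwnProperty.call(STOP_WORDS, token)) return;
        counts[token] = (counts[token] || 0) + 1;
    });

    // sort the distinct words by descending frequency and keep the top ones
    return Object.keys(counts)
        .sort(function (a, b) { return counts[b] - counts[a]; })
        .slice(0, limit);
}

console.log(myParseText('The cat and the dog chased the cat, but the cat escaped.', 2));
```

In the page itself the call would be myParseText($("body").text(), 10) or similar, with the limit chosen to taste.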

nopuck4you 2010-10-11 17:39:32


Javascript auto pick keywords from HTML
