Given a body of HTML, is there a function someone has already written that automatically extracts, say, the top 10 keywords from a chunk of HTML, ignoring the HTML tags (i.e. counting plain text only)?

It should ignore common words like "and", "is", "but", etc., and list the most frequent uncommon words.

Example input:

Mary had a <strong>snow</strong> lamb. <img src=lamb.jpg /> The <i>lamb</i> was snow white, it lay in the snow all white.

Output:

Snow (3)
White (2)
Lamb (2)

jQuery is fine!

+2  A: 

In short:

1) take the innerHTML of your body;

2) strip all tags with a .replace() (/<[^>]*>/g), putting a space in place of each tag; do this before removing punctuation, otherwise the "<" and ">" that the pattern relies on are already gone;

3) strip all punctuation and \n so you have a single-line string;

4) strip all common words (/\band\b/g, /\bbut\b/g, ...). E.g. if your useless words are those with fewer than 4 chars, then strip /\b\w{1,3}\b/g

  • now you should have a one-line string (str) without markup and useless words

4a) Optional: if you don't care about letter case, convert everything to lowercase (str.toLowerCase())

5) split on whitespace (str.trim().split(/\s+/)) to obtain an array (arr)

6)

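var words = Object.create(null),  // no inherited keys, so a word like "constructor" is counted correctly
    i = arr.length;

while (i--) {  // i-- rather than --i, so that arr[0] is counted too
    var extWord = arr[i];
    words[extWord] = words[extWord] ? words[extWord] + 1 : 1;
}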

7) loop over the words object with for...in to get each key (a single word) and its value (the number of occurrences of that word)

Hope this helps
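Putting the steps together, a minimal sketch could look like the following. The function name countWords and the short stop-word pattern are assumptions made for illustration only; a real stop-word list would be much longer, and the punctuation pattern here only keeps ASCII letters and digits.

```javascript
// Hypothetical helper that follows the steps above; all names are illustrative.
function countWords(html) {
  var str = html
    .replace(/<[^>]*>/g, " ")      // strip tags while "<" and ">" are still present
    .toLowerCase()                 // optional: ignore letter case
    .replace(/[^a-z0-9\s]/g, " ")  // strip punctuation (ASCII-only assumption)
    .replace(/\b(?:a|all|and|but|had|in|is|it|lay|the|was)\b/g, " "); // sample stop words

  var arr = str.split(/\s+/).filter(Boolean); // drop empty strings left by the replaces
  var words = Object.create(null);            // plain dictionary with no inherited keys

  for (var i = 0; i < arr.length; i++) {
    words[arr[i]] = (words[arr[i]] || 0) + 1;
  }
  return words;
}

var sample = "Mary had a <strong>snow</strong> lamb. <img src=lamb.jpg /> " +
             "The <i>lamb</i> was snow white, it lay in the snow all white.";
var counts = countWords(sample);

// Step 7: list every word with its number of occurrences.
for (var word in counts) {
  console.log(word + " (" + counts[word] + ")");
}
```

For the example from the question this counts snow 3 times and lamb and white 2 times each; "mary" also shows up once, since it is not in the sample stop-word list.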

Fabrizio Calderan
A: 

A slight modification to the approach outlined by Fabrizio, using jQuery.

// grab all text from the page; .text() returns the text content without the markup
var myDocumentText = $("body").text();

myParseText(myDocumentText);

function myParseText(myText) {
    // ... do processing of text in here with your logic to not count and, or, etc.
}
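One possible body for myParseText, as a sketch only: it assumes the input is already plain text, uses a tiny sample stop-word set (STOP_WORDS is a made-up name, and a real list would be far longer), and adds a hypothetical limit parameter so it can return the top N words sorted by frequency.

```javascript
// Assumption: a very small sample stop-word list, for illustration only.
var STOP_WORDS = new Set(["a", "all", "and", "but", "had", "in", "is", "it", "lay", "or", "the", "was"]);

// Returns an array of [word, count] pairs, most frequent first.
// The limit parameter is hypothetical and defaults to 10.
function myParseText(myText, limit) {
  // Lowercase the text and pull out runs of letters, digits and apostrophes.
  var tokens = myText.toLowerCase().match(/[a-z0-9']+/g) || [];
  var counts = new Map();

  tokens.forEach(function (word) {
    if (!STOP_WORDS.has(word)) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  });

  return Array.from(counts.entries())
    .sort(function (a, b) { return b[1] - a[1]; }) // highest count first
    .slice(0, limit || 10);
}
```

Called as myParseText($("body").text()), this would give the ten most frequent words that are not in the stop-word set. Note that .text() also includes the contents of any script or style elements inside the body, so those may need to be removed first.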

nopuck4you