



Given a body of HTML, is there an existing function that will automatically extract, say, the top 10 keywords from a chunk of HTML, excluding any HTML tags (i.e. just the plain text)?

It should ignore common words like "and", "is", "but", etc., and list the most frequent uncommon words.

Example input:

Mary had a <strong>snow</strong> lamb. <img src=lamb.jpg /> The <i>lamb</i> was snow white, it lay in the snow all white.

Example output:


Snow (3)
White (2)
Lamb (2)

jQuery is fine!

+2  A: 

In short:

1) take the innerHTML of your body;

2) strip all tags with a .replace() (/<[^>]*>/g), replacing each tag with a space so that neighbouring words do not run together;

3) strip all punctuation and \n so you have a single-line string (do this after step 2, otherwise the < and > that the tag regex relies on are already gone);

4) strip all common words (/\band\b/g, /\bbut\b/g, ...). E.g. if your useless words are those with fewer than 4 chars, strip /\b\w{1,3}\b/g

  • now you should have a one-line string (str) without markup and useless words

4a) Optional: if you don't care about WoRdCAse, just convert everything to lowercase (str.toLowerCase()); do this before step 4, or add the i flag to the regexes there

5) split on the blank space (str.split(' ')) to obtain an array (arr)


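6) count the occurrences of each word:

    var words = {},
        i = arr.length;

    // i-- (not --i) so that index 0 is visited as well
    while (i--) {
        var extWord = arr[i];
        if (extWord === '') continue; // skip empty strings produced by consecutive spaces
        words[extWord] = words.hasOwnProperty(extWord) ? words[extWord] + 1 : 1;
    }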

7) loop over the words object with for...in to get each key (a single word) and its value (the number of occurrences of that word)
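
The steps above can be sketched end to end as follows. This is only a sketch under a few assumptions: the function name topKeywords and its limit parameter are made up for illustration, the text is treated as plain ASCII English, "useless" words are taken to be those shorter than 4 characters as suggested in step 4, and tags are removed before punctuation so that the tag regex still has its < and > to match.

```javascript
// Hypothetical helper (not from the answer above): returns the `limit`
// most frequent words of an HTML string as { word, count } objects.
function topKeywords(html, limit) {
    var text = html
        .replace(/<[^>]*>/g, ' ')        // drop tags while < and > still exist
        .toLowerCase()                   // ignore case
        .replace(/[^a-z0-9\s]/g, ' ')    // punctuation becomes a space
        .replace(/\b\w{1,3}\b/g, ' ');   // drop words shorter than 4 characters

    var arr = text.split(/\s+/);         // split on runs of whitespace
    var words = Object.create(null);     // tally object without inherited keys
    for (var i = 0; i < arr.length; i++) {
        if (arr[i] !== '') {             // skip empty strings at the edges
            words[arr[i]] = (words[arr[i]] || 0) + 1;
        }
    }

    var result = [];                     // collect the pairs, then sort by count
    for (var word in words) {
        result.push({ word: word, count: words[word] });
    }
    result.sort(function (a, b) { return b.count - a.count; });
    return result.slice(0, limit);
}
```

For the example in the question, topKeywords(input, 3) puts snow (3) first, followed by lamb and white with 2 each. A real stop-word list would probably do better than the length cut-off, which also throws away short but meaningful words.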

Hope this helps.

Fabrizio Calderan

A slight modification of the approach outlined by Fabrizio, using jQuery:

// grab all text from the page (jQuery's .text() already leaves out the tags)
var myDocumentText = $("body").text();

function myParseText(myText) {
    // ... process the text here with your own logic so that "and", "or", etc. are not counted
}
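
One way the body of myParseText might look is sketched below. It is an illustration rather than the answerer's code: the STOP_WORDS list is an assumed, deliberately tiny sample, and the function is assumed to return [word, count] pairs sorted by frequency. It works on plain text, so it can be fed the result of $("body").text() directly.

```javascript
// Illustrative stop-word list (an assumption); extend it to suit your content.
var STOP_WORDS = ['a', 'all', 'and', 'but', 'had', 'in', 'is', 'it', 'lay', 'or', 'the', 'was'];

function myParseText(myText) {
    // Pull out runs of letters (internal apostrophes allowed); this skips
    // punctuation and whitespace without a separate stripping pass.
    var tokens = myText.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g) || [];
    var counts = {};
    tokens.forEach(function (token) {
        if (STOP_WORDS.indexOf(token) !== -1) { return; }   // skip common words
        counts[token] = counts.hasOwnProperty(token) ? counts[token] + 1 : 1;
    });
    // Turn the tally into [word, count] pairs, most frequent first.
    return Object.keys(counts)
        .map(function (word) { return [word, counts[word]]; })
        .sort(function (a, b) { return b[1] - a[1]; });
}
```

With jQuery on the page, myParseText($("body").text()).slice(0, 10) would then give the ten most frequent words.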

