I have to build a tag cloud from a webpage/feed. Once you have the word-frequency table of tags, building the tag cloud is easy. My question is: how do I retrieve the tags/keywords from the webpage/feed in the first place?

This is what I'm doing now:

Get the content -> strip HTML -> split on \s, \n, \t (space, newline, tab) -> keyword list

But this does not work very well.

Is there a better way?

A: 

What you have is a rough first-order approximation. I think if you then go back through the data and count the frequency of 2-word phrases, then 3-word phrases, and so on up to the maximum number of words that can be considered a tag, you'll get a better representation of keyword frequency.
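
As a rough illustration of the phrase-counting idea, here is a minimal Python sketch. The function name `ngram_counts`, the `max_words` parameter, and the word-matching regex are my own assumptions for the example, and it expects text that has already had its HTML stripped.

```python
import re
from collections import Counter


def ngram_counts(text, max_words=3):
    """Count every phrase of 1..max_words consecutive words in text."""
    # Lowercase, then treat runs of letters, digits and apostrophes as words.
    # This is an assumed tokenization; adjust it for your content.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for n in range(1, max_words + 1):
        # Slide a window of n words across the word list.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


sample = "open source search engine and open source tag cloud"
counts = ngram_counts(sample)
print(counts["open source"])  # 2
```

Phrases that repeat (like "open source" here) then show up in the same frequency table as single words, so they can be weighted in the tag cloud the same way.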

You can refine this rough search pattern by specifying certain words that can be contained as part of a phrase (pronouns, etc.).
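
One possible reading of that refinement, sketched in Python: keep a list of filler words that may sit inside a phrase but may not be a tag on their own or start or end one. The `FILLER_WORDS` list and the `is_candidate_tag` helper are hypothetical names for this example, and the list is deliberately tiny.

```python
# Hypothetical filler-word list; a real one would be much longer.
FILLER_WORDS = {"i", "you", "he", "she", "it", "we", "they",
                "the", "a", "an", "and", "of"}


def is_candidate_tag(phrase, filler_words=FILLER_WORDS):
    """Accept a phrase only if its first and last words are not filler words."""
    words = phrase.split()
    if not words:
        return False
    # Filler words are still allowed in the middle of a longer phrase.
    return words[0] not in filler_words and words[-1] not in filler_words


phrases = ["it", "tag cloud", "the tag", "lord of the rings", "cloud and"]
print([p for p in phrases if is_candidate_tag(p)])
# ['tag cloud', 'lord of the rings']
```

Checking only the edges of a phrase is a design choice for this sketch: it drops fragments such as "the tag" while keeping multi-word names that legitimately contain filler words.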

Josiah