I want to be able to generate tag clouds from free text that comes from any number of different sources. For clarity, I'm not talking about how to display a tag cloud once the critical tags/phrases are already discovered. I'm hoping to be able to discover the meaningful phrases themselves, preferably on a PHP/MySQL stack.

If I had to do this myself, I'd start by establishing some kind of index of words/phrases that gives a "normal" frequency for each one, e.g. "Constantinople" occurs once in every 1,000,000 words on average (normal frequency 0.000001). Then, as I analyze a body of text, I'd find the individual words/phrases (another challenge!), compute the frequency of each within the input, and compare it against the expected frequency. Words with the highest ratio of observed to expected frequency get boosted priority in the cloud.
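The ratio idea described above can be sketched roughly as follows. It is written in Python for brevity, and the same logic should port to PHP without much trouble. The `BASELINE` table, the `DEFAULT_FREQ` fallback, and the function names are all hypothetical. Apart from the Constantinople figure, the baseline numbers are made up for illustration, and the sketch handles single words only, not phrases.

```python
import re
from collections import Counter

# Hypothetical baseline: expected relative frequency of each word in "normal" text.
# Only the Constantinople figure comes from the question; the rest are invented.
BASELINE = {
    "the": 0.05,
    "of": 0.03,
    "city": 0.0002,
    "constantinople": 0.000001,
}
DEFAULT_FREQ = 0.00001  # assumed fallback for words missing from the baseline


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def score_terms(text, baseline=BASELINE, default=DEFAULT_FREQ):
    """Return (word, ratio) pairs sorted by observed/expected frequency, highest first."""
    words = tokenize(text)
    if not words:
        return []
    counts = Counter(words)
    total = len(words)
    scores = {
        word: (count / total) / baseline.get(word, default)
        for word, count in counts.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


ranked = score_terms("The fall of Constantinople changed the city")
```

Here `ranked` puts "constantinople" first and common words such as "the" and "of" last, which is the boost the cloud is after. The hard parts remain the ones named in the question: getting a trustworthy baseline and detecting multi-word phrases.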

I'd like to believe someone else has already done this, WAY better than I could hope to, but I'll be damned if I can find it.

Any recommendations?

+1  A: 

You need an inverted index, the data structure used by full-text search engines. A text search library like Lucene or Xapian should help; many such libraries have PHP bindings.
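For readers unfamiliar with the term, here is a toy sketch of what an inverted index stores. It is plain Python, not the Lucene or Xapian API, and the function names are hypothetical. Real libraries add stemming, stop words, positional data, and on-disk storage on top of this idea.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: occurrences of the term in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def document_frequency(index, term):
    """Number of documents that contain the term (0 if it never occurs)."""
    return len(index.get(term, {}))


docs = {
    1: "the city of Constantinople",
    2: "the city walls",
    3: "the sea",
}
index = build_inverted_index(docs)
```

In this sample, "the" appears in all three documents and "constantinople" in only one. Counts like these, gathered over a large collection, are one possible source for the "normal" frequencies the question asks about.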

Stuart Sierra