views: 480
answers: 6
I would like to extract a reduced collection of "meaningful" tags (10 max) from an English text of any size.

http://tagcrowd.com/ is quite interesting, but the algorithm seems very basic (just word counting).

Is there any other existing algorithm to do this?

+1  A: 

When you subtract the human element (tagging), all that is left is frequency. "Ignore common English words" is the next best filter, since it deals with exclusion instead of inclusion. I tested a few sites, and this combination is very accurate. There really is no other way to derive "meaning", which is why the Semantic Web gets so much attention these days. It is a way to imply meaning with HTML... of course, that has a human element to it as well.
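
A minimal sketch of that approach in Python, frequency counting with a hand-rolled stop list (the stop words here are only a sample; a real list runs to a few hundred entries):

    import re
    from collections import Counter

    # Tiny sample stop list; a real one would hold a few hundred entries.
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
                  "it", "that", "this", "for", "on", "with", "as", "are"}

    def top_tags(text, n=10):
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(w for w in words if w not in STOP_WORDS)
        return [word for word, _ in counts.most_common(n)]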

hal10001
A: 

Perhaps "Term Frequency - Inverse Document Frequency" (TF-IDF) would be useful...
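
A from-scratch sketch of tf-idf scoring (the function and the corpus argument are my own illustration, not any particular library's API):

    import math
    import re
    from collections import Counter

    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def tfidf_tags(doc, corpus, n=10):
        """Score each word of `doc` by tf * idf against `corpus`, a
        non-empty list of other texts, and return the top n as tags."""
        words = tokenize(doc)
        tf = Counter(words)
        doc_sets = [set(tokenize(d)) for d in corpus]
        scores = {}
        for word, count in tf.items():
            # Document frequency: in how many corpus texts does it occur?
            df = sum(1 for s in doc_sets if word in s) or 1
            scores[word] = (count / len(words)) * math.log(len(corpus) / df)
        return sorted(scores, key=scores.get, reverse=True)[:n]

Words that occur in every document get an idf of zero, so common English words drop out without an explicit stop list.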

luv2lrn
+1  A: 

In text classification, this problem is known as dimensionality reduction. There are many useful algorithms in the literature on this subject.

Andrew
+1  A: 

Basically, this is a text categorization/document classification problem. If you have access to a number of already tagged documents, you could analyze which (content) words trigger which tags, and then use that information for tagging new documents.
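
A rough sketch of that idea, assuming the tagged collection comes as (text, tags) pairs; a real system would normalize by tag frequency, along Naive Bayes lines:

    from collections import Counter, defaultdict

    def learn_word_tag_counts(tagged_docs):
        """tagged_docs: iterable of (text, tags) pairs -> tag -> word counts."""
        counts = defaultdict(Counter)
        for text, tags in tagged_docs:
            words = text.lower().split()
            for tag in tags:
                counts[tag].update(words)
        return counts

    def suggest_tags(text, counts, n=10):
        """Rank known tags by how often their trigger words occur in `text`."""
        words = set(text.lower().split())
        ranked = sorted(counts,
                        key=lambda tag: sum(counts[tag][w] for w in words),
                        reverse=True)
        return ranked[:n]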

If you don't want to use a machine-learning approach and you still have a document collection, then you can use metrics like tf.idf to filter out interesting words.

Going one step further, you can use WordNet to find synonyms and replace a word with its synonym if the synonym's frequency is higher.
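
A sketch of that step via NLTK's WordNet interface (using NLTK here is my assumption; any WordNet binding would do):

    from nltk.corpus import wordnet   # needs: nltk.download('wordnet')

    def merge_synonyms(freq):
        """freq: dict of word -> count from the text. Fold each word into a
        synonym that is more frequent in this same text, so one tag stands
        in for the whole synonym group."""
        merged = dict(freq)
        for word in list(merged):
            synonyms = {lemma.name().lower()
                        for synset in wordnet.synsets(word)
                        for lemma in synset.lemmas()} - {word}
            for syn in synonyms:
                if merged.get(syn, 0) > merged.get(word, 0):
                    merged[syn] += merged.pop(word)  # move the count over
                    break
        return merged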

Manning & Schütze's Foundations of Statistical Natural Language Processing contains a much more thorough introduction to text categorization.

Torsten Marek
+4  A: 

There are existing web services for this. Three examples:

ceejayoz
+1  A: 

You want to do semantic analysis of the text.

Word frequency analysis is one of the easiest ways to do semantic analysis. Unfortunately (and obviously) it is also the least accurate one. It can be improved by using special dictionaries (e.g. of synonyms or word forms), "stop-lists" of common words, and other texts (to identify those "common" words and exclude them)...

As for other algorithms they could be based on:

  • Syntax analysis (e.g. trying to find the main subject and/or verb of a sentence)
  • Format analysis (analyzing headers, bold text, italics... where applicable; see the sketch after this list)
  • Reference analysis (if the text is on the Internet, for example, a link to it may describe it in a few words... used by some search engines)
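
For example, a crude take on the format-analysis idea for HTML input might look like this (the tag set and boost factor are arbitrary assumptions):

    import re
    from collections import Counter

    WORD = re.compile(r"[a-z']+")
    EMPHASIS = re.compile(r"<(h[1-6]|b|strong|em|i)[^>]*>(.*?)</\1>",
                          re.I | re.S)
    ANY_TAG = re.compile(r"<[^>]+>")

    def weighted_counts(html, boost=4):
        """Plain word frequency, except words inside headings, bold, or
        italics are counted `boost` extra times."""
        counts = Counter(WORD.findall(ANY_TAG.sub(" ", html).lower()))
        for _name, inner in EMPHASIS.findall(html):
            counts.update(WORD.findall(ANY_TAG.sub(" ", inner).lower()) * boost)
        return counts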

BUT... you should understand that these algorithms are merely heuristics for semantic analysis, not strict algorithms guaranteed to achieve the goal. The problem of semantic analysis has been one of the main problems in Artificial Intelligence/Machine Learning research since the first computers appeared.

Yacoder