views: 480
answers: 6
I would like to extract a reduced collection of "meaningful" tags (10 max) from an English text of any size.

http://tagcrowd.com/ is quite interesting, but the algorithm seems very basic (just word counting).

Is there any other existing algorithm to do this?

+1  A: 

When you subtract the human element (tagging), all that is left is frequency. "Ignore common English words" is the next best filter, since it deals with exclusion instead of inclusion. I tested a few sites, and this combination is very accurate. There really is no other way to derive "meaning", which is why the Semantic Web gets so much attention these days. It is a way to imply meaning with HTML... of course, that has a human element to it as well.
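
A minimal sketch of that approach in Python, frequency counting with a hand-rolled stop list (the stop words here are only a sample; a real list runs to a few hundred entries):

    import re
    from collections import Counter

    # Tiny sample stop list; a real one would hold a few hundred entries.
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
                  "it", "that", "this", "for", "on", "with", "as", "are"}

    def top_tags(text, n=10):
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(w for w in words if w not in STOP_WORDS)
        return [word for word, _ in counts.most_common(n)]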

hal10001
A: 

Perhaps "Term Frequency - Inverse Document Frequency" (TF-IDF) would be useful...
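
A from-scratch sketch of tf-idf scoring (the function and the corpus argument are my own illustration, not any particular library's API):

    import math
    import re
    from collections import Counter

    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def tfidf_tags(doc, corpus, n=10):
        """Score each word of `doc` by tf * idf against `corpus`, a
        non-empty list of other texts, and return the top n as tags."""
        words = tokenize(doc)
        tf = Counter(words)
        doc_sets = [set(tokenize(d)) for d in corpus]
        scores = {}
        for word, count in tf.items():
            # Document frequency: in how many corpus texts does it occur?
            df = sum(1 for s in doc_sets if word in s) or 1
            scores[word] = (count / len(words)) * math.log(len(corpus) / df)
        return sorted(scores, key=scores.get, reverse=True)[:n]

Words that occur in every document get an idf of zero, so common English words drop out without an explicit stop list.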

luv2lrn
+1  A: 

In text classification, this problem is known as dimensionality reduction. There are many useful algorithms in the literature on this subject.

Andrew
+1  A: 

Basically, this is a text categorization/document classification problem. If you have access to a number of already tagged documents, you could analyze which (content) words trigger which tags, and then use that information for tagging new documents.
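
A rough sketch of that idea, assuming the tagged collection comes as (text, tags) pairs; a real system would normalize by tag frequency, along Naive Bayes lines:

    from collections import Counter, defaultdict

    def learn_word_tag_counts(tagged_docs):
        """tagged_docs: iterable of (text, tags) pairs -> tag -> word counts."""
        counts = defaultdict(Counter)
        for text, tags in tagged_docs:
            words = text.lower().split()
            for tag in tags:
                counts[tag].update(words)
        return counts

    def suggest_tags(text, counts, n=10):
        """Rank known tags by how often their trigger words occur in `text`."""
        words = set(text.lower().split())
        ranked = sorted(counts,
                        key=lambda tag: sum(counts[tag][w] for w in words),
                        reverse=True)
        return ranked[:n]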

If you don't want to use a machine-learning approach and you still have a document collection, then you can use metrics like tf.idf to filter out interesting words.

Going one step further, you can use WordNet to find synonyms and replace a word with its synonym if the synonym's frequency is higher.
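
A sketch of that step via NLTK's WordNet interface (using NLTK here is my assumption; any WordNet binding would do):

    from nltk.corpus import wordnet   # needs: nltk.download('wordnet')

    def merge_synonyms(freq):
        """freq: dict of word -> count from the text. Fold each word into a
        synonym that is more frequent in this same text, so one tag stands
        in for the whole synonym group."""
        merged = dict(freq)
        for word in list(merged):
            synonyms = {lemma.name().lower()
                        for synset in wordnet.synsets(word)
                        for lemma in synset.lemmas()} - {word}
            for syn in synonyms:
                if merged.get(syn, 0) > merged.get(word, 0):
                    merged[syn] += merged.pop(word)  # move the count over
                    break
        return merged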

Manning & Schütze's Foundations of Statistical Natural Language Processing contains a much more thorough introduction to text categorization.

Torsten Marek
+4  A: 

There are existing web services for this. Three examples:

ceejayoz
+1  A: 

You want to do semantic analysis of the text.

Word frequency analysis is one of the easiest ways to do semantic analysis. Unfortunately (and obviously) it is also the least accurate one. It can be improved by using special dictionaries (e.g. of synonyms or word forms), "stop-lists" of common words, and other texts (to identify those "common" words and exclude them)...

As for other algorithms they could be based on:

  • Syntax analysis (e.g. trying to find the main subject and/or verb of a sentence)
  • Format analysis (analyzing headers, bold text, italics... where applicable; see the sketch after this list)
  • Reference analysis (if the text is on the Internet, for example, a link to it may describe it in a few words... used by some search engines)
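
For example, a crude take on the format-analysis idea for HTML input might look like this (the tag set and boost factor are arbitrary assumptions):

    import re
    from collections import Counter

    WORD = re.compile(r"[a-z']+")
    EMPHASIS = re.compile(r"<(h[1-6]|b|strong|em|i)[^>]*>(.*?)</\1>",
                          re.I | re.S)
    ANY_TAG = re.compile(r"<[^>]+>")

    def weighted_counts(html, boost=4):
        """Plain word frequency, except words inside headings, bold, or
        italics are counted `boost` extra times."""
        counts = Counter(WORD.findall(ANY_TAG.sub(" ", html).lower()))
        for _name, inner in EMPHASIS.findall(html):
            counts.update(WORD.findall(ANY_TAG.sub(" ", inner).lower()) * boost)
        return counts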

BUT... you should understand that these algorithms are merely heuristics for semantic analysis, not strict algorithms guaranteed to achieve the goal. The problem of semantic analysis has been one of the main problems in Artificial Intelligence/Machine Learning research since the first computers appeared.

Yacoder