How can I get the same results as Yahoo's Term Extraction service (http://developer.yahoo.com/search/content/V1/termExtraction.html)?

This question has been asked quite a few times before.

While trying to solve this with existing tools, I came across the "Text Analysis" that Solr performs on a document before indexing it, as described at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters. This analysis includes stemming as well.

So the final index will consist mostly of terms used to describe the document.
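To make the idea concrete, here is a minimal sketch of such an analysis chain (tokenizer, stop-word filter, stemming filter) in plain Python. It is not Solr's implementation: the stop-word list, the function names, and the crude suffix-stripping "stemmer" are all assumptions made up for illustration, and a real setup would use a proper stemmer such as Porter's.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; real analyzers ship much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


def tokenize(text):
    """Tokenizer step: lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Filter step: drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Filter step: strip a few common English suffixes.

    This is a crude stand-in for a real stemmer, not the Porter algorithm.
    """
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the whole chain: tokenize, remove stop words, then stem."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


def term_counts(text):
    """Count how often each analyzed term occurs in the text."""
    return Counter(analyze(text))
```

Calling `term_counts(some_text).most_common(10)` gives the most frequent analyzed terms, which is roughly the kind of data that ends up in the index. Frequency alone does not say which terms are significant.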

Is there a library that provides analyzers, tokenizers, and token filters for direct use? If Solr is the way to go, what is the best way to get this data out of Solr's index?

+2  A: 

Solr is a platform for building a custom search engine, so it does not seem to be the right tool for this job. The Wikipedia article about term extraction lists several web applications for term extraction in its "External links" section. OpenNLP also has a list of tools that may be useful; its Chunker in particular may help.

Yuval F
Yeah, Solr terms will only return the unique tokens (perhaps minus some common words, and after stemming etc.). It won't really tell you what is significant in the text. For what it's worth, you can pull the terms out of Solr via the TermsComponent: http://wiki.apache.org/solr/TermsComponent
mlathe
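
As a rough sketch of the TermsComponent suggestion, a request could be assembled as below. The base URL, the `/terms` handler path, and the field name are assumptions: the handler has to be configured in your `solrconfig.xml`, and the field must exist in your schema. `terms.fl` and `terms.limit` are TermsComponent parameters; the helper names are made up for illustration, and the shape of the returned JSON may differ between Solr versions, so it is returned here without interpretation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_terms_url(base_url, field, limit=10):
    """Build a TermsComponent request URL.

    base_url and the "/terms" handler path are assumptions about the
    local Solr setup; adjust them to match your configuration.
    """
    params = {
        "terms": "true",       # enable the component if the handler does not
        "terms.fl": field,     # field whose indexed terms should be listed
        "terms.limit": limit,  # maximum number of terms to return
        "wt": "json",          # ask for a JSON response
    }
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


def fetch_terms(base_url, field, limit=10):
    """Send the request and return the decoded JSON response as-is."""
    with urlopen(build_terms_url(base_url, field, limit)) as response:
        return json.load(response)
```

For example, `fetch_terms("http://localhost:8983/solr", "text")` would list indexed terms of a hypothetical `text` field, assuming a Solr instance is running at that address.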