Term extraction: Generating tags out of text | ansaurus


+2 Q:

Term extraction: Generating tags out of text

How can I get the same results as the Yahoo term extraction service at http://developer.yahoo.com/search/content/V1/termExtraction.html?

This question has been asked quite a few times before.

While trying to approach this problem with existing solutions, I stumbled upon the "Text Analysis" that Solr performs on a document before indexing, as described in http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters. That analysis includes stemming as well.

So the final index will consist mostly of terms used to describe the document.
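To make that concrete, here is a rough sketch of what such an analysis chain boils down to: tokenize, lowercase, drop stopwords, stem. It is plain Python, not Solr code. The stopword list and the suffix-stripping "stemmer" are toy stand-ins I made up for illustration, and real analyzers are far more thorough.

```python
import re
from collections import Counter

# Hypothetical stopword list, much smaller than what a real analyzer ships with.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on", "for", "as"}

def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Run the whole chain: tokenize, remove stopwords, stem."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOPWORDS]

def index_terms(text):
    """Count how often each analyzed term occurs in the text."""
    return Counter(analyze(text))

if __name__ == "__main__":
    print(index_terms("Indexing the documents and the index"))
```

The counts it produces are just term frequencies over normalized tokens, which is roughly the kind of data I would expect to find in the index.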

Is there a solution that provides analyzers, tokenizers, and token filters for direct use? If Solr is the way to go, what is the best way to get this data out of Solr's index?

+2 A:

Solr is a way to create a custom search engine, so it does not seem to be the right tool for this job. The Wikipedia article about term extraction lists several web applications for term extraction in its "External links" section. OpenNLP also has a list of tools which may be useful; its Chunker in particular may be helpful.

Yuval F 2009-07-09 09:04:15

Yeah, Solr terms will only return the unique tokens (perhaps minus some common words, and after stemming etc.). They won't really tell you what is significant in the text. For what it's worth, you can pull the terms out of Solr via the TermsComponent: http://wiki.apache.org/solr/TermsComponent
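A minimal sketch of querying the TermsComponent over HTTP, in Python. It rests on a few assumptions to check against your own setup: that a `/terms` request handler is registered in solrconfig.xml, that the base URL and the field name `text` (both placeholders here) match your install, and that the JSON response lists each field's terms as a flat `[term, count, term, count, ...]` array under `"terms"`. The response shape may differ between Solr versions.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def terms_url(base_url, field, limit=10):
    """Build a TermsComponent request URL for one field."""
    params = {
        "terms": "true",       # enable the component
        "terms.fl": field,     # field to read terms from
        "terms.limit": limit,  # maximum number of terms to return
        "wt": "json",          # ask for a JSON response
    }
    return f"{base_url}/terms?{urlencode(params)}"

def pair_up(flat):
    """Turn ["a", 3, "b", 1] into [("a", 3), ("b", 1)]."""
    return list(zip(flat[0::2], flat[1::2]))

def fetch_terms(base_url, field, limit=10):
    """Fetch (term, count) pairs; assumes the flat response layout described above."""
    with urlopen(terms_url(base_url, field, limit)) as resp:
        body = json.load(resp)
    return pair_up(body["terms"][field])

if __name__ == "__main__":
    # Placeholder base URL and field name; adjust to your Solr install.
    print(terms_url("http://localhost:8983/solr", "text", limit=5))
```

Keep in mind this only gives you terms and their document counts, so you would still need your own scoring on top to decide which terms are significant.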

mlathe 2010-01-28 18:49:38
