Hello,

I want to extract the relevant words from a text statement provided by the user. For example, for the question "How many sides are there in a rectangle?", the words should be 'rectangle', 'sides', 'many', 'how'.

I've come to realize that what I'm ultimately aiming for is an NLP question answering system. Right now, though, I only want to extract the required keywords from the question. The domain of the questions is not very broad.

I've come across various data mining tools, but I'm not sure whether they will actually be useful for this. They seem either too advanced or not exactly related.

Please let me know if there is a tool that suits this requirement, or whether I should go ahead and try coding it myself.

Please provide any pointers that you think might help.

+1  A: 

If all you have is the questions, you can try part-of-speech (POS) tagging and named entity recognition (NER). The nouns in particular would be of interest. There are a number of open source tools for this: Brill's POS tagger, LingPipe, OpenNLP, etc. However, if you also have a corpus from the domain you are interested in, you can extract key words and phrases from it by measuring how much their frequencies differ from those in some other base corpus. Given a question, you can then look for those key words and phrases.
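To make the corpus comparison idea concrete, here is a minimal sketch using only the Python standard library. It ranks words by the ratio of their smoothed relative frequency in the domain corpus to that in the base corpus. The function names (`domain_keywords`, `keywords_in_question`), the regex tokenizer, and the add-one smoothing are my own assumptions for illustration; a real system would likely use a proper tokenizer and a statistic such as log-likelihood.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def domain_keywords(domain_text, base_text, top_n=5, smoothing=1.0):
    """Rank domain words by how much more frequent they are in the
    domain corpus than in the base corpus (ratio of smoothed relative
    frequencies). Ties are broken alphabetically."""
    domain_counts = Counter(tokenize(domain_text))
    base_counts = Counter(tokenize(base_text))
    domain_total = sum(domain_counts.values())
    base_total = sum(base_counts.values())
    vocab_size = len(set(domain_counts) | set(base_counts))

    def score(word):
        # Additive smoothing keeps the ratio finite for words that
        # never occur in the base corpus.
        p_domain = (domain_counts[word] + smoothing) / (
            domain_total + smoothing * vocab_size
        )
        p_base = (base_counts[word] + smoothing) / (
            base_total + smoothing * vocab_size
        )
        return p_domain / p_base

    ranked = sorted(domain_counts, key=lambda w: (-score(w), w))
    return ranked[:top_n]


def keywords_in_question(question, keywords):
    """Return the question's tokens that are known domain keywords."""
    keyword_set = set(keywords)
    return [w for w in tokenize(question) if w in keyword_set]
```

You would call `domain_keywords` once on your domain and base corpora, then run `keywords_in_question` on each incoming question. Common words such as "a" or "has" appear in both corpora at similar rates, so they score low and drop out without an explicit stop list.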

srean
LingPipe is not open source. They provide source code, but it's still proprietary. Still, good approach.
larsmans
Oh! Good to know. Thanks for the correction. +1
srean
+1  A: 

Apart from srean's advice to use POS tagging and NER, many people use search engine tools (specifically Lucene, but several others exist) to do question answering. They index a set of documents that should contain the answer, use the question as a query, retrieve a set of documents, and filter those to find the answer. Search engine tools have built-in term weighting.
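As a rough illustration of the index-and-query idea, here is a toy sketch in plain Python. It is not Lucene's API or its actual scoring formula; `build_index` and `rank` are hypothetical names, and the weighting is a simple TF-IDF sum chosen only to show how term weighting makes rare question words count more than common ones.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Return per-document term counts and an IDF weight per term."""
    doc_term_counts = [Counter(tokenize(doc)) for doc in documents]
    doc_freq = Counter()
    for counts in doc_term_counts:
        doc_freq.update(counts.keys())  # count each term once per document
    n_docs = len(documents)
    # Terms that occur in fewer documents get a larger weight.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    return doc_term_counts, idf


def rank(question, doc_term_counts, idf):
    """Return ids of matching documents, best match first, scored by
    the summed TF-IDF weight of the question's terms."""
    query_terms = set(tokenize(question))
    scored = []
    for doc_id, counts in enumerate(doc_term_counts):
        score = sum(counts[t] * idf[t] for t in query_terms if t in idf)
        if score > 0:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

The question is used as the query unchanged; words that do not occur in any document are ignored, and words that occur in many documents contribute little to the score.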

That's the baseline setup; more advanced systems do all kinds of preprocessing on the question and the documents, including stop word filtering, POS tagging, parsing, NER, genetic algorithms, etc.

See this paper for an example of this setup.
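Of the preprocessing steps listed above, stop word filtering is the easiest to try on the question itself. A minimal sketch follows, assuming a small hand-picked stop list; `STOP_WORDS` and `extract_keywords` are made up for this example. Standard stop lists usually contain question words like "how" and "many" too, so for a question answering system you may want a custom list that keeps them, as this one does.

```python
import re

# Hypothetical, hand-picked stop list for illustration only. It
# deliberately omits question words such as "how" and "many".
STOP_WORDS = frozenset(
    {"a", "an", "the", "is", "are", "there", "in", "of", "to", "do", "does"}
)


def extract_keywords(question, stop_words=STOP_WORDS):
    """Return the question's words in order of first appearance,
    lowercased, without stop words or duplicates."""
    tokens = re.findall(r"[a-z]+", question.lower())
    seen = set()
    keywords = []
    for token in tokens:
        if token not in stop_words and token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords
```

For the question from the original post, `extract_keywords("How many sides are there in a rectangle?")` returns `['how', 'many', 'sides', 'rectangle']`.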

larsmans