Hello there, I'm working on a project: a website that connects to the NCBI (National Center for Biotechnology Information) and searches for articles there. The catch is that I have to do some text mining on all the results. I'm using Java for the text mining and AJAX with ICEfaces for the website.

What I have: a list of articles returned from a search. Each article has an ID and an abstract. The idea is to extract keywords from each abstract, then compare the keywords across all abstracts and find the ones repeated most often, so the website can show words related to the search.

Any ideas? I have searched the web a lot, and I know about Named Entity Recognition and Part-of-Speech tagging, and that there is the GENIA thesaurus for NER on genes and proteins. I have already tried stemming, stop word lists, etc. I just need to know the best approach to solve this problem. Thanks a lot.

+1  A: 

I would recommend a combination of POS tagging and string tokenizing to extract all the nouns from each abstract. Then use some sort of dictionary/hash to count the frequency of each noun and output the N most frequent ones. Combined with some other intelligent filtering mechanisms, that should do reasonably well at giving you the important keywords from the abstracts.
For POS tagging, check out the POS tagger at http://nlp.stanford.edu/software/index.shtml
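
To make the counting step concrete, here is a rough Java sketch. It assumes the abstracts have already been run through a POS tagger and that each token comes back as `word_TAG` with Penn Treebank tags (so nouns are `NN`, `NNS`, `NNP`, `NNPS`). That output format is an assumption, so adjust the separator to whatever your tagger actually emits. The class and method names (`NounCounter`, `countNouns`, `topN`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NounCounter {

    // Counts every token whose tag starts with "NN" across all tagged abstracts.
    // Assumption: tokens are whitespace-separated and look like "word_TAG".
    static Map<String, Integer> countNouns(List<String> taggedAbstracts) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tagged : taggedAbstracts) {
            for (String token : tagged.trim().split("\\s+")) {
                int sep = token.lastIndexOf('_');
                if (sep <= 0) {
                    continue; // no tag separator, or empty word
                }
                String word = token.substring(0, sep).toLowerCase();
                String tag = token.substring(sep + 1);
                if (tag.startsWith("NN")) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns the n most frequent words; ties are broken alphabetically.
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

You would call `countNouns` once with the tagged text of all abstracts from a search and then `topN(counts, 10)` to get the words to show. Lowercasing merges "Protein" and "protein", but it also merges case-sensitive gene names, so that part is a judgment call. Stemming or a stop word filter could be applied to `word` before counting.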

However, if you expect a lot of multi-word terms in your corpus, then instead of extracting just nouns you could take the most frequent n-grams for n = 2 to 4.
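
A minimal sketch of the n-gram variant, again in Java. The tokenization here (lowercase, split on anything that is not a letter, digit, or hyphen) is just an assumption to keep the example short, and `NGramCounter` / `countNGrams` are hypothetical names. Note that this naive version lets n-grams run across sentence boundaries and includes stop words, so you would probably want to filter those.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NGramCounter {

    // Counts all word n-grams with minN <= n <= maxN across the given texts.
    static Map<String, Integer> countNGrams(List<String> abstracts, int minN, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : abstracts) {
            // Naive tokenizer: keeps letters, digits and hyphens (e.g. "il-2").
            List<String> tokens = new ArrayList<>();
            for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}-]+")) {
                if (!t.isEmpty()) {
                    tokens.add(t);
                }
            }
            for (int n = minN; n <= maxN; n++) {
                for (int i = 0; i + n <= tokens.size(); i++) {
                    String gram = String.join(" ", tokens.subList(i, i + n));
                    counts.merge(gram, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Calling `countNGrams(abstracts, 2, 4)` gives the counts for n = 2 to 4 in one map, which you can then sort by value and cut off at the top N, the same way as with single nouns.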

adi92
Could you tell me about the models in POS taggers? What are they? How do I train a POS tagger? Do I have to update the training from time to time? Where do I get the models?
Kirill
I used their POS tagger a few months back. You don't have to train anything: they provide default models, which are pretty good. These models basically specify which words should be labelled with which parts of speech. You should start by downloading it and following the README instructions to get some sample output. I am not sure, but I think the tags it uses are the 'word level' tags at http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
adi92