tags:

views:

230

answers:

1

I'm looking to extract names and places from very short bursts of text example

 "cardinals vs jays in toronto"
 " Daniel Nestor and Nenad Zimonjic play Jonas Bjorkman w/ Kevin Ullyett, paris time to be announced"
"jenson button - pole position, brawn-mercedes - monaco".

This data is currently in a MySQL database, and I (pretty much) have a separate record for each athlete, though names are sometimes spelled wrong, etc.

I would like to extract the athletes and locations. I usually work in PHP, but haven't been able to find a library for entity extraction (and I may want to get deeper into some NLP and ML in the future).

From what I've found, LingPipe and NLTK seem to be the most recommended, but I can't figure out if either will really suit my purpose, or if something else would be better.

I haven't programmed in either Java or Python, so before I start learning new languages, I'm hoping to get some advice on what route I should follow, or other recommendations.

+2  A: 

What you're describing is named entity recognition. So I'd recommend checking out the other questions regarding this topic if you haven't already seen them. This looks like the most useful answer to me.

I can't really comment about whether NLTK or LingPipe is best suited for this task although from looking at the answers it looks like there's quite a few other resources written in Java.

One advantage of going with NLTK is that Python is very accessible as a language. The other advantage is that the NLTK book (which is available for free) offers an introduction to both Python and NLTK at the same time, which would be useful for you.

humble coffee