Hello!

I would like to use named entity recognition (NER) to find adequate tags for texts in a database.

I know there is a Wikipedia article about this and lots of other pages describing NER, but I would prefer to hear about this topic from you:

  • What experience have you had with the various algorithms?
  • Which algorithm would you recommend?
  • Which algorithm is the easiest to implement (PHP/Python)?
  • How do the algorithms work? Is manual training necessary?

Example:

"Last year, I was in London where I saw Barack Obama." => Tags: London, Barack Obama

I hope you can help me. Thank you very much in advance!

A: 

I don't really know about NER, but judging from that example, you could write an algorithm that searches for capitalized words or something like that. For that I would recommend a regex as the easiest solution to implement if you're thinking small.
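A rough sketch of that regex idea in Python, just to make it concrete. The function name `candidate_tags` and the rule for skipping sentence-initial words are my own assumptions, and this is a naive heuristic, not real NER: it depends entirely on correct capitalization.

```python
import re

# One or more consecutive capitalized words, e.g. "Barack Obama".
# Single letters such as "I" are not matched because a lowercase letter must follow.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def candidate_tags(text):
    """Return capitalized word runs that do not start a sentence (hypothetical helper)."""
    tags = []
    for match in CAPITALIZED_RUN.finditer(text):
        prefix = text[:match.start()].rstrip()
        # Skip words at the very start of the text or right after ., ! or ?
        # because they are capitalized for grammatical reasons.
        if not prefix or prefix[-1] in ".!?":
            continue
        tags.append(match.group())
    return tags


print(candidate_tags("Last year, I was in London where I saw Barack Obama."))
```

For the example sentence from the question this prints `['London', 'Barack Obama']`, but it will miss any entity that happens to start a sentence and anything written in lowercase.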

Another option is to compare the texts against a database of strings pre-identified as tags of interest.
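The lookup approach could look roughly like this. `KNOWN_TAGS` is a hypothetical hard-coded list standing in for whatever the database holds, and `match_known_tags` is a name made up for this sketch. Matching is case-insensitive here, so it does not depend on capitalization.

```python
import re

# Hypothetical list of known tags; in practice this would be loaded from the database.
KNOWN_TAGS = ["London", "Barack Obama", "New York"]


def match_known_tags(text, known_tags=KNOWN_TAGS):
    """Return the known tags that occur in the text as whole words, ignoring case."""
    found = []
    for tag in known_tags:
        # re.escape keeps any special characters in the tag from being read as regex syntax.
        pattern = r"\b" + re.escape(tag) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(tag)
    return found


print(match_known_tags("last year, i was in london where i saw barack obama."))
```

The obvious limitation is that it can only ever find entities that are already in the list.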

my 5 cents.

v3ga
This doesn't work. First, it only works in *correct* English texts. In addition to that, it doesn't work if there's no case sensitivity.
+4  A: 

To start with, check out http://www.nltk.org/ if you plan on working with Python. As far as I know the code isn't "industrial strength", but it will get you started.

Check out section 7.5 of http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html, but to understand the algorithms you will probably have to read through a lot of the book.

Also check out http://nlp.stanford.edu/software/CRF-NER.shtml. It's written in Java.

NER isn't an easy subject, and probably nobody will tell you "this is the best algorithm"; most of them have their pros and cons.

My 0.05 of a dollar.

Cheers,

Ale
+1 for suggesting nltk
pufferfish
NLTK sounds good but it requires installation via shell, doesn't it? I can't install anything via shell.
What do you mean by installation via shell? Check out http://www.nltk.org/download; it's enough to add nltk to your PYTHONPATH.
Ale
+1  A: 

It depends on whether you want:

To learn about NER: An excellent place to start is with NLTK, and the associated book.

To implement the best solution: Here you're going to need to look at the state of the art. Have a look at publications in TREC. A more specialised meeting is BioCreative (a good example of NER applied to a narrow field).

To implement the easiest solution: In this case you basically just want to do simple part-of-speech tagging and pull out the words tagged as nouns. You could use a tagger from NLTK, or even just look up each word in PyWordNet and tag it with its most common word sense.
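A minimal sketch of the "pull out the nouns" step, assuming the text has already been run through some part-of-speech tagger that emits Penn Treebank tags. The tagged list below is written by hand for illustration and is not the output of a real tagger; `noun_tags` is a hypothetical helper. Joining adjacent proper nouns is an extra step added here so that "Barack Obama" stays one tag.

```python
# Penn Treebank noun tags; NNP/NNPS are proper nouns.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
PROPER_TAGS = {"NNP", "NNPS"}


def noun_tags(tagged_tokens):
    """Collect nouns from (word, tag) pairs, merging adjacent proper nouns into one tag."""
    tags = []
    previous_was_proper = False
    for word, pos in tagged_tokens:
        is_proper = pos in PROPER_TAGS
        if is_proper and previous_was_proper:
            # Extend the previous proper noun, e.g. "Barack" + "Obama".
            tags[-1] += " " + word
        elif pos in NOUN_TAGS:
            tags.append(word)
        previous_was_proper = is_proper
    return tags


# Hand-tagged version of the example sentence (an assumption, not tagger output).
tagged = [
    ("Last", "JJ"), ("year", "NN"), (",", ","), ("I", "PRP"), ("was", "VBD"),
    ("in", "IN"), ("London", "NNP"), ("where", "WRB"), ("I", "PRP"),
    ("saw", "VBD"), ("Barack", "NNP"), ("Obama", "NNP"), (".", "."),
]
print(noun_tags(tagged))
```

This prints `['year', 'London', 'Barack Obama']`. Note that common nouns such as "year" come through as well, so keeping only the proper-noun tags may be closer to what the question asks for.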


Most algorithms require some sort of training, and perform best when they're trained on content that represents what you're going to ask them to tag.

pufferfish
I think even the easiest solution would need to do some n-gram analysis to try to find multiword entities.
Triptych
http://osteele.com/projects/pywordnet/ says "This is the old version of PyWordNet. PyWordNet was contributed to the NLTK project in 2006."
dfrankow
@Triptych: You'll find lots of n-grams that are "I love" and "of which"
dfrankow