ansaurus

Question

Perl and NLP, parse Names out of Biographies

Answer 1

+1 A:

Have you tried searching CPAN?

http://search.cpan.org/search?query=NLP&mode=all

I also tried searching for "Natural Language" and found the following that you might be interested in:

Lingua::EN::Tagger

Also, if you must roll your own parser for the NLP side, check out Regexp::Grammars. It is the successor to Parse::RecDescent.

molecules 2010-07-15 19:13:30

Answer 2

+4 A:

Extracting names from data is hard, and there are a variety of solutions. For named entity extraction you've got the following:

The naive approach (Lingua::EN::NamedEntity). I remember looking at this and being unimpressed with the output.
The dictionary approach. I've used this, but it produces lots of false negatives, and I'm not too fond of the code underneath it.
An open source binary with a Perl interface (not recommended, and I say that as the author of this CPAN library; setting it up is fiddly too).
The best solution is the proprietary web service, used through the Net::Calais Perl wrapper.

Net::Calais is by far the best bet for speed and accuracy. Go with the Stanford library if you need the underlying implementation to be open source.
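To make the false-negative problem with the dictionary approach concrete, here is a minimal sketch in core Perl. It is not the code of any CPAN module mentioned above; the `%GIVEN_NAME` list and the `dictionary_names` helper are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

# Hypothetical dictionary of given names; a real list would hold thousands.
my %GIVEN_NAME = map { $_ => 1 } qw(Ada Charles Mary Grace);

# Hypothetical helper: returns "First Last" pairs whose first word is in
# the dictionary.
sub dictionary_names {
    my ($text) = @_;
    my @names;
    # Walk over adjacent capitalized word pairs and keep a pair only when
    # its first word is a known given name.
    while ($text =~ /\b([A-Z][a-z]+)\s+([A-Z][a-z]+)\b/g) {
        push @names, "$1 $2" if $GIVEN_NAME{$1};
    }
    return @names;
}

my $bio = 'Ada Lovelace met Charles Babbage in 1833. Evariste Galois died in 1832.';
print "$_\n" for dictionary_names($bio);
```

With this sample text the sketch finds "Ada Lovelace" and "Charles Babbage" but silently skips "Evariste Galois", because "Evariste" is not in the dictionary. That is the kind of false negative described above.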

singingfish 2010-07-16 05:48:44

I found the Stanford Java package while searching. I managed to get it set up, and although I had to fiddle quite a bit, I got the server running and returning a string of marked entities. However, I could never get the list_entities and entities_list methods to work; they always returned empty arrays. Otherwise, it worked great. If you're the author, that's awesome! I'm working on a solution from a different angle right now, but I'm going to try the Stanford package again some time later. Would you be able to help me out with it?

Sho Minamimoto 2010-07-22 20:01:18

I'm not the author, but I did set up the Perl interface. If you want maintainership of the CPAN module, please email my CPAN address :). For my purposes Net::Calais serves me better (unfortunately), so I doubt I'll be doing further work on this in the foreseeable future.

singingfish 2010-07-23 01:25:38

Answer 3

A:

I don't know of any Perl modules which do processing of English in order to break it into parts of speech. I expect there are libraries out there which do that, in C or C++ or something, so if you don't find a good answer, maybe you can broaden your search.

One easy hack is to check for two words which are both capitalized:

if (/[A-Z][a-z]+\s+[A-Z][a-z]/) { ...

or check for titles:

if (/(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+/) { ...
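As a rough sketch of how those two checks could be turned into extraction rather than a yes/no test, the script below collects every match from a string. The helper names `capitalized_pairs` and `titled_names` and the sample text are made up for illustration, and the patterns are only lightly adapted from the ones above (capture groups, word boundaries, and the `/g` flag added).

```perl
use strict;
use warnings;

# Hypothetical helper: every run of two adjacent capitalized words.
sub capitalized_pairs {
    my ($text) = @_;
    # In list context, /g returns all captured matches.
    return $text =~ /\b([A-Z][a-z]+\s+[A-Z][a-z]+)\b/g;
}

# Hypothetical helper: a title, an optional period, then one capitalized word.
sub titled_names {
    my ($text) = @_;
    return $text =~ /\b((?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+)\b/g;
}

my $bio = 'Ada Lovelace worked with Charles Babbage. Later Dr. Somerville wrote to her.';
print "pair:   $_\n" for capitalized_pairs($bio);
print "titled: $_\n" for titled_names($bio);
```

Note that the capitalized-pair check also picks up "Later Dr" in this sample, since any capitalized word at the start of a sentence looks like a name to it. Treat the output as a list of candidates to filter, not as a list of names.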

Kinopiko 2010-07-16 06:25:44

Lingua::EN::Tagger was already mentioned as a Perl module which processes English in order to break it into parts of speech.

singingfish 2010-07-16 07:37:54

@singingfish: That's not a good reason to downvote my post.

Kinopiko 2010-07-16 08:11:20

@Kinopiko The naive approach (Lingua::EN::NamedEntity) listed above does the same thing you suggest, and does it badly. I downvoted the post because of the implied claim that there may not be English POS taggers for Perl.

singingfish 2010-07-16 22:58:54
