I'm pretty new to NLP in general, but getting really good at Perl, and I was wondering what kind of powerful NLP modules are out there. Basically, I have a file with a bunch of paragraphs, and some of them are people's biographies. So, first I need to look for a person's name, and that helps with the rest of the process later.

So I was roughly starting with something like this:

# @PP holds the paragraphs; @pieces holds fragments that rule a candidate out.
foreach my $paragraph (@PP) {
    if ($paragraph =~ /^(\w+ \w\. \w+|\w+ \w+)( also|)( has served| served| worked| joined| currently serves| has| was| is|, )/) {
        my $possibleName = $1;
        my $badName = 0;
        foreach my $piece (@pieces) {
            if ($possibleName =~ /$piece/) {
                $badName = 1;
                last;    # one matching fragment is enough to reject the candidate
            }
        }
        push @namePile, $possibleName if $badName == 0;
    }
}

This works because most of the names appear at the beginning of a paragraph. I then look for keywords that denote action or possession, but right now that picks up extra junk that is not a name. There has to be a module to do this, right?
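
For anyone who wants to reproduce the problem, here is a minimal, self-contained sketch of the approach described above. The sample paragraphs, the stop fragments in @pieces, and the sub name leading_names are all made up for illustration; only the regex comes from the question.

```perl
use strict;
use warnings;

# Hypothetical sample data; none of these strings come from the real file.
my @paragraphs = (
    'John Q. Smith has served on the board since 2001.',
    'Jane Doe joined the firm as a partner.',
    'The company was founded in 1990.',
);

# Hypothetical stop fragments (treated as regexes) that mark a candidate as junk.
my @pieces = ('^The ', 'company');

# Return the leading "name" of each paragraph that matches the pattern
# and does not contain any of the stop fragments.
sub leading_names {
    my ($paragraphs, $pieces) = @_;
    my @names;
    for my $paragraph (@$paragraphs) {
        next unless $paragraph =~ /^(\w+ \w\. \w+|\w+ \w+)(?: also)?(?: has served| served| worked| joined| currently serves| has| was| is|, )/;
        my $candidate = $1;
        next if grep { $candidate =~ /$_/ } @$pieces;
        push @names, $candidate;
    }
    return @names;
}

my @found = leading_names(\@paragraphs, \@pieces);
print "$_\n" for @found;
```

With an empty stop list, the third paragraph yields "The company" as a name, which is exactly the kind of junk the pattern lets through.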

+1  A: 

Have you tried searching CPAN?

http://search.cpan.org/search?query=NLP&mode=all

I also tried searching for "Natural Language" and found the following that you might be interested in:

Lingua::EN::Tagger

Also, if you must roll your own, with regard to NLP, you want to check out Regexp::Grammars. This is the successor to Parse::RecDescent.

molecules
+4  A: 

Extracting names from data is hard. There are a variety of solutions. For named entity extraction you've got the following:

  1. The naive approach (Lingua::EN::NamedEntity). I remember looking at this and being unimpressed with the output.
  2. The dictionary approach. I've used this, but it produces lots of false negatives, and I'm not too fond of the code underneath it.
  3. An open source binary with a Perl interface (not recommended, even though I'm the author of this CPAN library; setting it up is fiddly too).
  4. The best solution is the proprietary web service with the Net::Calais Perl wrapper.

Net::Calais is by far the best bet for speed and accuracy. Go with the Stanford library if you need the underlying implementation to be open source.

singingfish
I found the Stanford Java package while searching. I managed to get it set up, though I did have to fiddle quite a bit, and I got the server running and returning a string of marked entities. However, I could never get the list_entities and entities_list methods to work; they always returned empty arrays. Otherwise, it worked great. If you're the author, that's awesome! I'm working on a solution from a different angle right now, but I'm going to try the Stanford package some time later. Would you be able to help me out with it?
Sho Minamimoto
I'm not the author, but I did set up the Perl interface. If you want maintainership of the CPAN module, please email my CPAN address :). For my purposes Net::Calais serves me better (unfortunately), so I doubt I'll be doing further work on this in the foreseeable future.
singingfish
A: 

I don't know of any Perl modules which do processing of English in order to break it into parts of speech. I expect there are libraries out there which do that, in C or C++ or something, so if you don't find a good answer, maybe you can broaden your search.

One easy hack is to check for two words which are both capitalized:

if (/[A-Z][a-z]+\s+[A-Z][a-z]/) { ...

or check for titles:

if (/(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+/) { ...
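
As a rough sketch, the two checks above can be combined into one pattern that pulls every candidate out of a string. The sub name candidate_names and the sample sentence are made up for illustration, and this is still only a heuristic, not real named entity recognition.

```perl
use strict;
use warnings;

# Either a title followed by one or more capitalized words,
# or two or more capitalized words in a row.
my $name_re = qr/\b((?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*|[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)/;

# Return every substring of $text that looks like a name.
sub candidate_names {
    my ($text) = @_;
    my @names = $text =~ /$name_re/g;
    return @names;
}

print "$_\n" for candidate_names('Dr. Alice Jones met Bob Smith in Paris.');
```

For the sample sentence this prints "Dr. Alice Jones" and "Bob Smith". It also shows the weakness of the hack: any capitalized word at the start of a sentence gets swept in, so "Yesterday Bob Smith resigned." yields "Yesterday Bob Smith".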
Kinopiko
Lingua::EN::Tagger was already mentioned as a Perl module which does processing of English in order to break it into parts of speech.
singingfish
@singingfish: That's not a good reason to downvote my post.
Kinopiko
@Kinopiko The naive approach (Lingua::EN::NamedEntity) listed above does the same thing you suggest, and does it badly. I downvoted the post because of the implied claim that there may not be English POS taggers for Perl.
singingfish