Your favorite natural language parser?

+26 Q:

This is just a poll on which parser you like to use for syntactic parsing of natural-language sentences. I am interested in complete software toolkits/solutions. A good answer would list at least some of the following:

The name of the parser (obviously) and a link to its webpage.
The (programming!) language(s) it's written in.
The (human) language(s) it's written for.
The project's license.
The underlying grammar formalism (HPSG, CCG, (L)TAG ...).
The project's status (is it in active development, how stable/usable is it).
The parsing depth (shallow, deep, wide-coverage ...).
The underlying algorithm (bottom-up, top-down, Tomita, Earley, left corner (LRN, which N?)).
Tools it relies on (specific taggers/tagsets, tokenizer/chunker or even lemmatizer).

I don't think I'd accept a definitive "answer" to this question. Just go ahead and enumerate your favorite solutions.

Thanks for your time :-)

+8 A:

I don't know if it's my favorite since I don't use all the natural language parsing features, but it is useful for a wide range of linguistic tasks:

NLTK, the Natural Language Tool Kit

Python
easy to use
stable, active development (.95 released 28 August 2008)
outstanding documentation, including an eBook on various computational linguistic topics
free in cost, GPL (code), Creative Commons Variant (text)
stop lists and corpora in a few European languages
variety of tokenizers
Porter stemmers, chunkers, first-order Bayesian models, and interfaces to WordNet.

I personally found that many of the tokenizers and lexers were too slow for my applications, and simpler regex-based ones worked just fine, so YMMV.
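
A minimal sketch of what a "simpler regex-based" tokenizer might look like, using only Python's standard `re` module. The pattern and the `tokenize` name are assumptions made for illustration, not anything taken from NLTK or from the poster's own code:

```python
import re

# Hypothetical token pattern (an assumption, tune it to your data):
#   1. numbers, optionally with a decimal part
#   2. words, optionally joined by internal apostrophes or hyphens
#   3. any single character that is neither a word character nor whitespace
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+(?:['-]\w+)*|[^\w\s]")


def tokenize(text):
    """Return the list of tokens found in text, in order of appearance."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("Don't panic, it costs 3.50 dollars."))
```

A single precompiled pattern like this trades linguistic accuracy for speed; whether that trade is acceptable depends on the application.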

Gregg Lind 2008-09-18 01:09:47

+7 A:

The parser I use the most in my research is probably the HPSG parser PET. Going through your list:

PET – a platform for experimentation with efficient HPSG processing techniques.
Implemented in C++ (with some functionality in Lisp).
Compatible grammars for English, Japanese, German, Greek, French, Korean, and more. See here for more information.
GPL licensed.
Head-driven Phrase Structure Grammar. Specifically, the Delph-In/Matrix formalism.
Deep, but more support for integration with shallow parsing techniques is being added. Provides semantic analysis in MRS logical form.
Under active development. Packages are available for Ubuntu Linux at Ubuntu-NLP. Users of other operating systems should see here.
Bottom-up unification-based chart parsing.
No dependencies on external tools. There is some integration with POS taggers in English and Japanese, however.

See here for a demo of English parsing.

underspecified 2008-09-18 09:54:59

+6 A:

I just found a new parser, written by Dan Bikel.

The name is just "Parsing Engine", to be found here.
It's written in Java, using the Penn Treebank Tagset
Pre-trained data exists for English, Chinese and Arabic. Korean is announced as "Coming soon"
The license is strange. Non-commercial for educational purposes, but you are not allowed to redistribute, and you have to agree to some sort of EULA.
It's a statistical CYK parser, using "language-packs" to ease parsing of different languages.
Active development, the guy just had a 1.0 release
The parser seems to be pretty wide-coverage/deep parsing oriented.
Comes with its own tagger, but can also read the Penn Treebank tagset in the typical Penn Lisp/oneliner format.
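
To give a rough idea of what "statistical CYK" means, here is a toy sketch of probabilistic CYK over a grammar in Chomsky normal form. It is not Bikel's implementation and says nothing about his actual model; the grammar, the probabilities and all function names are invented for illustration:

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form. Every rule and probability is made up.
BINARY = {  # (left child, right child) -> [(parent, rule probability)]
    ("NP", "VP"): [("S", 1.0)],
    ("DT", "NN"): [("NP", 1.0)],
    ("VB", "NP"): [("VP", 1.0)],
}
LEXICAL = {  # word -> [(tag, probability)]
    "the": [("DT", 1.0)],
    "dog": [("NN", 0.5)],
    "cat": [("NN", 0.5)],
    "chased": [("VB", 1.0)],
}


def cyk(words):
    """Fill a chart bottom-up; chart[(i, j)][label] = (best prob, backpointer)."""
    n = len(words)
    chart = defaultdict(dict)
    # Width-1 spans come straight from the lexicon.
    for i, word in enumerate(words):
        for tag, prob in LEXICAL.get(word, []):
            chart[(i, i + 1)][tag] = (prob, word)
    # Wider spans are built by combining two adjacent, already-filled spans.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for left, (p_left, _) in chart[(i, k)].items():
                    for right, (p_right, _) in chart[(k, j)].items():
                        for parent, p_rule in BINARY.get((left, right), []):
                            prob = p_rule * p_left * p_right
                            best = chart[(i, j)].get(parent, (0.0, None))[0]
                            if prob > best:  # keep only the most probable analysis
                                chart[(i, j)][parent] = (prob, (k, left, right))
    return chart


def build(chart, i, j, label):
    """Follow backpointers to produce a Penn-style bracketed string."""
    _, back = chart[(i, j)][label]
    if isinstance(back, str):  # lexical entry: the backpointer is the word itself
        return f"({label} {back})"
    k, left, right = back
    return f"({label} {build(chart, i, k, left)} {build(chart, k, j, right)})"


def parse(words, root="S"):
    """Return the best bracketed parse rooted in `root`, or None if there is none."""
    chart = cyk(words)
    if root not in chart[(0, len(words))]:
        return None
    return build(chart, 0, len(words), root)


if __name__ == "__main__":
    print(parse("the dog chased the cat".split()))
```

A real statistical parser would estimate the rule probabilities from a treebank and handle unknown words; this sketch only shows the chart-filling idea.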

The whole thing is based on a client/server model. There is a class called Switchboard, that serves as a hypervisor. Clients and Servers can register with the switchboard. A Server is typically a DecoderServer, which serves as the actual parser. A client will typically be a Parser, which is nothing more than a relay to the DecoderServer. The Parser class can read files and parse their input format, convert it to something the DecoderServer understands, and ultimately send the result its way.

While the whole model sounds nice in theory, in practice I found it to be hardly usable, if at all. It's largely undocumented, and the only intended use case seems to be running it from the command line. Implementing it as a backend for an applet in a Tomcat servlet, I had great trouble getting it going. For example, Bikel suffers from the same resource-handling dementia as everybody else in NLP, in that the API wants a String denoting the location of the model file. Internally, the class then opens an ObjectInputStream on that location and reads the serialized model from it. How stupid is this? Why not take an ObjectInputStream directly? In a secure Tomcat environment, you can only access stuff via JNDI, which makes this approach useless.

But as far as parsers go, this is a good one. Pretty fast, pretty stable, and with a rather nice coverage.

Aleksandar Dimitrov 2008-09-22 15:46:12

What kind of deep parsing does it support?

underspecified 2008-09-23 15:33:34

There's a paper about it - I'm currently printing it. As soon as I've read it, I'll fill you in on the details.

Aleksandar Dimitrov 2008-09-24 10:34:31

OK, I've added some info now :-)

Aleksandar Dimitrov 2008-10-18 13:34:53

+5 A:

The Curran & Clark Parser (available here: http://svn.ask.it.usyd.edu.au/trac/candc/wiki) parses words into CCG categories based on the context around them, creating compositionally well-formed sentences using four basic rules for satisfying non-basic types.
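
As a rough illustration of how CCG categories combine, here is a minimal sketch of the two function-application rules that are standard in CCG. The answer does not say which four rules it has in mind, so nothing here should be read as the C&C parser's rule set; the tuple encoding of categories and the function names are assumptions made for this example:

```python
# Atomic categories are plain strings ("S", "NP").
# Complex categories are tuples (result, slash, argument):
#   X/Y  is (X, "/", Y)   and looks for its argument Y to the right
#   X\Y  is (X, "\\", Y)  and looks for its argument Y to the left


def forward_apply(left, right):
    """X/Y  Y  =>  X   (returns None if the rule does not apply)."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None


def backward_apply(left, right):
    """Y  X\\Y  =>  X   (returns None if the rule does not apply)."""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


if __name__ == "__main__":
    NP = "NP"
    VP = ("S", "\\", NP)   # S\NP: needs a subject NP on its left
    TV = (VP, "/", NP)     # (S\NP)/NP: a transitive verb

    verb_phrase = forward_apply(TV, NP)         # "chased" + "the cat"
    sentence = backward_apply(NP, verb_phrase)  # "the dog" + verb phrase
    print(verb_phrase, sentence)
```

Other combinators, such as composition and type-raising, are left out of this sketch.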

Theoretically speaking, it's extremely computationally sound (nothing Turing complete, rare departures from context-free), amenable to the addition of the kinds of features found in unification grammar, and extremely fast.

For something with a little bit more semantics, look into OpenCCG, which uses a more fully fleshed-out grammar but requires a bit more preparation.

Robert Elwell 2008-10-10 22:35:36

+6 A:

I found the Berkeley Parser suitable for my purposes; it can be found here

It's written in Java, so it's platform independent
It is a statistical NL parser and could process any language. So far resources are available for Chinese, English, French and German, and I'm using it for Bulgarian
It's distributed under the GNU General Public License
It accepts Penn-format annotated data (in sentence-per-line fashion).
It performs deep parsing
It doesn't rely on any additional tools like a sentence splitter, POS tagger, etc.
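
For readers unfamiliar with the Penn bracketed format mentioned above, here is a minimal sketch of reading one tree per line into nested tuples. It is not part of the Berkeley Parser; the function names are made up, and it assumes well-formed input in which every bracket carries a label:

```python
import re


def read_tree(line):
    """Parse one bracketed tree into (label, children); leaves are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", line)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def leaves(tree):
    """Return the words of the tree, left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


if __name__ == "__main__":
    tree = read_tree("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
    print(tree)
    print(leaves(tree))
```

With sentence-per-line data, calling `read_tree` on each non-empty line of the file is enough to load a treebank into memory.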

2009-03-02 12:35:33

maybe remove those spacings so it becomes a list instead of code? =)

Svish 2009-03-02 12:42:42

I can't find any in depth documentation for this.

Rosarch 2009-11-24 22:19:10

@Rosarch: There is not much but the code is reasonably easy to read.

Nathan Sanders 2010-10-01 00:35:43

+4 A:

The Stanford NLP, http://nlp.stanford.edu/software/lex-parser.shtml

Java

English, Arabic, Chinese, German

GNU GPL v2 (Non profit only)

Statistical Part of Speech.

Under development, there is a stable release as of September 09, I believe.

The parsing depth (shallow, deep, wide-coverage ...).

The underlying algorithm (bottom-up, top-down, Tomita, Earley, left corner (LRN, which N?)).

Tools it relies on (specific taggers/tagsets, tokenizer/chunker or even lemmatizer).

Don't know these, I'm a pretty light user. :-p

piggles 2010-01-10 08:10:13

It's a deep parser, which does PCFG parsing (Penn Treebank) and can convert its PCFG parses into dependency parses. It doesn't rely on any other tools; it can do its own tokenization.

Ken Bloom 2010-03-08 15:17:44

@Ken: Thanks :)

piggles 2010-03-08 16:38:45

I'm currently looking at the Enju parser. I have used the Stanford parser, the Berkeley parser and PET.

The observations below are based on a sample dataset of (cleaned) sentences from a few websites. Stanford had a measurable degree of errors. Berkeley's parsing was better, but its tagging and parsing still lacked accuracy. PET was by far the best of the three, but it is difficult to set up and use.

So my search for a better parser has currently led me to the Enju parser. It is also a probabilistic HPSG parser, and it gives both phrase structures and predicate-argument structures. Its XML output can be converted to PTB tag format.

Enju Parser
C++
English, trained on Penn, Brown and GENIA
Free for research purposes.
HPSG
active development, stable as far as I have used.
deep parser
supertagger, up - a parser for unification-based grammars. (packaged with the parser)

Sharmila 2010-10-14 18:27:18
