This is just a poll on what parser you like to use for parsing sentences of natural language syntactically. I am interested in complete software toolkits/solutions. A good answer would list at least some of the following:

  • The name of the parser (obviously) and a link to its webpage.
  • The (programming!) language(s) it's written in.
  • The (human) language(s) it's written for.
  • The project's license.
  • The underlying grammar formalism (HPSG, CCG, (L)TAG ...).
  • The project's status (is it in active development, how stable/usable is it).
  • The parsing depth (shallow, deep, wide-coverage ...).
  • The underlying algorithm (bottom-up, top-down, Tomita, Earley, left corner (LRN, which N?)).
  • Tools it relies on (specific taggers/tagsets, tokenizer/chunker or even lemmatizer).

I don't think I'd accept a definitive "answer" to this question. Just go ahead and enumerate your favorite solutions.

Thanks for your time :-)

+8  A: 

I don't know if it's my favorite since I don't use all the natural language parsing features, but it is useful for a wide range of linguistic tasks:

NLTK, the Natural Language Tool Kit

  • python
  • easy to use
  • stable, active development (.95 released 28 August 2008)
  • outstanding documentation, including an eBook on various computational linguistic topics
  • free in cost, GPL (code), Creative Commons Variant (text)
  • stop lists and corpora in a few European languages
  • variety of tokenizers
  • Porter stemmers, chunkers, first-order Bayesian models, and interfaces to WordNet.

I personally found that many of the tokenizers and lexers were too slow for my applications, and simpler regex-based ones worked just fine, so YMMV.
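As a rough illustration of what "simpler regex-based" can mean, here is a minimal sketch of a one-pattern tokenizer. The pattern and the `tokenize` name are my own assumptions, not anything taken from NLTK, and the pattern would need tuning for real text.

```python
import re

# Hypothetical pattern (an assumption, not NLTK's): match, in order,
# a number with an optional decimal part, a word that may contain internal
# hyphens or apostrophes, or any single non-space punctuation character.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Return the list of tokens found in text, left to right."""
    return TOKEN_RE.findall(text)


print(tokenize("Don't panic, it costs 3.50!"))
# ["Don't", 'panic', ',', 'it', 'costs', '3.50', '!']
```

A single precompiled pattern like this is fast because all the work happens inside one `findall` call, but it knows nothing about abbreviations, URLs, or language-specific rules.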

Gregg Lind
+7  A: 

The parser I use the most in my research is probably the HPSG parser PET. Going through your list:

  • PET – a platform for experimentation with efficient HPSG processing techniques.
  • Implemented in C++ (with some functionality in Lisp).
  • Compatible grammars for English, Japanese, German, Greek, French, Korean, and more. See here for more information.
  • GPL licensed.
  • Head-driven Phrase Structure Grammar. Specifically, the Delph-In/Matrix formalism.
  • Deep, but more support for integration with shallow parsing techniques is being added. Provides semantic analysis in MRS logical form.
  • Under active development. Packages are available for Ubuntu Linux at Ubuntu-NLP. Users of other operating systems should see here.
  • Bottom-up unification-based chart parsing.
  • No dependencies on external tools. There is some integration with POS taggers in English and Japanese, however.

See here for a demo of English parsing.

underspecified
+6  A: 

I just found a new parser, written by Dan Bikel.

  • The name is just "Parsing Engine", to be found here.
  • It's written in Java, using the Penn Treebank Tagset
  • Pre-trained data exists for English, Chinese and Arabic. Korean is announced as "Coming soon"
  • The license is strange. Non-commercial for educational purposes, but you are not allowed to redistribute, and you have to agree to some sort of EULA.
  • It's a statistical CYK parser, using "language-packs" to ease parsing of different languages.
  • Active development, the guy just had a 1.0 release
  • The parser seems to be pretty wide-coverage/deep parsing oriented.
  • Comes with its own tagger, but can also read the Penn Treebank tagset in the typical Penn Lisp/oneliner format.
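
For readers unfamiliar with the "statistical CYK" idea mentioned in the list above, here is a minimal sketch of probabilistic CYK decoding over a toy grammar in Chomsky normal form. It is not how Bikel's engine is implemented; the grammar, the probabilities, and all function names are made up for illustration.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form. Every rule and probability is invented.
LEXICAL = {                      # word -> [(tag, P(tag -> word))]
    "she": [("NP", 0.4)],
    "fish": [("NP", 0.6)],
    "eats": [("V", 1.0)],
}
BINARY = [                       # (parent, left child, right child, probability)
    ("S", "NP", "VP", 1.0),
    ("VP", "V", "NP", 1.0),
]


def cky(words):
    """Fill the chart: best[(i, j)][label] = (probability, backpointer)."""
    n = len(words)
    best = defaultdict(dict)
    for i, word in enumerate(words):
        for tag, prob in LEXICAL.get(word, []):
            best[(i, i + 1)][tag] = (prob, word)
    for width in range(2, n + 1):            # span length, shortest first
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):        # split point
                for parent, left, right, rule_prob in BINARY:
                    if left in best[(i, k)] and right in best[(k, j)]:
                        prob = rule_prob * best[(i, k)][left][0] * best[(k, j)][right][0]
                        if prob > best[(i, j)].get(parent, (0.0, None))[0]:
                            best[(i, j)][parent] = (prob, (k, left, right))
    return best


def build(best, i, j, label):
    """Follow backpointers and return a bracketed tree string."""
    _, back = best[(i, j)][label]
    if isinstance(back, str):                # lexical entry: back is the word
        return f"({label} {back})"
    k, left, right = back
    return f"({label} {build(best, i, k, left)} {build(best, k, j, right)})"


def parse(words, root="S"):
    """Return (probability, tree) for the best parse, or None if there is none."""
    best = cky(words)
    if root not in best[(0, len(words))]:
        return None
    return best[(0, len(words))][root][0], build(best, 0, len(words), root)


result = parse("she eats fish".split())
```

The chart is filled bottom-up, from single words to the whole sentence, and only the most probable analysis per label and span is kept. That pruning is what makes the search polynomial rather than exponential.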

The whole thing is based on a client/server model. There is a class called Switchboard, that serves as a hypervisor. Clients and Servers can register with the switchboard. A Server is typically a DecoderServer, which serves as the actual parser. A client will typically be a Parser, which is nothing more than a relay to the DecoderServer. The Parser class can read files and parse their input format, convert it to something the DecoderServer understands, and ultimately send the result its way.

While the whole model sounds nice in theory, in practice I found it hardly usable, if at all. It's largely undocumented, and the only intended use case seems to be running it from the command line. Using it as a backend for an applet in a Tomcat servlet, I had great trouble getting it going. For example, Bikel suffers from the same resource-handling dementia as everybody else in NLP: the API wants a String denoting the location of the model file. Internally, the class then opens an ObjectInputStream on that location and reads the serialized model from it. How stupid is this? Why not take an ObjectInputStream directly? In a secure Tomcat environment, you can only access resources via JNDI, which makes this approach useless.
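
The API complaint is language-independent, so here is a Python analogue of the design I would have preferred. It is a sketch under my own assumptions: `load_model` and `load_model_from_path` are hypothetical names, and `pickle` stands in for Java serialization.

```python
import io
import pickle


def load_model(stream):
    """Deserialize a model from any binary file-like object.

    The caller decides where the bytes come from (file, container resource,
    database, socket). Only unpickle data you trust.
    """
    return pickle.load(stream)


def load_model_from_path(path):
    """Convenience wrapper for callers that do have a filesystem path."""
    with open(path, "rb") as handle:
        return load_model(handle)


# Usage: the model bytes never touch the filesystem here.
blob = pickle.dumps({"rules": 3})        # stand-in for a serialized model
model = load_model(io.BytesIO(blob))
```

With the stream-based function as the core entry point, the path-based version becomes a thin convenience layer instead of the only way in.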

But as far as parsers go, this is a good one. Pretty fast, pretty stable, and with a rather nice coverage.

Aleksandar Dimitrov
What kind of deep parsing does it support?
underspecified
There's a paper about it - I'm currently printing it. As soon as I've read it, I'll fill you in on the details.
Aleksandar Dimitrov
OK, I've added some info now :-)
Aleksandar Dimitrov
+5  A: 

The Curran & Clark Parser (available here: http://svn.ask.it.usyd.edu.au/trac/candc/wiki) assigns CCG categories to words based on the context around them, then combines those categories into compositionally well-formed sentences using four basic rules for satisfying non-basic types.

Theoretically speaking, it's extremely computationally sound (nothing Turing complete, rare departures from context-free), amenable to the addition of the kinds of features found in unification grammar, and extremely fast.
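
To make "satisfying non-basic types" concrete, here is a minimal sketch of the two application rules of CCG (forward and backward application). It shows application only, which is just part of any CCG rule set, and it is not code from the C&C parser. The tuple encoding of categories and all function names are assumptions of mine.

```python
# Atomic categories are strings ("S", "NP"). Complex categories are tuples
# (result, slash, argument), so ("S", "\\", "NP") stands for S\NP.

def forward_apply(left, right):
    # Forward application: X/Y followed by Y yields X.
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None


def backward_apply(left, right):
    # Backward application: Y followed by X\Y yields X.
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


def combine(left, right):
    """Try both application rules; return the new category or None."""
    return forward_apply(left, right) or backward_apply(left, right)


# A transitive verb has category (S\NP)/NP: it wants an object NP on its
# right, then a subject NP on its left.
TRANSITIVE = (("S", "\\", "NP"), "/", "NP")
verb_phrase = combine(TRANSITIVE, "NP")      # "eats" + "fish"
sentence = combine("NP", verb_phrase)        # "she" + "eats fish"
```

Here `verb_phrase` comes out as S\NP and `sentence` as S, which is the sense in which each word's "non-basic type" gets satisfied by its neighbours.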

For something with a little bit more semantics, look into OpenCCG, which uses a more fully fleshed-out grammar but requires a bit more prepping.

Robert Elwell
+6  A: 

I found the Berkeley Parser suitable for my needs; it can be found here

  • It's written in Java, so it's platform independent
  • It is a statistical NL parser and can be trained for any language. So far, resources are available for Chinese, English, French and German, and I'm using it for Bulgarian
  • It's distributed under the GNU General Public License
  • It accepts Penn-format annotated data (one sentence per line).
  • It performs deep parsing
  • It doesn't rely on any additional tools like a sentence splitter, POS tagger, etc.
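
For anyone who has not seen Penn-format data, each line holds one bracketed tree. Below is a minimal sketch of reading such a line into nested tuples. The function names are hypothetical, this is not the Berkeley Parser's own reader, and malformed input is not handled.

```python
import re


def read_tree(line):
    """Parse one bracketed tree into (label, [children]) tuples; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", line)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":               # a bare token is a word
            word = tokens[pos]
            pos += 1
            return word
        pos += 1                             # consume "("
        label = ""                           # some treebanks use an empty root label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(parse_node())
        pos += 1                             # consume ")"
        return (label, children)

    return parse_node()


def leaves(tree):
    """Return the words of the tree in sentence order."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1] for word in leaves(child)]


tree = read_tree("(S (NP (PRP She)) (VP (VBZ eats) (NP (NN fish))))")
words = leaves(tree)
```

For a training file you would call `read_tree` once per non-empty line.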
maybe remove those spacings so it becomes a list instead of code? =)
Svish
I can't find any in depth documentation for this.
Rosarch
@Rosarch: There is not much but the code is reasonably easy to read.
Nathan Sanders
+4  A: 

The Stanford Parser, from the Stanford NLP group: http://nlp.stanford.edu/software/lex-parser.shtml

Java

English, Arabic, Chinese, German

GNU GPL v2 (a separate commercial license is available for proprietary use)

Statistical Part of Speech.

Under development, there is a stable release as of September 09, I believe.

The parsing depth (shallow, deep, wide-coverage ...).

The underlying algorithm (bottom-up, top-down, Tomita, Earley, left corner (LRN, which N?)).

Tools it relies on (specific taggers/tagsets, tokenizer/chunker or even lemmatizer).

Don't know these, I'm a pretty light user. :-p

piggles
It's a deep parser, which does PCFG parsing (Penn Treebank) and can convert its PCFG parses into dependency parses. It doesn't rely on any other tools; it can do its own tokenization.
Ken Bloom
@Ken: Thanks :)
piggles
A: 

I'm currently looking at the Enju parser. I have used the Stanford parser, the Berkeley parser, and PET.

The observations below are based on a sample dataset of (cleaned) sentences from a few websites. The Stanford parser made a measurable number of errors. The Berkeley parser was better, but its tagging and parsing still lacked accuracy. PET was by far the best of the three, but it is difficult to set up and use.

So my search for a better parser has currently led me to the Enju parser. It is also a probabilistic HPSG parser, and it gives both phrase structures and predicate-argument structures. Its XML output can be converted to PTB tag format.

  • Enju Parser
  • C++
  • English, trained on Penn, Brown and GENIA
  • Free for research purposes.
  • HPSG
  • active development, stable as far as I have used.
  • deep parser
  • Relies on a supertagger and up, a parser for unification-based grammars (packaged with the parser).
Sharmila