I just found a new parser, written by Dan Bikel.
- The name is just "Parsing Engine", to be found here.
- It's written in Java, using the Penn Treebank Tagset
- Pre-trained data exists for English, Chinese and Arabic. Korean is announced as "Coming soon"
- The license is strange. Non-commercial for educational purposes, but you are not allowed to redistribute, and you have to agree to some sort of EULA.
- It's a statistical CYK parser, using "language-packs" to ease parsing of different languages.
- Active development, the guy just had a 1.0 release
- The parser seems to be pretty wide-coverage/deep parsing oriented.
- Comes with its own tagger, but can also read input that is already tagged with the Penn Treebank tagset, in the typical Penn Lisp/one-line format.
The whole thing is based on a client/server model. There is a class called Switchboard, which serves as a hypervisor. Clients and servers can register with the switchboard. A server is typically a DecoderServer, which serves as the actual parser. A client will typically be a Parser, which is nothing more than a relay to the DecoderServer. The Parser class can read files and parse their input format, convert it to something the DecoderServer understands, and ultimately send the result its way.
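To make the division of labour above concrete, here is a minimal in-process sketch of the same three roles. It is not Bikel's code: ToySwitchboard, ToyDecoder and ToyClient are hypothetical names, there is no networking, and the round-robin choice of decoder is purely my assumption for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the decoder role (the thing that actually parses).
interface ToyDecoder {
    String parse(List<String> words);
}

// Hypothetical coordinator: decoders register here, clients ask it for one.
final class ToySwitchboard {
    private final List<ToyDecoder> decoders = new ArrayList<>();
    private int next = 0;

    synchronized void register(ToyDecoder decoder) {
        decoders.add(decoder);
    }

    // Hands out registered decoders in round-robin order (an assumption of this sketch).
    synchronized ToyDecoder nextDecoder() {
        if (decoders.isEmpty()) {
            throw new IllegalStateException("no decoder registered");
        }
        ToyDecoder decoder = decoders.get(next % decoders.size());
        next++;
        return decoder;
    }
}

// Hypothetical client: converts raw input into tokens and relays them to a decoder.
final class ToyClient {
    private final ToySwitchboard switchboard;

    ToyClient(ToySwitchboard switchboard) {
        this.switchboard = switchboard;
    }

    String parseLine(String line) {
        // Input conversion step: whitespace tokenization only.
        List<String> words = Arrays.asList(line.trim().split("\\s+"));
        return switchboard.nextDecoder().parse(words);
    }
}

final class SwitchboardDemo {
    public static void main(String[] args) {
        ToySwitchboard switchboard = new ToySwitchboard();
        // A fake decoder that just wraps the words in a flat bracket.
        switchboard.register(words -> "(S " + String.join(" ", words) + ")");
        ToyClient client = new ToyClient(switchboard);
        System.out.println(client.parseLine("the dog barks"));
    }
}
```

The point of the pattern is that the client never holds a decoder directly; it only knows the switchboard, so decoders can be added or swapped without touching client code.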
While the whole model sounds nice in theory, in practice I found it hardly usable, if at all. It's largely undocumented, and the only intended use case seems to be running the thing from the command line. Implementing it as a backend for an applet in a Tomcat Servlet, I had great trouble getting it going. For example, Bikel suffers from the same resource-handling dementia as everybody else in NLP: the API wants a String denoting the location of the model file. Internally, the class then opens an ObjectInputStream to that location and reads the serialized model from it. How stupid is this? Why not take an ObjectInputStream directly? In a secure Tomcat environment, you can only access stuff via JNDI, which makes this approach useless.
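What I would have liked to see is roughly the following. This is only a sketch of the general idea, not the parser's actual API: ToyModel and ModelLoader are made-up names, and the "model" is a trivial serializable object. The stream-based method is the real entry point, and the path-based one is just a convenience wrapper around it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical stand-in for a serialized parser model.
final class ToyModel implements Serializable {
    private static final long serialVersionUID = 1L;
    final String language;

    ToyModel(String language) {
        this.language = language;
    }
}

final class ModelLoader {
    // Core entry point: the caller decides where the bytes come from.
    static ToyModel load(InputStream in) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(in)) {
            return (ToyModel) ois.readObject();
        }
    }

    // Convenience overload for command-line use; delegates to the stream version.
    static ToyModel load(String path) throws IOException, ClassNotFoundException {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            return load(in);
        }
    }
}

final class ModelLoadingDemo {
    // Serializes a model into memory so the demo needs no file on disk.
    static byte[] serialize(ToyModel model) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
            oos.writeObject(model);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = serialize(new ToyModel("english"));
        // The stream could just as well come from a servlet container or a classpath resource.
        ToyModel model = ModelLoader.load(new ByteArrayInputStream(bytes));
        System.out.println(model.language);
    }
}
```

With an API shaped like this, a servlet could pass in whatever stream the container hands it (for instance from ServletContext.getResourceAsStream) and never touch the file system itself.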
But as far as parsers go, this is a good one. Pretty fast, pretty stable, and with rather nice coverage.