I'm working on a project that already has a C++ code base, and I would like to add a plug-in for some natural language processing. I really like GATE, but I'm not sure it's worth launching the JVM and splitting the project into C++ and Java portions. I noticed that UIMA has a C++ framework. I have not tried it, but it seems to have fewer features than GATE.

Does anyone know of a better option than trying to wrap GATE somehow in C++ (e.g. a better NLP library written in C++)? If I do wrap GATE in C++, what is the best way to do it? SOA?

Thanks

+2  A: 

Christopher Manning maintains a list of NLP resources (POS taggers, NP chunkers, sequence models, parsers...) in C++ and other languages. There is another list on Wikipedia.

There's also the Boost page on string and text processing.

anno
+1  A: 

Of course, it depends on exactly what you want to do.

GATE and UIMA are both frameworks for NLP, mostly designed around the idea of information management and extraction. It's not really fair to say GATE has more features than UIMA, since strictly speaking they are both only frameworks. However, GATE is bundled with ANNIE, which does have a lot of nice features that may be useful to you (again, depending on what you want to do). UIMA is bundled with the OpenNLP libraries, which mirror some, but not all, of these features, but they are written in Java and so would require loading the JVM.

You could find similar features to GATE/ANNIE or UIMA/OpenNLP using C++ libraries, but the nice thing about the two frameworks is that they are coherent and don't require a lot of 'glue code' to make individual libraries talk to each other.
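To make the 'glue code' point concrete, here is a minimal sketch of what a framework gives you: one shared document that every stage annotates, so stages can build on each other's output. All names here (`Annotation`, `Document`, `Stage`, `runPipeline`) are made up for illustration and are not the GATE or UIMA API; the two stages are toy stand-ins for real components.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not the GATE or UIMA API.
struct Annotation {
    std::size_t begin;  // offset of the first character
    std::size_t end;    // one past the last character
    std::string type;   // e.g. "Token" or "Capitalized"
};

struct Document {
    std::string text;
    std::vector<Annotation> annotations;

    // Returns the text span covered by an annotation.
    std::string covered(const Annotation& a) const {
        assert(a.begin <= a.end && a.end <= text.size());
        return text.substr(a.begin, a.end - a.begin);
    }
};

// Every stage reads and extends the same shared document.
using Stage = std::function<void(Document&)>;

// Toy stage 1: whitespace tokenizer.
void tokenize(Document& doc) {
    std::size_t i = 0;
    const std::size_t n = doc.text.size();
    while (i < n) {
        while (i < n && std::isspace(static_cast<unsigned char>(doc.text[i]))) ++i;
        const std::size_t start = i;
        while (i < n && !std::isspace(static_cast<unsigned char>(doc.text[i]))) ++i;
        if (i > start) doc.annotations.push_back({start, i, "Token"});
    }
}

// Toy stage 2: consumes the "Token" annotations produced by stage 1.
void tagCapitalized(Document& doc) {
    std::vector<Annotation> found;
    for (const Annotation& a : doc.annotations) {
        if (a.type == "Token" &&
            std::isupper(static_cast<unsigned char>(doc.text[a.begin]))) {
            found.push_back({a.begin, a.end, "Capitalized"});
        }
    }
    doc.annotations.insert(doc.annotations.end(), found.begin(), found.end());
}

// Runs the stages in order over one shared document.
Document runPipeline(std::string text, const std::vector<Stage>& stages) {
    Document doc{std::move(text), {}};
    for (const Stage& stage : stages) stage(doc);
    return doc;
}
```

If you assemble separate C++ libraries yourself, the shared `Document` and the conversions into and out of it are the part you end up writing and maintaining by hand.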

What's the reason behind not wanting to wrap GATE in C++ code? I can appreciate that it would add to the complexity of the project, but if your worries are about performance or memory, then the JVM may be the least of your problems. NLP tools tend to be very memory hungry: expect to give up half a gigabyte for NER models, and more for a statistical parser.
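If the main concern is complexity, one way to contain it is to keep the rest of the C++ code unaware of where the NLP actually runs. The sketch below is only an assumption about how that could look: `NlpBackend`, `StubBackend` and `TextPlugin` are hypothetical names, and a JVM-backed implementation (over JNI or talking to a separate Java process) would implement the same interface but is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type for illustration.
struct Entity {
    std::string text;   // the matched surface string
    std::string label;  // e.g. "Person", "Tool"
};

// The only NLP interface the rest of the C++ project sees.
class NlpBackend {
public:
    virtual ~NlpBackend() = default;
    virtual std::vector<Entity> findEntities(const std::string& text) = 0;
};

// Stand-in backend with no JVM involved, useful for tests.
// A GATE-backed class would derive from NlpBackend in the same way.
class StubBackend : public NlpBackend {
public:
    std::vector<Entity> findEntities(const std::string& text) override {
        std::vector<Entity> out;
        if (text.find("GATE") != std::string::npos) {
            out.push_back({"GATE", "Tool"});
        }
        return out;
    }
};

// The plug-in depends on the interface, not on any particular backend.
class TextPlugin {
public:
    explicit TextPlugin(std::unique_ptr<NlpBackend> backend)
        : backend_(std::move(backend)) {
        assert(backend_ != nullptr);
    }

    std::size_t countEntities(const std::string& text) {
        return backend_->findEntities(text).size();
    }

private:
    std::unique_ptr<NlpBackend> backend_;
};
```

With this split, the Java and JVM details stay in a single translation unit (or a separate process), and the backend can be swapped later without touching the callers.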

StompChicken
I'm an NLP newbie, so I appreciate your insights! My concerns about Java are half memory/speed and half the added complexity of bringing more languages, compilers, etc. into the project. Do you know if UIMA in C++ is less of a resource hog than GATE? Is there a noticeable difference (20% or more in CPU time or RAM consumption)?
User1
Sorry, I've never used the C++ version. Most of the best NLP libraries are written in Java for some reason.
StompChicken