
Hat in hand here. I'm a seasoned developer and I would be grateful for a bit of help. I don't have time to read or digest long, intricate discussions of theoretical concepts around NLP (or go get my PhD). That said, I have read a few and it's a damn interesting field. The problem is I need real-world solutions, for real-world products, in real-world time frames.

The problem I'm having is that right now I'm not sure what the right questions are to ask to get started implementing. I believe this is mostly a vocabulary problem. I'll read something somewhere, a blog post, a forum post, a whitepaper, and it says "I'm doing flooping with the blargy blarg method," so I go google flooping and blargy blarg, and I get references to more obscurity. It seemingly never ends.

So, my question is multi-part.

First, more generally: how do I become passingly educated on this quickly? Just-in-time educated. I only need to know what I need to know to take the next step. I've spent 20 years writing code. Explain quick. I'll get it. (I mean, provide a reference to something that explains quickly, of course.)

I'm happy to read the right book, but I don't want to read a book where I read the chapter introduction that explains what floopy floop is and then skip over the rest of the chapter with examples of floopy flooping (because now I get what it is). I also don't want to read a book that goes into too much detail with theoretical underpinnings or history. For example, the Jurafsky book seems like way more than I need: http://www.amazon.com/gp/product/0131873210. But I will read it if this is the right book to read. (It's also dang expensive!)

I need the root node of the expedited learning tree here, if you will. Point me in the right direction and I'll be quite grateful. I'm expecting quite a lot of firehose drinking - I just need the right firehose.

Second, what I need to do is take a single sentence, with a very reduced vocabulary, and get a grammar tree (sorry if this is the wrong terminology) that I can do something with.

I know I could easily write this style of command-line input handling in C in a more conventional manner, but I need it to be way better than that. But I don't need a chatterbot either.

What I'm doing needs to live in a constrained environment. I can't use Python (unfortunately). I can't ship with gigabytes of corpora. I need any libraries I use to be in C/C++. If I have to write this myself, I will. Hopefully it will be achievable, considering the reduced problem set. Maybe, probably, that's just naive. If so, let me know. :-)

Thanks in advance - Mike

+2  A: 

Parsing is a fairly hard problem. Learning to write a good parser from scratch is on the order of a masters degree.

If you wanted to build a tagger or a classifier, there are relatively straightforward algorithms that you can learn without a complete grip on all the math. Parsing for trees is another story.

To get some idea, have a look here. There's a relatively state-of-the-art parser system here, and relevant links to the data resources. If you want to build a parser for a language that doesn't have a Penn TreeBank, for example, you have a rather large data problem to solve.

It might also help if you told us what real-world problem you want to solve. There are a lot of very useful NLP things that can be done without a parser.

Edit

I've been struggling to formulate a useful response to your, ahem, existential situation here.

You'd like to extract information from free-form input typed by people, a la Zork.

You'd heard that there's this field, NLP, in which people have had some success with 'Natural Language Processing.' You're hoping to find implementations or algorithms accessible to a competent professional programmer without a gigantic self-education problem.

Here's a biased historical explanation of the problem. Once upon a time, there was a field called AI. As part of AI, smart people set out to try to solve the problem of understanding natural language.

The most important early discovery was that this was a very, very, very hard problem. What happened next was the natural thing: people set out to split the big, impossible problem into smaller, less impossible problems. Several levels of splitting ensued.

Now, many years later, some of those smaller problems have really well-understood solutions. Some of them even have open source implementations that are relatively easy for anyone to deploy. See Weka or Apache Mahout for an example. Unfortunately for you, these subproblems are not what you are looking for.

The original, really big, 'AI' problem is far from solved, and the problem you want to tackle is fairly far up the food chain.

As I understand the state of the field, the best way to tackle the problem you have set yourself does not involve parse trees. The most successful systems that have done this have used relatively simple template patterns to identify things they are interested in. Think 'regular expressions'. Think, sadly, 'hacks.' Google 'Eliza.' A major reason for this is that humans, invited to type at a system, do not type in full, grammatical sentences. They leave things out, mistype other things, and generally make it as hard as possible.
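To make the template-pattern idea concrete, here's a minimal sketch in C++, assuming the input has already been lowercased; the patterns and command names are invented for illustration, not taken from any existing system:

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string input = "please put the sword in the cave";

    // Hypothetical Eliza-style templates: each captures the pieces we care
    // about and ignores the filler words around them.
    std::regex put_pattern(R"((?:put|place)\s+(?:the\s+)?(\w+)\s+(?:in|on)\s+(?:the\s+)?(\w+))");
    std::regex give_pattern(R"(give\s+(?:the\s+)?(\w+)\s+to\s+(\w+))");

    std::smatch m;
    if (std::regex_search(input, m, put_pattern)) {
        std::cout << "PLACE object=" << m[1] << " location=" << m[2] << "\n";
    } else if (std::regex_search(input, m, give_pattern)) {
        std::cout << "GIVE object=" << m[1] << " receiver=" << m[2] << "\n";
    } else {
        std::cout << "Sorry, I didn't understand that.\n";
    }
}

A real system would need many more templates plus some normalization of the input, but this is honestly closer to how most shipped 'understand the player' features work than a full parser is.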

I'm sorry to be so discouraging, and I'd be very happy to read someone else's answer giving you a less pessimistic reading. On the other hand, I'd also encourage you to find an online copy of the paper entitled 'Artificial Intelligence Meets Natural Stupidity.'

bmargulies
Well, here's the problem. I don't know what a Penn TreeBank is, or what the difference between a tagger and a classifier is. Further, I don't know whether I want a Penn TreeBank or not. I looked at the link and found this: "The Pitman-Yor adaptor grammar sampler from our 2006 NIPS paper, now updated to estimate the Pitman-Yor hyperparameters as well using a Slice sampler and to use the C++ TR1 hash tables (August 2009)." This might as well be written in Sanskrit. What I'm asking for is help understanding all of this in an expedited manner.
Mike F
Also, for the sake of the discussion, what I'm trying to build is similar in scope to a text adventure interface, a.k.a. Zork. Thanks.
Mike F
You said you wanted a grammar. If what you want is to make sense of Zork input, then a grammar is not the way to go. Unfortunately, I have no really great advice on where you should start. I hope that someone else turns up with more encouraging advice, but mine would be to give up on a shortcut to understanding informal textual input.
bmargulies
Fair enough, thanks.
Mike F
+1  A: 

The Jurafsky & Martin book is the right book to read, or at least to have. It has a good index and a good bibliography, and it explains most of the concepts you need to know. There might also be a softcover version available, which is considerably cheaper.

johanbev
Ok, thanks. Don't see a softcover, but I'll look again.
Mike F
+1  A: 

What you are seeking to do is often referred to as Natural Language Understanding, and solutions run the gamut from simple pattern matching to methods that include statistical parsing.

I guess it is important to identify how much flexibility you want to give the user in your dialogue interface. If you want to allow an open-ended, unrestricted natural language, then you will definitely need to start with Jurafsky and Martin. In particular I would read the chapters on parsing, semantics, and dialogue systems.

If the syntax and vocabulary of your interface are fairly restricted, I would suggest treating this problem more like computer-language parsing. To do that, I would read up on Lex and Yacc (or the GNU variants Flex and Bison) to get a feel for how you can lexicalize input and then place it into some sort of hierarchical structure. Either way, I suggest you read the Wikipedia entries on Earley and CYK parsers (I would link to them, but the spam prevention won't allow me to post more than one link). They should shed some light on the basics of natural language parsing.
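To make the "lexicalize input" step concrete, here is a rough C++ sketch (names and details are mine, purely illustrative): lowercase the sentence, throw away punctuation, and split it into word tokens that a hand-written parser or a Bison-generated one can then consume.

#include <cctype>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Lowercase the sentence, replace punctuation with spaces, split on whitespace.
std::vector<std::string> tokenize(const std::string& sentence) {
    std::string cleaned;
    for (char c : sentence) {
        if (std::isalpha(static_cast<unsigned char>(c)))
            cleaned += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        else
            cleaned += ' ';
    }
    std::istringstream in(cleaned);
    std::vector<std::string> tokens;
    std::string word;
    while (in >> word)
        tokens.push_back(word);
    return tokens;
}

int main() {
    for (const auto& t : tokenize("Place the sword in the cave!"))
        std::cout << t << "\n";   // place / the / sword / in / the / cave
}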

With either approach, there is a possibility that you will need to craft your own grammar. A simple, more semantic-looking grammar might look something like this:

Sentence := CommandSentence
CommandSentence := Command DT Object PP DT Location
CommandSentence := Command DT Object PP Receiver
Command := give
Command := place
DT := a
DT := an
DT := the
Object := pencil
Object := sword
PP := on
PP := in
Location := cave
Location := table
Receiver := Bob
Receiver := Sally

A command like "Place the sword in the cave" would then get a simple parse like this:

(Sentence 
    (CommandSentence
        ((Command (place)) 
         (DT (the))
         (Object (sword))
         (PP (in))
         (DT (the))
         (Location (cave))
        )
     )
  )
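To give a sense of how small the implementation can stay with a vocabulary this restricted, here is a rough recursive-descent sketch in C++ for essentially the grammar above. It is illustrative only; the names are mine, and the input is assumed to be already lowercased and tokenized:

#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

struct Parser {
    std::vector<std::string> toks;
    size_t pos = 0;

    // Consume the next token if it belongs to the given word class.
    bool accept(const std::set<std::string>& words, std::string& out) {
        if (pos < toks.size() && words.count(toks[pos])) {
            out = toks[pos++];
            return true;
        }
        return false;
    }

    // CommandSentence := Command DT Object PP (DT Location | Receiver)
    bool parse() {
        std::string cmd, dt, obj, pp, tail;
        if (!accept({"give", "place"}, cmd)) return false;
        if (!accept({"a", "an", "the"}, dt)) return false;
        if (!accept({"pencil", "sword"}, obj)) return false;
        if (!accept({"on", "in"}, pp)) return false;
        if (accept({"a", "an", "the"}, dt)) {
            if (!accept({"cave", "table"}, tail)) return false;
            std::cout << "(Command " << cmd << ") (Object " << obj
                      << ") (Location " << tail << ")\n";
        } else {
            if (!accept({"bob", "sally"}, tail)) return false;
            std::cout << "(Command " << cmd << ") (Object " << obj
                      << ") (Receiver " << tail << ")\n";
        }
        return pos == toks.size();   // the whole sentence must be consumed
    }
};

int main() {
    std::istringstream in("place the sword in the cave");
    Parser p;
    for (std::string w; in >> w; )
        p.toks.push_back(w);
    if (!p.parse())
        std::cout << "no parse\n";
}

Running it on "place the sword in the cave" prints the command, object, and location, which is often all a Zork-style interface actually needs from the parse.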

You might also just try playing with a readily available parser like the Collins Parser to see if you can extract useful information from its output.

Good luck!

hapagolucky
This is an extremely helpful answer, thank you.
Mike F