views:

86

answers:

2

Am thinking about a project which might use similar functionality to how "Quick Add" handles parsing natural language into something that can be understood with some level of semantics. I'm interested in understanding this better and wondered what your thoughts were on how this might be implemented.


If you're unfamiliar with what "Quick Add" is, check out Google's KB about it.


6/4/10 Update
Additional research on "Natural Language Parsing" (NLP) yields results which are MUCH broader than what I feel is actually implemented in something like "Quick Add". Given that this feature expects specific types of input rather than the true free-form text, I'm thinking this is a much more narrow implementation of NLP. If anyone could suggest more narrow topic matter that I could research rather than the entire breadth of NLP, it would be greatly appreciated.

That said, I've found a nice collection of resources about NLP including this great FAQ.

+1  A: 

I would start by deciding on a standard way to represent all the information I'm interested in: event name, start/end time (and date), guest list, location. For example, I might use an XML notation like this:

<event>
    <name>meet Sam</name>
    <starttime>16:30 07/06/2010</starttime>
    <endtime>17:30 07/06/2010</endtime>
</event>

I'd then aim to build up a corpus of diary entries about dates, annotated with their XML forms. How would I collect the data? Well, if I was Google, I'd probably have all sorts of ways. Since I'm me, I'd probably start by writing down all the ways I could think of to express this sort of stuff, then annotating it by hand. If I could add to this by going through friends' e-mails and whatnot, so much the better.

Now I've got a corpus, it can serve as a set of unit tests. I need to code a parser to fit the tests. The parser should translate a string of natural language into the logical form of my annotation. First, it should split the string into its constituent words. This is is called tokenising, and there is off-the-shelf software available to do it. (For example, see NLTK.) To interpret the words, I would look for patterns in the data: for example, text following 'at' or 'in' should be tagged as a location; 'for X minutes' means I need to add that number of minutes to the start time to get the end time. Statistical methods would probably be overkill here - it's best to create a series of hand-coded rules that express your own knowledge of how to interpret the words, phrases and constructions in this domain.

Tommy Herbert
NLTK is a fantastic resource and this approach seems similar to my own thinking! Would you be aware of any PHP-based toolkits which you could recommend. I understand the limitations of PHP regarding speed to perform such a complex operation, but am interested in leveraging HipHop (http://developers.facebook.com/blog/post/358) solve this challenge.
Michael
I'm afraid I don't know of anything. The following blog entry suggests you need to either roll your own or jump through hoops to use NLTK from PHP. It's a year and a half old, though.http://www.akshatsinghal.com/content/natural-language-processing-php
Tommy Herbert
A: 

It would seem that there's really no narrow approach to this problem. I wanted to avoid having to pull along the entirety of NLP to figure out a solution, but I haven't found any alternative. I'll update this if I find a really great solution later.

Michael