I'm looking for a Java library to help parse user-entered text that represents an 'appointment' for a calendar application. For instance:

Lunch with Mike at 11:30 on Tuesday

or

5pm Happy hour on Friday

I've found some promising leads like https://jchronic.dev.java.net/ and http://www.datejs.com/ which can parse dates - but I also need to be able to extract the title of the event like "Lunch with Mike".

If such an API doesn't exist, I'm also interested in any thoughts on how best to approach the problem from a coding perspective.

A: 

I can't think of anything off the top of my head that would do that to your specifications. You could try the Stanford NLP Java package or OpenNLP. However, that might be a sledgehammer solution for what you're trying to do.

Alternatively, you can try parsing it yourself. Use JFlex to scan and tokenize the input, and CUP to define a grammar if you want to handle a wider range of input.

aduric
+3  A: 

Extending JChronic may be your best bet. Given the responses to this question, I think it's unlikely that a pre-built library for this exists (though it seems like such a thing could be useful... I'm guessing that the major use cases for parsing natural-language dates would be even better served if the parser could also extract additional data from user-supplied strings).

Implementation-wise, probably the most straightforward thing to do is to extend JChronic, since it supports quite a significant part of your use case. Moreover, as you can see from the unit test, extraneous information should already be ignored by the framework. Fortunately, too, if you look at the main class, it should not be too hard to extend / modify / wrap the parse() method to support a custom scanner for an event title. (My own preference of these would be to wrap the framework rather than fork and modify it, as that allows you to benefit from any improvements to the underlying code more easily.)

Ultimately, what may prove the most straightforward way of doing this is to write a regex-based parser that ignores most of what JChronic tries to capture (and this would mean becoming deeply familiar with the JChronic source code).

The key to successfully implementing this, as with any NLP-type project, is to have as many examples as you can possibly get, preferably as automated unit tests (even if the test cases exercise the same functionality many times over, it is better to have more examples than fewer). Fortunately, since we're talking about natural language, such test cases should be particularly easy to get, since even non-programmer friends, family, etc. should be able to provide you with "event descriptions" (or whatever you want to call them). You'll also want to focus especially on edge cases where the date-parsing bit might interfere with the location / title parsing bit (for example, in "sigur rós at 8pm" the "at" is clearly part of the time, whereas in "party at phoebe's saturday" it clearly isn't).
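To make the "at" edge case concrete, here is a minimal sketch of the wrapping idea. It does not use JChronic at all: two hand-written regexes stand in for the date parser, and the class and method names (TitleExtractor, extractTitle) are made up for illustration. It only knows clock times and weekday names, so treat it as a starting point for collecting test cases rather than a finished parser.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: strips simple time and weekday phrases, keeps the rest as the title. */
class TitleExtractor {
    // A clock time such as "11:30", "5pm" or "8 pm", together with a leading "at" if present.
    // Assumption: these two patterns stand in for whatever a real date parser recognizes.
    private static final Pattern TIME = Pattern.compile(
        "(?i)\\b(?:at\\s+)?(?:\\d{1,2}:\\d{2}(?:\\s*[ap]m)?|\\d{1,2}\\s*[ap]m)\\b");

    // A weekday name, together with a leading "on" if present.
    private static final Pattern DAY = Pattern.compile(
        "(?i)\\b(?:on\\s+)?(?:mon|tues|wednes|thurs|fri|satur|sun)day\\b");

    /** Removes time and weekday phrases and returns what is left, with whitespace collapsed. */
    static String extractTitle(String input) {
        String rest = TIME.matcher(input).replaceAll(" ");
        rest = DAY.matcher(rest).replaceAll(" ");
        return rest.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String[] samples = {
            "Lunch with Mike at 11:30 on Tuesday",
            "5pm Happy hour on Friday",
            "party at phoebe's saturday"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + extractTitle(sample));
        }
    }
}
```

Because "at" is only consumed when a clock time follows it, "party at phoebe's saturday" keeps its "at" while "Lunch with Mike at 11:30 on Tuesday" loses it. Each sample string is a natural candidate for a unit test.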

I realize I said quite a bit about JChronic, but I feel that it's a natural choice for your problem, as it already covers much of the "hard part" of parsing natural-language "appointments" (i.e., the fuzziness of the language we use about time) and is already implemented in the language you are targeting.

ig0774
+1  A: 

There are two relatively straightforward ways of trying to extract the appointment names.

Use a Sequence Labeling Package

If you have a labeled data set, you could train a sequence model, using packages like CRF++ or YamCha, to pull out appointment titles like "Lunch with Mike".

Use Named Entities and Rules

If you don't have a labeled dataset, you could probably get some mileage out of using a named entity recognizer to tag all the people, locations, and organizations in the appointment text. As a bonus, this will also give you times & dates, so you won't need to write your own code to pull those out.

With the named entities all labeled, it should be pretty straightforward to write some rules to extract or construct titles for each appointment.

If you're looking for a Java-based NER tagger, you could use the one released by Stanford or the one distributed with OpenNLP.
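As a rough illustration of what such rules might look like, here is a sketch that assumes the text has already been tokenized and labeled by some tagger. The Tagged class, the label names ("DATE", "TIME", "PERSON", "O") and the rule itself are assumptions made for this example, not the output format or API of any particular library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical rule-based title builder over tokens that a NER tagger has already labeled. */
class NerTitleRules {
    /** A token plus its entity label; the label vocabulary here is an assumption. */
    static final class Tagged {
        final String word;
        final String label;

        Tagged(String word, String label) {
            this.word = word;
            this.label = label;
        }
    }

    private static boolean isTemporal(Tagged t) {
        return t.label.equals("DATE") || t.label.equals("TIME");
    }

    /**
     * Rule: drop DATE/TIME tokens, and drop "at"/"on" when the very next token
     * is a DATE/TIME token. Everything else is joined to form the title.
     */
    static String buildTitle(List<Tagged> tokens) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Tagged t = tokens.get(i);
            if (isTemporal(t)) {
                continue;
            }
            boolean introducesTemporal = i + 1 < tokens.size()
                && isTemporal(tokens.get(i + 1))
                && (t.word.equalsIgnoreCase("at") || t.word.equalsIgnoreCase("on"));
            if (introducesTemporal) {
                continue;
            }
            kept.add(t.word);
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Hand-labeled input standing in for real tagger output.
        List<Tagged> tokens = Arrays.asList(
            new Tagged("Lunch", "O"), new Tagged("with", "O"), new Tagged("Mike", "PERSON"),
            new Tagged("at", "O"), new Tagged("11:30", "TIME"),
            new Tagged("on", "O"), new Tagged("Tuesday", "DATE"));
        System.out.println(buildTitle(tokens));
    }
}
```

The same rule leaves "at" alone when it is followed by a person or location, so a title like "party at phoebe's" survives intact. Real tagger output will be noisier than this hand-labeled input, so the rules will need adjusting against actual examples.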

dmcer