A couple of days ago, I read a blog entry (http://ayende.com/Blog/archive/2008/09/08/Implementing-generic-natural-language-DSL.aspx) where the author discusses the idea of a generic natural language DSL parser using .NET.

The brilliant part of his idea, in my opinion, is that the text is parsed and matched against classes using the same name as the sentences.

Take the following lines as an example:

Create user user1 with email [email protected] and password test
Log user1 in
Take user1 to category t-shirts
Make user1 add item Flower T-Shirt to cart
Take user1 to checkout

These lines would be converted using a collection of "known" objects that receive the result of parsing. An example object would be (using Java for my example):

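// Matches sentences that start with "Create user <name>".
public class CreateUser {
    private final String user;
    private String email;
    private String password;

    public CreateUser(String user) {
        this.user = user;
    }

    // Matches the phrase "with email <address>".
    public void withEmail(String email) {
        this.email = email;
    }

    // Matches the phrase "and password <password>".
    public void andPassword(String password) {
        this.password = password;
    }
}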

So, when processing the first sentence, the CreateUser class would be a match (obviously because its name is a concatenation of "create user") and, since its constructor takes a parameter, the parser would take "user1" as the user parameter.

After that, the parser would identify that the next part, "with email" also matches a method name, and since that method takes a parameter, it would parse "[email protected]" as being the email parameter.
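The matching steps described above can be sketched with plain reflection and no lexer. This is only a rough illustration under stated assumptions: `SentenceParser`, `register`, and `parse` are hypothetical names (they are not from the blog post), the getters on `CreateUser` were added just so the result can be inspected, `user1@example.com` is a placeholder address, and every argument is assumed to be a single word.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Command class from the question, with getters added for inspection (an assumption).
class CreateUser {
    private final String user;
    private String email;
    private String password;

    public CreateUser(String user) { this.user = user; }
    public void withEmail(String email) { this.email = email; }
    public void andPassword(String password) { this.password = password; }

    public String getUser() { return user; }
    public String getEmail() { return email; }
    public String getPassword() { return password; }
}

// Hypothetical parser: matches leading words to a class name, then word groups to method names.
class SentenceParser {
    private final Map<String, Class<?>> commands = new HashMap<>();

    // Register a command class under its lower-cased simple name, e.g. "createuser".
    void register(Class<?> type) {
        commands.put(type.getSimpleName().toLowerCase(Locale.ROOT), type);
    }

    Object parse(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        int i = 0;
        StringBuilder name = new StringBuilder();
        Class<?> type = null;
        // Concatenate leading words until they spell a registered class name.
        while (i < words.length && type == null) {
            name.append(words[i++].toLowerCase(Locale.ROOT));
            type = commands.get(name.toString());
        }
        if (type == null || i >= words.length) {
            throw new IllegalArgumentException("No command matches: " + sentence);
        }
        try {
            // The word after the class name becomes the constructor argument.
            Object command = type.getConstructor(String.class).newInstance(words[i++]);
            name.setLength(0);
            // Concatenate words until they spell a method name; the next word is its argument.
            while (i < words.length) {
                name.append(words[i++].toLowerCase(Locale.ROOT));
                Method method = findMethod(type, name.toString());
                if (method != null && i < words.length) {
                    method.invoke(command, words[i++]);
                    name.setLength(0);
                }
            }
            if (name.length() > 0) {
                throw new IllegalArgumentException("Unrecognized phrase: " + name);
            }
            return command;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not build command for: " + sentence, e);
        }
    }

    // Case-insensitive lookup of a public method that takes exactly one String.
    private static Method findMethod(Class<?> type, String lowerName) {
        for (Method m : type.getMethods()) {
            if (m.getName().toLowerCase(Locale.ROOT).equals(lowerName)
                    && m.getParameterCount() == 1
                    && m.getParameterTypes()[0] == String.class) {
                return m;
            }
        }
        return null;
    }
}
```

After `register(CreateUser.class)`, calling `parse("Create user user1 with email user1@example.com and password test")` would return a populated `CreateUser`. A sentence such as "Make user1 add item Flower T-Shirt to cart" would not work with this sketch, because the subject comes before the verb and the item name spans two words; handling that needs extra rules about where arguments start and end.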

I think you get the idea by now, right? One quite clear application of that, at least for me, would be to allow application testers to create "testing scripts" in natural language and then parse the sentences into classes that use JUnit to check for app behaviors.

I'd like to hear ideas, tips, and opinions on tools or resources that could help code such a parser in Java. Better yet if we could avoid complex lexers or frameworks like ANTLR, which I think would be like using a hammer to kill a fly.

More than that, if anyone is up for starting an open source project for this, I would definitely be interested.

+7  A: 

Considering the complexity of lexing and parsing, I don't know if I'd want to code all that by hand. ANTLR isn't that hard to pick up, and I think it is worth looking into based on your problem. If you use a parser grammar to build an abstract syntax tree from the input, it's pretty easy to then process that AST with a tree grammar. The tree grammar could easily handle executing the process you described.

You'll find ANTLR in many places including Eclipse, Groovy, and Grails for a start. The Definitive ANTLR Reference even makes it fairly straightforward to get up to speed on the basics quickly.

I had a project that had to handle some user-generated query text earlier this year. I started down a path to manually process it, but it quickly became overwhelming. I took a couple of days to get up to speed on ANTLR and had an initial version of my grammar and processor running in a few days. Subsequent changes and adjustments to the requirements would have killed any custom version, but required relatively little effort once I had the ANTLR grammars up and running.

Good luck!

Joe Skora
Joe, thanks. I added that book to my cart on Amazon a couple of times. Do you think it would be easy to create dynamic grammar trees based on the registered parsers? The library would have to use reflection to extract the class name, methods, (...) and create the grammar tree for ANTLR, right?
kolrie
You can insert Java (or another language, since ANTLR can generate code for a variety of languages) directly into the grammar. I used one grammar to parse my DSL and a second to walk the AST, processing the nodes. Since all of this runs in your app, it can easily create objects and call methods.
Joe Skora
It took a couple of days to get my head wrapped around ANTLR, having never taken a lexer/parser/compiler course. I am very glad I did it, as it will be useful again and again in the future. Parr wrote ANTLR, so the book is a great resource and a well-written introduction to lexing and parsing too.
Joe Skora
Joe, your feedback was excellent, I will definitely buy the book.
kolrie
If you go to the Pragmatic Bookshelf site (http://pragprog.com/titles/tpantlr/the-definitive-antlr-reference) you can get the book and a PDF copy for 45.75 + shipping. Good luck. You won't regret picking up a new tool for your skill set.
Joe Skora
I'll buy it for sure. I am in the same situation you were in: I don't have any experience with the lexer/parser/compiler trio either, and it has always been on my future learning list. Now is the time! Thank you again for your valuable feedback.
kolrie
+2  A: 

You might want to consider Xtext, which internally uses ANTLR and does some nice things like auto-generating an editor for your DSL.

Fabian Steeg
+1  A: 

You might find this multi-part blog series I did on using Antlr to be useful as a starting point. It uses Antlr 2, so some stuff will be different for Antlr 3:

http://tech.puredanger.com/2007/01/13/implementing-a-scripting-language-with-antlr-part-1-lexer/

Mark Volkmann's presentations/articles on Antlr are quite helpful as well:

http://www.ociweb.com/mark/programming/ANTLR3.html

I will second the suggestion about the Definitive ANTLR book, which is also excellent.

Alex Miller
+1  A: 

The first time I heard of DSLs was from JetBrains, the creators of IntelliJ IDEA.

They have a tool called MPS (Meta Programming System).

Here's the link: MPS

OscarRyz
A: 

"One quite clear application of that, at least for me, would be to allow application testers create "testing scripts" in natural language and then parse the sentences into classes that uses JUnit to check for app behaviors"

What you are talking about here sounds exactly like the tool FitNesse. Exactly as you describe, clients write acceptance test "scripts" in some kind of language that makes sense to them, and programmers build systems that make the tests pass. Even the implementation you talk about is pretty much exactly how FitNesse works - the vocabulary used in the scripts is concatenated to form function names etc., so that the FitNesse framework knows what function to call.
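To make the "concatenated to form function names" idea concrete, here is a small sketch of turning a phrase into a camel-case method name and calling it on a fixture object. This is not FitNesse's actual API: `PhraseNames`, `toMethodName`, `run`, and `CheckoutFixture` are hypothetical names made up for illustration.

```java
import java.util.Locale;

// Hypothetical fixture: one public method per phrase the test script may use.
class CheckoutFixture {
    boolean atCheckout = false;

    public void takeUserToCheckout() {
        atCheckout = true;
    }
}

// Hypothetical helper that maps phrases to method names and invokes them.
class PhraseNames {
    // "Take user to checkout" becomes "takeUserToCheckout".
    static String toMethodName(String phrase) {
        StringBuilder name = new StringBuilder();
        for (String word : phrase.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            String lower = word.toLowerCase(Locale.ROOT);
            if (name.length() == 0) {
                name.append(lower);
            } else {
                name.append(Character.toUpperCase(lower.charAt(0))).append(lower.substring(1));
            }
        }
        return name.toString();
    }

    // Look up the no-argument public method named by the phrase and call it.
    static void run(Object fixture, String phrase) {
        try {
            fixture.getClass().getMethod(toMethodName(phrase)).invoke(fixture);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("No method for phrase: " + phrase, e);
        }
    }
}
```

With this, `PhraseNames.run(new CheckoutFixture(), "Take user to checkout")` would call `takeUserToCheckout()` on the fixture, and an unknown phrase would fail with an exception rather than being silently ignored.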

Anyway, check it out :)

+1  A: 

If you call that "natural language", you're deluding yourself. It's still a programming language, just one that tries to mimic natural language - and I suspect that it will fail once you get into implementation details. In order to make it unambiguous, you'll have to put restrictions on the syntax that will confuse the users who've been led to think that they're writing "English".

The advantage of a DSL is (or should be, at any rate) that it's simple and clear, yet powerful in regard to the problem domain. Mimicking a natural language is a secondary concern, and may in fact be counter-productive to those primary goals.

If someone is too stupid or lacks the ability for the formally rigorous thinking that's required for programming, then a programming language that mimics a natural one will NOT magically turn them into a programmer.

When COBOL was invented, some people seriously believed that within 10 years there would be zero demand for professional programmers, since COBOL was "like English", and anyone who needed software could write it himself. And we all know how that's been working out.

Michael Borgwardt