I'd like to write an idiomatic parser for a markup language like Markdown. My version will be slightly different, but I perceive at least a minor need for something like this in Clojure, and I'd like to get on it.

I don't want to use a mess of RegExes (though I realize some will probably be needed), and I'd like to make something both powerful and in idiomatic Clojure.

I've begun a few different attempts (mostly on paper), but I'm not terribly happy with them, as I feel as though I'm just improvising. That would be fine, but I've done plenty of exploring in Clojure over the past month or two, and I'd like to follow, at least in part, in the paths of giants.

I'd like some pointers, or suggestions, or resources (books from O'Reilly would be awesome–love me some eBooks–but Amazon or wherever would be great, too). Whatever you can offer.

Thanks so much,

Isaac

EDIT Brian Carper has an interesting post on using ANTLR from Clojure.

There's also clojure-pg and fnparse, which are Clojure parser-generators. fnparse even looks like it's got some decent documentation.

Still looking for resources etc! Just thought I'd update these with some findings of my own.

+4  A: 

Best I can think of is that Terence Parr, the guy who leads the ANTLR parser generator project, has written a markup language documented here. Anyway, there's source code there to look at.

Steve Cooper
Interesting–I'll have to check that out. Alas, written in Java, but some of the ideas would certainly be applicable, even if the style isn't. (Noticed tons of "filler" classes, but hey, that's how you do it–just harder to conceptualize in a functional language.)
Isaac Hodes
Well, it's Clojure, right? Java interop is excellent, so why not write a Clojure API for ANTLR? Seems like a project many would benefit from...
mcpeterson
That's true, people might, but I've had more than my share of wrapping Java libs over the past month. Time for a little more Just Clojure™. That is definitely something for me to keep in mind, though!
Isaac Hodes
+4  A: 

There is also the clj-peg project, which lets you specify a PEG grammar for parsing data.
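For anyone unfamiliar with PEGs, the property that sets them apart is ordered choice: alternatives are tried left to right and the first match wins. The sketch below illustrates that idea in plain Clojure. It does not use clj-peg's API; the names `lit`, `alt`, `marker`, and `shadowed-marker` are hypothetical helpers invented for this example.

```clojure
;; A rule is a function from an input string to
;; [matched-text remaining-input] on success, or nil on failure.

(defn lit
  "Rule that matches the literal string s at the start of the input."
  [^String s]
  (fn [^String input]
    (when (.startsWith input s)
      [s (subs input (count s))])))

(defn alt
  "Ordered choice (the PEG `/` operator): rules are tried left to right
  and the first one that matches wins; later rules are never consulted."
  [& rules]
  (fn [input]
    (some (fn [rule] (rule input)) rules)))

;; marker <- "**" / "*"
(def marker (alt (lit "**") (lit "*")))

;; Same alternatives in the opposite order: "**" can never match,
;; because "*" always succeeds first.
(def shadowed-marker (alt (lit "*") (lit "**")))

(marker "**bold**")          ;; => ["**" "bold**"]
(shadowed-marker "**bold**") ;; => ["*" "*bold**"]
```

Because the order of alternatives decides the result, a PEG grammar for a Markdown-like language has to list longer markers before their prefixes.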

Alex Ott
Ah, that's pretty neat. Nice find! I'm looking into that.
Isaac Hodes
I've started checking out the source (after having looked through the docs), and this seems like it might be the answer. It hasn't been updated, as far as I can see, since February. I've emailed the guy who wrote it and asked him if he might be interested in sticking it on GitHub. Thanks for the tip!
Isaac Hodes
He's emailed me back, saying there will be a major update/rewrite coming up quite soon! FYI
Isaac Hodes
Glad to know this.
Alex Ott
+2  A: 

Two functional markup translators are:

Steve Cooper
Thanks! Those are some good resources. My Haskell is a bit weak, though I might be able to make some sense of it; my OCaml is nonexistent. Thanks!
Isaac Hodes
+4  A: 

Another one not yet mentioned here is clarsec, a port of Haskell's Parsec library.

I've recently been on a very similar quest to build a parser in Clojure. I went pretty far down the fnparse path, in particular using the (as yet unreleased) fnparse 3, which you can find in the develop branch on GitHub. It is broken into two forms: hound (specifically for LL(1) single-lookahead parsers) and cat, which is a packrat parser. Both are functional parsers built on monads (like clarsec). fnparse has some impressive work: the ability to document your parser, build error messages, etc. is neat. Documentation on the develop branch is nonexistent apart from the function docstrings, which are actually quite good.

In the end, I hit some roadblocks trying to make LL(k) work. I think it's possible, but it's hard without a decent set of examples showing how to make backtracking work well. I'm also so used to parsers that split lexing and parsing that it was hard for me to think in this combined style. I'm still very interested in this as a good solution in the future.
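To make "functional parsers built on monads" a little more concrete, here is a minimal sketch of the style in plain Clojure. It is not the fnparse or clarsec API; `return`, `bind`, `satisfy`, `many`, and the `emphasis` parser are hypothetical names chosen for illustration, and the example assumes the input is a string.

```clojure
;; A parser is a function from an input string to
;; [value remaining-input] on success, or nil on failure.

(defn return
  "Parser that consumes nothing and yields v."
  [v]
  (fn [input] [v input]))

(defn bind
  "Runs p, then passes its value to f, which returns the next parser.
  If p fails, the whole parser fails."
  [p f]
  (fn [input]
    (when-let [[v more] (p input)]
      ((f v) more))))

(defn satisfy
  "Parser that consumes one character if (pred ch) is truthy."
  [pred]
  (fn [input]
    (when-let [ch (first input)]
      (when (pred ch)
        [ch (subs input 1)]))))

(defn many
  "Applies p zero or more times, collecting the values in a vector.
  p must consume input when it succeeds, or this loops forever."
  [p]
  (fn [input]
    (loop [acc [] more input]
      (if-let [[v remaining] (p more)]
        (recur (conj acc v) remaining)
        [acc more]))))

(def star     (satisfy #(= % \*)))
(def not-star (satisfy #(not= % \*)))

;; *text* becomes [:em "text"]
(def emphasis
  (bind star
        (fn [_]
          (bind (many not-star)
                (fn [chars]
                  (bind star
                        (fn [_]
                          (return [:em (apply str chars)]))))))))

(emphasis "*hello* world") ;; => [[:em "hello"] " world"]
(emphasis "plain")         ;; => nil
```

Note that there is no separate lexing step: the same combinators work directly on characters, which is the shift in thinking described above. Libraries in this style typically add a macro to flatten the nested `bind` calls.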

In the meantime, I've fallen back to ANTLR, which is very robust, well-traveled, well-documented (in two books), etc. It doesn't have a Clojure back end, but I hope it will in the future, which would make it really nice for parser work. I'm using it for lexing, parsing, tree transformation, and templating via StringTemplate. It hasn't been entirely bump-free, but I've been able to find workable solutions to all problems so far. ANTLR's unique LL(*) parsing algorithm lets you write really readable grammars while still keeping them fairly efficient (and you can tweak things gradually if they're not).

Alex Miller
Very interesting–I'll look into that as well. I will look into ANTLR again, but I think I'd like to help make an existing Clojure parser even better.
Isaac Hodes