Hi, I'm looking to design a pseudo-Markdown language and write a parser that converts it to XHTML.

I've never written a compiler. I've taken a brief look at ANTLR and am wondering whether it can handle input in which whitespace is meaningful.

So say I have something like this:

some text

  some other text

  # bullet point

    # nested bullet point

Depending on context and number of prefixing spaces, those lines would mean different things.

What is a good tool to use to write a parser for this?

Thanks, Alex

+1  A: 

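My approach would be to make your lexer generate indent/outdent tokens. Store the current indentation level and match a pattern like \n * at the start of each line. Count the spaces matched; if the count differs from the current indentation level, emit an indent or outdent token and update the stored level.

Similarly, count tabs at start-of-line. Adding a rule that raises an error on a pattern of \n[ \t]* should stop people mixing tabs and spaces. This assumes a lexer that prefers the longest match and breaks ties by rule order (as lex/flex do), with the error rule placed after the space-only and tab-only rules, so that it only wins when the indentation contains both characters.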
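
To make that concrete, here is a minimal hand-written sketch in Python, not tied to ANTLR or any other generator. The function name `tokenize` and the token kinds `LINE`, `INDENT` and `OUTDENT` are names I made up for illustration. It keeps a stack of open indentation widths rather than a single level (my assumption, so that one line can close several levels at once), and it only rejects tabs and spaces mixed within a single line's prefix.

```python
def tokenize(text):
    """Yield (kind, value) tokens, wrapping nested lines in INDENT/OUTDENT."""
    levels = [0]  # stack of indentation widths that are currently open
    for lineno, line in enumerate(text.splitlines(), start=1):
        body = line.lstrip(" \t")
        if not body:
            continue  # blank lines do not change the indentation level
        prefix = line[:len(line) - len(body)]
        if " " in prefix and "\t" in prefix:
            raise SyntaxError(f"line {lineno}: mixed tabs and spaces")
        width = len(prefix)
        if width > levels[-1]:
            # Deeper than the current level: open one new level.
            levels.append(width)
            yield ("INDENT", width)
        while width < levels[-1]:
            # Shallower: close levels until we are back at or below width.
            levels.pop()
            yield ("OUTDENT", levels[-1])
        if width != levels[-1]:
            raise SyntaxError(f"line {lineno}: inconsistent outdent")
        yield ("LINE", body)
    # End of input: close whatever is still open.
    while len(levels) > 1:
        levels.pop()
        yield ("OUTDENT", levels[-1])
```

Fed the sample from the question, this produces LINE, INDENT, LINE, LINE, INDENT, LINE, OUTDENT, OUTDENT. A parser can then treat INDENT/OUTDENT like opening and closing brackets and never look at whitespace itself.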

Jack Kelly
+2  A: 

ANTLR can certainly be used for this. However, if you're new to ANTLR or to parser generators in general, I don't think I can give a short explanation of exactly how to do it. I recommend you try some simple things with ANTLR first and browse through The Definitive ANTLR Reference. It even has a paragraph about this type of problem, which is similar to parsing Python code: see Chapter 4.3 Rules, paragraph Emitting More Than One Token per Lexer Rule.
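
To give a rough idea of why emitting more than one token per rule matters here: a parser usually asks the lexer for one token at a time, but a single line that drops back several indentation levels has to produce several outdent tokens. One common way around that is to buffer tokens in a queue. The sketch below is plain Python and is not ANTLR's API or the book's code; `QueuedLexer`, `next_token` and the token names are hypothetical, and it only handles spaces.

```python
from collections import deque


class QueuedLexer:
    """Pull-style lexer sketch: next_token() hands out one token per call."""

    def __init__(self, lines):
        self._lines = iter(lines)   # input lines without trailing newlines
        self._pending = deque()     # tokens produced but not yet handed out
        self._levels = [0]          # stack of open indentation widths

    def next_token(self):
        # Refill the queue until it holds a token; one line may queue several.
        while not self._pending:
            line = next(self._lines, None)
            if line is None:
                # End of input: close every open level, then signal EOF.
                while len(self._levels) > 1:
                    self._levels.pop()
                    self._pending.append("OUTDENT")
                self._pending.append("EOF")
                break
            if not line.strip():
                continue  # skip blank lines
            width = len(line) - len(line.lstrip(" "))
            if width > self._levels[-1]:
                self._levels.append(width)
                self._pending.append("INDENT")
            while width < self._levels[-1]:
                self._levels.pop()
                self._pending.append("OUTDENT")
            self._pending.append("LINE")
        return self._pending.popleft()
```

The parser keeps calling `next_token()` as usual; the queue simply lets one input line answer several of those calls.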

Bart Kiers