For all you compiler gurus, I wanna write a recursive descent parser and I wanna do it with just code. No generating lexers and parsers from some other grammar, and don't tell me to read the dragon book; I'll come around to that eventually.

I wanna get into the gritty details about implementing a lexer and parser for a reasonably simple language, say CSS. And I wanna do this right.

This will probably end up being a series of questions, but right now I'm starting with a lexer. Tokenization rules for CSS can be found here.

I find myself writing code like this (hopefully you can infer the rest from this snippet):

public CssToken ReadNext()
{
    int val;
    while ((val = _reader.Read()) != -1)
    {
        var c = (char)val;
        switch (_stack.Top)
        {
            case ParserState.Init:
                if (c == ' ')
                {
                    continue; // ignore
                }
                else if (c == '.')
                {
                    _stack.Transition(ParserState.SubIdent, ParserState.Init);
                }
                break;

            case ParserState.SubIdent:
                if (c == '-')
                {
                    _token.Append(c);
                }
                _stack.Transition(ParserState.SubNMBegin);
                break;

            // ... remaining states elided in the original snippet ...
        }
    }
    return null; // end of stream: no more tokens
}

What is this called? And how far off am I from something reasonably well understood? I'm trying to balance something which is fair in terms of efficiency and easy to work with; using a stack to implement some kind of state machine is working quite well, but I'm unsure how to continue like this.

What I have is an input stream, from which I can read one character at a time. I don't do any lookahead right now; I just read the character and then, depending on the current state, try to do something with it.
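
(Note: `System.IO.TextReader` already provides one character of lookahead: `Peek()` returns the next character without consuming it, or -1 at the end of the stream. A minimal sketch using the `_reader` and `_token` fields from the snippet above:)

int la = _reader.Peek(); // look at the next character without consuming it
if (la != -1 && (char)la == '-')
{
    _token.Append((char)_reader.Read()); // now actually consume it
}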

I'd really like to get into the mindset of writing reusable snippets of code. This Transition method is currently meant to do that; it will pop the current state off the stack and then push the arguments in reverse order. That way, when I write Transition(ParserState.SubIdent, ParserState.Init) it will "call" a subroutine SubIdent which will, when complete, return to the Init state.
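
For illustration, here is a minimal sketch of a state stack with such a Transition method (the names follow the snippet above, but the exact shape is an assumption):

using System.Collections.Generic;

// Sketch of the state stack described above: Transition pops the
// current state and pushes its arguments in reverse order, so the
// first argument runs next and the rest act as "return addresses".
class StateStack
{
    private readonly Stack<ParserState> _states = new Stack<ParserState>();

    public ParserState Top
    {
        get { return _states.Peek(); }
    }

    public void Transition(params ParserState[] states)
    {
        _states.Pop(); // leave the current state
        for (int i = states.Length - 1; i >= 0; i--)
        {
            _states.Push(states[i]); // states[0] ends up on top
        }
    }
}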

The parser will be implemented in much the same way. Currently, having everything in a single big method like this allows me to easily return a token when I find one, but it also forces me to keep everything in one single big method. Is there a nice way to split these tokenization rules into separate methods?

Any input/advice on the matter would be greatly appreciated!

+2  A: 

You need to write your own recursive descent parser from your BNF/EBNF. I had to write my own recently and this page was a lot of help. I'm not sure what you mean by "with just code". Do you mean you want to know how to write your own recursive descent parser?

If you want to do that, you need to have your grammar in place first. Once you have your EBNF/BNF in place, the parser can be written quite easily from it.

The first thing I did when I wrote my parser, was to read everything in and then tokenize the text. So I essentially ended up with an array of tokens that I treated as a stack. To reduce the verbosity/overhead of pulling a value off a stack and then pushing it back on if you don't require it, you can have a peek method that simply returns the top value on the stack without popping it.
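
In C#, the question's language, such a wrapper might look like this (a sketch; the names are hypothetical):

using System.Collections.Generic;

// Hypothetical token-stream wrapper: Peek inspects the next token
// without consuming it, Pop consumes and returns it.
class TokenStream
{
    private readonly IList<string> _tokens;
    private int _pos;

    public TokenStream(IList<string> tokens) { _tokens = tokens; }

    public string Peek()
    {
        return _pos < _tokens.Count ? _tokens[_pos] : null;
    }

    public string Pop()
    {
        return _pos < _tokens.Count ? _tokens[_pos++] : null;
    }
}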

UPDATE

Based on your comment: I recently had to write a recursive-descent parser in Javascript from scratch. You can take a look at the parser here. Just search for the constraints function. I wrote my own tokenize function to tokenize the input as well. I also wrote another convenience function (peek, which I mentioned before). The parser parses according to the EBNF here.

This took me a little while to figure out because it's been years since I wrote a parser (last time I wrote one was in school!), but trust me, once you get it, you get it. I hope my example gets you further along on your way.

ANOTHER UPDATE

I also realized that my example may not be what you want because you might be going towards using a shift-reduce parser. You mentioned that right now you are trying to write a tokenizer. In my case, I did write my own tokenizer in Javascript. It's probably not robust, but it was sufficient for my needs.

function tokenize(options) {
    var str = options.str;
    var delimiters = options.delimiters.split("");
    var returnDelimiters = options.returnDelimiters || false;
    var returnEmptyTokens = options.returnEmptyTokens || false;
    var tokens = [];
    var lastTokenIndex = 0;

    // exists(array, item) is a small helper (not shown) that returns
    // true if item is contained in array, e.g. array.indexOf(item) !== -1.
    for (var i = 0; i < str.length; i++) {
        if (exists(delimiters, str[i])) {
            var token = str.substring(lastTokenIndex, i);

            if (token.length == 0) {
                if (returnEmptyTokens) {
                    tokens.push(token);
                }
            } else {
                tokens.push(token);
            }

            if (returnDelimiters) {
                tokens.push(str[i]);
            }

            lastTokenIndex = i + 1;
        }
    }

    // Flush whatever remains after the last delimiter as a final token.
    if (lastTokenIndex < str.length) {
        var token = str.substring(lastTokenIndex, str.length);
        token = token.replace(/^\s+/, "").replace(/\s+$/, "");

        if (token.length == 0) {
            if (returnEmptyTokens) {
                tokens.push(token);
            }
        } else {
            tokens.push(token);
        }
    }

    return tokens;
}

Based on your code, it looks like you are reading, tokenizing, and parsing at the same time - I'm assuming that's what a shift-reduce parser does? The flow for what I have is tokenize first to build the stack of tokens, and then send the tokens through the recursive-descent parser.

Vivin Paliath
Just code means that you pick your language, mine is C#, and then you write your lexer and parser entirely within the language. No tooling is allowed. I would eventually love to end up with something like the Boost Spirit parser framework, which is written entirely in C++ and provides a C++ programming model for parser generation. That's acceptable, but a distant goal for me right now.
John Leidegren
It's funny you say that! I had to write a recursive-descent parser from scratch in Javascript :p.
Vivin Paliath
Thanks for the link, looks very helpful, but right now I'm just trying to write a lexer/tokenizer that I will later use for writing the parser.
John Leidegren
+1 The link to Wikipedia shows the basics in short; in the dragon book (from the 80s) they use a lookahead and match() instead of accept() and expect()
stacker
It appears he has started down the path of writing a shift-reduce parser, so recursive-descent may be less intuitive to him.
Heath Hunnicutt
@Heath, what's shift-reduce ;) I still can't tell the difference between these two. I hate it when I get these damn shift-reduce or reduce-reduce conflicts... try to be very specific though, I know about the typical if-else-if case.
John Leidegren
@Heath I just noticed that later. I vaguely recall the shift-reduce parser from school. I'll need to read up on it.
Vivin Paliath
@John same here - I vaguely remember the term :) I think it's something I need to look into. Recursive-descent is what I remember from school.
Vivin Paliath
Thanks for the JavaScript source. When I look at the parser, it doesn't look very formal. I'd really wanna get away from splitting and/or substringing on specific characters and patterns. Often that might do, but I really want the formal lexer approach here, where my lexer is a well-defined state machine.
John Leidegren
Ummm, look up shift-reduce first? Another name for it is "bottom-up parsing" -- http://en.wikipedia.org/wiki/Bottom-up_parsing
Heath Hunnicutt
@Heath ahh, bottom-up - that makes more sense. That's what I remember from school. @John, what I posted is simply the tokenizer. The parser doesn't perform any `substring` operations. The only place I use patterns is at the lowest level of the parse tree (for example, ensuring valid identifier names and such). Sorry, I just realized I pointed you to a not-so-precise location. The parser actually starts at `constraints`.
Vivin Paliath
@Heath Right, I remember now... But I'm still stuck trying to refactor that tokenizer into something less like one big method.
John Leidegren
@John you can have your tokenize only tokenize and return an array (or stack) of tokens. Then delegate the parsing to the recursive-descent parser, which will consume the stack as it parses.
Vivin Paliath
@Vivin I want the formal lexical analysis, why is that part always left out? Maybe it's not really that exciting. Anyway, still stuck with a state machine implementation in one big loop with lots of if statements and the occasional switch statement.
John Leidegren
@Vivin I believe that's the plan. I'm probably gonna start a new question when I get stuck with the parser part.
John Leidegren
@John A lexer simply converts a sequence of characters into a sequence of tokens (see http://en.wikipedia.org/wiki/Lexical_analysis). That's also what a tokenizer does. So your first task is to identify what constitutes a token. You can figure that out from your grammar.
Vivin Paliath
@Vivin Right! Check out Dietrich's answer, it explains a lot.
John Leidegren
+1  A: 

It looks like you want to implement a "shift-reduce" parser, where you explicitly build a token stack. The usual alternative is a "recursive descent" parser, in which nested procedure calls build the same token stack with their own local variables on the actual hardware stack.

In shift-reduce, the term "reduce" refers to the operation performed on the explicitly maintained token stack. For example, if the top of the stack has become Term, Operator, Term, then a reduction rule can be applied, resulting in Expression as a replacement for the pattern. The reduction rules are explicitly encoded in a data structure used by the shift-reduce parser; as a result, all reduction rules can be found in the same spot in the source code.
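
As a hedged illustration (the node types and the single hard-coded rule below are invented for this example), one reduction step over an explicit stack could look like this in C#:

using System.Collections.Generic;

// Invented node types for the rule Term Operator Term -> Expression.
abstract class Node { }
class Term : Node { }
class Operator : Node { }
class Expression : Node
{
    public Expression(Node left, Node op, Node right) { /* store children */ }
}

static class Reducer
{
    // Try to apply the one reduction rule to the top of the stack;
    // returns true if a reduction happened.
    public static bool TryReduce(Stack<Node> stack)
    {
        if (stack.Count < 3)
            return false;

        Node right = stack.Pop();
        Node op = stack.Pop();
        Node left = stack.Pop();

        if (left is Term && op is Operator && right is Term)
        {
            stack.Push(new Expression(left, op, right)); // reduce
            return true;
        }

        // No rule matched: restore the stack so the parser can shift.
        stack.Push(left);
        stack.Push(op);
        stack.Push(right);
        return false;
    }
}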

The shift-reduce approach brings a few benefits compared to recursive-descent. On a subjective level, my opinion is that shift-reduce is easier to read and maintain than recursive-descent. More objectively, shift-reduce allows for more informative error messages from the parser when an unexpected token occurs.

Specifically, because the shift-reduce parser has an explicit encoding of rules for making "reductions," the parser is easily extended to articulate what sorts of tokens could legally have followed. (e.g., "; expected"). A recursive descent implementation cannot easily be extended to do the same thing.

A great book on both kinds of parser, and the trade-offs in implementing different kinds of shift-reduce is "Introduction to Compiler Construction", by Thomas W. Parsons.

Shift-reduce is sometimes called "bottom-up" parsing and recursive-descent is sometimes called "top-down" parsing. In the analogy used, nodes composed with highest precedence (e.g., "factors" in multiplication expression) are considered to be "at the bottom" of the parsing. This is in accord with the same analogy used in "descent" of "recursive descent".

Heath Hunnicutt
When it comes to writing the parser I believe I'd prefer the recursive descent variant. I believe I'm currently just using the stack-and-while-loop thing to build a state machine, which will be my lexer... what's my alternative? I still need a lexer, don't I?
John Leidegren
The state machine you are building is very similar to shift-reduce. This implies to me that you would have more fun doing it that way. Also, the end product is better.
Heath Hunnicutt
Hmm... shift-reduce until proven otherwise.
John Leidegren
+9  A: 

What you're writing is called a pushdown automaton. This is usually more power than you need to write a lexer; it's certainly excessive if you're writing a lexer for a modern language like CSS. A recursive descent parser is close in power to a pushdown automaton, but recursive descent parsers are much easier to write and to understand. Most parser generators generate pushdown automata.

Lexers are almost always written as finite state machines, i.e., like your code except with the "stack" object removed. Finite state machines are closely related to regular expressions (in fact, the two are provably equivalent). When designing such a lexer, one usually starts with the regular expressions and uses them to create a deterministic finite automaton, with some extra code in the transitions to record the beginning and end of each token.
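
To make that concrete, here is a minimal sketch in C#: a plain finite state machine with no stack, recording where the token starts (the token class [a-z][a-z0-9]* is invented for illustration):

// Illustrative finite state machine recognizing identifiers of the
// (invented) form [a-z][a-z0-9]*. The current state is a plain enum
// value; no stack is involved.
enum LexState { Start, InIdentifier }

static string ScanIdentifier(string input, ref int pos)
{
    LexState state = LexState.Start;
    int tokenStart = pos; // record where the token begins

    while (pos < input.Length)
    {
        char c = input[pos];
        switch (state)
        {
            case LexState.Start:
                if (char.IsLower(c)) { state = LexState.InIdentifier; pos++; }
                else return null; // no identifier starts here
                break;

            case LexState.InIdentifier:
                if (char.IsLower(c) || char.IsDigit(c)) { pos++; }
                else return input.Substring(tokenStart, pos - tokenStart);
                break;
        }
    }

    // End of input: emit whatever was accumulated, if anything.
    return state == LexState.InIdentifier
        ? input.Substring(tokenStart, pos - tokenStart)
        : null;
}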

There are tools to do this. The lex tool and its descendants are well known and have been translated into many languages. The ANTLR toolchain also has a lexer component. My preferred tool is ragel on platforms that support it. There is little benefit to writing a lexer by hand most of the time, and the code generated by these tools will probably be faster and more reliable.

If you do want to write your own lexer by hand, good ones often look something like this:

function readToken() // note: returns only one token each time
    while !eof
        c = peekChar()
        if c in A-Za-z
            return readIdentifier()
        else if c in 0-9
            return readInteger()
        else if c in ' \n\r\t\v\f'
            nextChar()
        ...
    return EOF

function readIdentifier()
    ident = ""
    while !eof
        c = peekChar()
        if c in A-Za-z0-9
            ident.append(nextChar()) // consume only what belongs to the token
        else
            // the first non-identifier character stays in the stream
            return Token(Identifier, ident)
            // or maybe...
            return Identifier(ident)
    return Token(Identifier, ident) // end of input

Then you can write your parser as a recursive descent parser. Don't try to combine the lexer and parser stages into one; it leads to a total mess of code. (According to the Parsec author, it's slower, too.)
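
For concreteness, a recursive descent parser consuming tokens from such a lexer might start out like this in C# (the token types, the Lexer class, and the tiny grammar are all invented for illustration):

using System;

enum TokenKind { Identifier, OpenBrace, CloseBrace, Eof }

class Token
{
    public TokenKind Kind;
    public string Value;
}

// One method per grammar rule; each rule consumes the tokens it
// expects and calls the rules it is built from. Invented grammar:
//   rule := IDENT '{' IDENT* '}'
class Parser
{
    private readonly Lexer _lexer; // assumed: exposes Next() returning a Token
    private Token _current;

    public Parser(Lexer lexer)
    {
        _lexer = lexer;
        _current = lexer.Next();
    }

    private Token Expect(TokenKind kind)
    {
        if (_current.Kind != kind)
            throw new Exception(kind + " expected, got " + _current.Kind);
        Token consumed = _current;
        _current = _lexer.Next();
        return consumed;
    }

    public void ParseRule()
    {
        Expect(TokenKind.Identifier); // selector
        Expect(TokenKind.OpenBrace);
        while (_current.Kind == TokenKind.Identifier)
            Expect(TokenKind.Identifier); // declarations, heavily simplified
        Expect(TokenKind.CloseBrace);
    }
}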

Dietrich Epp
You know, if I had this kind of help when I did my thesis, things would have been so much more understandable. Thanks for a great answer! No combining of the lexer and parser will ever take place; that much I've learned the hard way.
John Leidegren
Thanks for going into detail about lexers! +1
Vivin Paliath
I have but one question left for you: where does `c` go? When I say return `IDENTIFIER`, where does the token value reside? This is a trivial question but I have to ask! If I just put the character in a string builder, when do I flush it, and do I ever run the risk of leaving junk in my string builder?
John Leidegren
I corrected the pseudocode to show one way to handle keeping track of `c`. Originally I had been vague because there are a few different ways people tend to keep track of token values.
Dietrich Epp
Yeah, try to build the lexical analyzer as one object that you can call, like Lexer.Next(), then build the syntactic analyzer, which will use the lexer. I DO recommend that your lexer return both the value of the token as well as its type, i.e.: identifier:a, integer:3, string:"hello". This will make the syntactic analysis a lot easier.
Francisco Noriega
That's a lot better, thanks a bunch!
John Leidegren
+2  A: 

If you are going to hand-code everything from scratch, I would definitely consider going with a recursive descent parser. In your post you are not really saying what you will be doing with the token stream once you have parsed the source.

Some things I would recommend getting a handle on:
1. Good design for your scanner/lexer; this is what will be tokenizing your source code for your parser.
2. The next thing is the parser. If you have a good EBNF for the source language, it can usually be translated quite nicely into a recursive descent parser.
3. Another data structure you will really need to get your head around is the symbol table. This can be as simple as a hashtable or as complex as a tree structure that can represent complex record structures etc. I think for CSS you might be somewhere between the two (a minimal sketch follows this list).
4. And finally you want to deal with code generation. You have many options here. For an interpreter, you might simply interpret on the fly as you parse the code. A better approach might be to generate a form of i-code that you can then write an interpreter for, and later even a compiler. Of course for the .NET platform you could directly generate IL (probably not applicable for CSS :))
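
As a minimal sketch of point 3 (all names invented), a scoped symbol table in C# can be one hashtable per scope chained to its enclosing scope:

using System.Collections.Generic;

// Hypothetical scoped symbol table: one dictionary per scope plus a
// link to the enclosing scope, searched outward on lookup.
class SymbolTable
{
    private readonly Dictionary<string, object> _symbols =
        new Dictionary<string, object>();
    private readonly SymbolTable _parent;

    public SymbolTable(SymbolTable parent)
    {
        _parent = parent;
    }

    public void Define(string name, object value)
    {
        _symbols[name] = value;
    }

    // Walk outward through enclosing scopes until the name is found.
    public object Lookup(string name)
    {
        object value;
        if (_symbols.TryGetValue(name, out value))
            return value;
        return _parent != null ? _parent.Lookup(name) : null;
    }
}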


For references: I gather you are not heavy into the deep theory, and I do not blame you. A really good starting point for getting the basics without complex code, if you do not mind the Pascal that is, is Jack Crenshaw's 'Let's Build a Compiler':
http://compilers.iecc.com/crenshaw/
Good luck, I am sure you are going to enjoy this project.

Chris Taylor
I'm not gonna do code gen ;) And you are dead wrong, I love theory. It's just really hard to get started sometimes...
John Leidegren
@John, I gathered you are not going to do code gen; that is why I believe a recursive descent parser is a good option for you. I agree on the theory; if you have some practical experience, it helps later to better grasp the theory.
Chris Taylor
A: 

If you want the parser to also handle not-well-formed expressions, you really want a recursive descent parser. It's much easier to get the error handling and reporting usable.

For literature, I'd recommend some of the old work of Niklaus Wirth. He knows how to write. Algorithms + Data Structures = Programs is what I used, but you can find his Compiler Construction online.

Stephan Eggermont