Ok, so I would like to make a GLR parser generator. I know there exist such programs better than what I will probably make, but I am doing this for fun/learning so that's not important.

I have been reading about GLR parsing and I think I have a decent high level understanding of it now. But now it's time to get down to business.

The graph-structured stack (GSS) is the key data structure for use in GLR parsers. Conceptually I know how GSS works, but none of the sources I looked at so far explain how to implement GSS. I don't even have an authoritative list of operations to support. Can someone point me to some good sample code/tutorial for GSS? Google didn't help so far. I hope this question is not too vague.

A: 

The question that you're asking isn't trivial. I see two main ways of doing this:

  1. The direct representation. Your data structure is represented in memory as node objects/structures, where each node has a reference/pointer to the nodes below it on the stack (one could also make the references bi-directional, as an alternative). This is the way lists and trees are normally represented in memory. It is a bit more complicated in this case, because unlike a tree or a list, where one need only maintain a reference to the root node or head node to keep track of the whole structure, here we would need to maintain a list of references to all the 'top level' nodes.

  2. The adjacency list representation. This is similar to the way that mathematicians like to think about graphs: G = (V, E). You maintain a list of edges, indexed by the vertices which are the origin and termination points for each edge.

The first option has the advantage that traversal can be quicker, as long as the GSS isn't too flat. But the structure is slightly more difficult to work with. You'll have to roll a lot of your own algorithms.

The second option has the advantage of being more straightforward to work with. Most algorithms in textbooks seem to assume some kind of adjacency list representation, which makes it easier to apply the wealth of graph algorithms out there.
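To make the second option concrete, here is a minimal sketch of an adjacency list GSS. It assumes nodes are plain integer ids and that edges point from a node to the nodes beneath it on the stack. The names `AdjacencyGSS`, `add_node`, `add_edge` and `tops` are made up for this illustration and don't come from any library.

```python
from collections import defaultdict


class AdjacencyGSS:
    """GSS stored as G = (V, E): node ids plus a table of edges."""

    def __init__(self):
        self.state = {}                # node id -> parser state number
        self.below = defaultdict(set)  # node id -> ids of the nodes beneath it
        self.next_id = 0

    def add_node(self, parser_state):
        node_id = self.next_id
        self.next_id += 1
        self.state[node_id] = parser_state
        return node_id

    def add_edge(self, upper, lower):
        self.below[upper].add(lower)

    def tops(self):
        # Assumption for this sketch: a top node is one that no edge points to.
        pointed_to = set()
        for lowers in self.below.values():
            pointed_to |= lowers
        return sorted(set(self.state) - pointed_to)


gss = AdjacencyGSS()
bottom = gss.add_node(0)
a = gss.add_node(3)
b = gss.add_node(5)
gss.add_edge(a, bottom)  # two stacks that share the bottom node
gss.add_edge(b, bottom)
print(gss.tops())
```

Scanning every edge to find the tops is only done here to keep the sketch short. A real parser would more likely keep the set of tops up to date as it pushes and pops.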

Some resources:

There are various types of adjacency list, e.g. hash table based, array based, etc. The Wikipedia adjacency list page is a good place to start.

Here's a blog post from someone who has been grappling with the same issue. The code is Clojure, which may or may not be familiar, but the discussion is worth a look even if it isn't.

I should mention that I wish there were more information about representing Directed Acyclic Graphs (or Graph Structured Stacks, if you prefer), given the widespread application of this sort of model. I think there is room for better solutions to be found.

Rob Lachlan
A: 

Firstly, if you haven't already, you should read McPeak's paper on GLR (http://www.cs.berkeley.edu/~smcpeak/papers/elkhound_cc04.ps). It is an academic paper, but it gives good details on GSS, GLR, and the techniques used to implement them. It also explains some of the hairy issues with implementing a GLR parser.

You have three parts to implementing a graph-structured stack.

I. The graph data structure itself

II. The stacks

III. GLR's use of the GSS

You are right, Google isn't much help. And unless you like reading algorithms books, they won't be much help either.

I. The graph data structure

Rob's answer about "the direct representation" would be easiest to implement. It's a lot like a linked list, except each node has a list of next nodes instead of just one.
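As a rough sketch of what such a node could look like (the class name `GSSNode` and the fields `state` and `links` are my own choices for illustration, not taken from any particular parser):

```python
class GSSNode:
    """A stack node that may sit on top of several other nodes."""

    def __init__(self, state, links=None):
        self.state = state                          # parser state number
        self.links = list(links) if links else []   # the nodes beneath this one


bottom = GSSNode(0)
left = GSSNode(4, [bottom])
right = GSSNode(7, [bottom])
merged = GSSNode(9, [left, right])  # one node on top of two different stacks

# Unlike a plain linked list there can be several "heads",
# so they are kept in a list of their own.
tops = [merged]
```

A plain linked-list node would hold a single `next` reference. Here `links` plays that role and can hold any number of nodes.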

This data structure is a directed graph, but as McPeak states, the GSS may have cycles for epsilon-grammars.

II. The stacks

A graph-structured stack is conceptually just a list of regular stacks. For an unambiguous grammar, you only need one stack. You need more stacks when there is a parsing conflict so that you can take both parsing actions at the same time and maintain the different state both actions create. Using a graph allows you to take advantage of the fact that these stacks share elements.

It may help to understand how to implement a single stack with a linked list first. The head of the linked list is the top of the stack. Pushing an element onto the stack is just creating a new head and pointing it to the old head. Popping an element off the stack is just moving the pointer to head->next.
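Here is the single-stack case as a small sketch, with `Cell` and `LinkedStack` as illustrative names:

```python
class Cell:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next  # the cell beneath this one, or None at the bottom


class LinkedStack:
    def __init__(self):
        self.head = None  # top of the stack

    def push(self, value):
        # The new cell becomes the head and points at the old head.
        self.head = Cell(value, self.head)

    def pop(self):
        if self.head is None:
            raise IndexError("pop from empty stack")
        value = self.head.value
        self.head = self.head.next  # move the head pointer down one cell
        return value


stack = LinkedStack()
stack.push(1)
stack.push(2)
top = stack.pop()
```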

In a GSS, the principle is the same. Pushing an element is just creating a new head node and pointing it to the old head. If you have two shift operations, you will push two elements onto the old head and then have two head nodes. Conceptually this is just two different stacks that happen to share every element except the top ones. Popping an element is just moving the head pointer down the stack by following each of the next nodes, so a single pop may leave you with several heads.
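The same idea in a GSS can be sketched like this. `Node`, `push` and `pop` are hypothetical helpers written for this answer, and the state numbers are arbitrary:

```python
class Node:
    def __init__(self, state, below=()):
        self.state = state
        self.below = list(below)  # the "next" nodes, further down the stack


def push(head, state):
    """Create a new head on top of an existing head."""
    return Node(state, [head])


def pop(heads):
    """Move every head one step down and return the distinct nodes reached."""
    reached = []
    for head in heads:
        for node in head.below:
            if not any(node is seen for seen in reached):
                reached.append(node)
    return reached


old_head = push(Node(0), 1)

# Two pushes onto the same old head give two heads:
# two stacks that share everything except their top element.
heads = [push(old_head, 2), push(old_head, 3)]

after_pop = pop(heads)  # both heads lead back to the shared old head
```

Nothing is copied when the stack splits. Both new heads simply point at the same `old_head` object.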

III. GLR's use of the GSS

This is where McPeak's paper is a useful read.

The GLR algorithm takes advantage of the GSS by merging stack heads that have the same state element. This means that one state element may have more than one child. When reducing, the GLR algorithm will have to explore all possible paths from the stack head.
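A hedged sketch of both halves, merging and path exploration, follows. It assumes `heads_by_state` only holds the heads for the current input position, and that a reduction pops a fixed number of edges (the number of symbols on the rule's right-hand side). The names `push_merged` and `paths` are invented here, and this is not how Elkhound itself is written.

```python
class Node:
    def __init__(self, state):
        self.state = state
        self.below = []  # nodes directly beneath this one


def push_merged(heads_by_state, state, below):
    """Push `state` on top of `below`, reusing an existing head with that state."""
    node = heads_by_state.get(state)
    if node is None:
        node = heads_by_state[state] = Node(state)
    if not any(b is below for b in node.below):
        node.below.append(below)
    return node


def paths(node, length):
    """Yield every downward path that follows `length` edges from `node`."""
    if length == 0:
        yield [node]
        return
    for lower in node.below:
        for rest in paths(lower, length - 1):
            yield [node] + rest


bottom = Node(0)
left, right = Node(1), Node(2)
left.below.append(bottom)
right.below.append(bottom)

heads = {}
a = push_merged(heads, 5, left)
b = push_merged(heads, 5, right)  # same state, so the existing head is reused

# A reduction that pops two edges has to consider both paths from the merged head.
found = [[n.state for n in p] for p in paths(a, 2)]
print(found)  # [[5, 1, 0], [5, 2, 0]]
```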

You can optimize GLR by maintaining the deterministic depth of each node. This is just the distance from a split in the stack. This way you don't always have to search for a stack split.
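One possible way to track this is sketched below. The convention is my assumption, so check it against the paper: a node with no links has depth 1, a node with exactly one link has its child's depth plus 1, and a node with several links has depth 0. The field name `det_depth` is made up.

```python
def depth_from_links(below):
    """Deterministic depth for a node sitting on top of the nodes in `below`."""
    if len(below) == 0:
        return 1                       # bottom of the stack
    if len(below) == 1:
        return below[0].det_depth + 1  # extend the deterministic run by one
    return 0                           # the stack splits right here


class Node:
    def __init__(self, state, below=()):
        self.state = state
        self.below = list(below)
        self.det_depth = depth_from_links(self.below)


bottom = Node(0)          # depth 1
a = Node(1, [bottom])     # depth 2
b = Node(2, [bottom])     # depth 2
split = Node(3, [a, b])   # depth 0: two ways down from here
top = Node(4, [split])    # depth 1: only `top` itself is deterministic
```

With this convention, a reduction that pops no more than `det_depth` nodes from a head cannot run into a split, so it can be handled like an ordinary LR reduction. The sketch only computes depths at construction time. If merging adds a link to an existing node later, the depths of that node and of everything above it would need to be updated.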

This is a tough task! So good luck!

Paul