views:

108

answers:

2

Folks I'm implementing a weird thing, I have to write a utility to parse a syntax diagram in plain text format and convert it to xml format, the thing basically is identical as this from IBM(like in the "Creating a No-Conversion Job" part): http://publib.boulder.ibm.com/infocenter/idshelp/v10/index.jsp?topic=/com.ibm.sqls.doc/sqls17.htm Typical parser/lexer like ANTLR / yacc/ bison seems cannot deal with this kind of stuff, one idea I have is to convert the syntax diagram to a character bitmap and define some function like more_up,move_down, left, right or so in order to traverse the whole diagram to simulate the understanding process as human naked eye. Tho it sounds not proficient enough, I didn't figure out other better approach. Did anybody once play with a similar scenario? Maybe you could kindly shed some light on this.

Thank you in advance!

+2  A: 

I've never done anything like that before, but this is how I would approach it.

First, I'd start off with something like this:

class CharGrid(object):
    def __init__(self, text):
        self.lines = text.split('\n')

    def __getitem__(self, pos):
        try:
            col, row = pos
        except (TypeError, ValueError):
            raise KeyError('%r not a 2-tuple' % (pos,))
        if row >= len(self.lines):
            return ' '
        line = self.lines[row]
        if col >= len(line):
            return ' '
        return line[col]

so that I can access the characters in the text via 2D coordinates:

grid = CharGrid("""Creating a No-Conversion Job

>>-onpladm create job--job--+--------------+-- -n--------------->
                            '- -p--project-'

>-- -d--device-- -D--database-- -t--table----------------------->

   .---------------------------------------------------------------------.
   V                                                                     |
>----+-----------------------------------------------------------------+-+-><
     |                                                            (1)  |
     '-+-------------+--+-------------+--| Setting the Run Mode |------'
       '- -S--server-'  '- -T--target-'
""")

print ''.join((grid[0,0], grid[1,0], grid[2,0]))
print ''.join((grid[0,2], grid[1,2]))

(yielding)

Cre
>>

After that, the task would be converting the 2D grid of characters into a 1D sequence of symbols:

  1. read the label off the first line
  2. scan down the first column until you find >>
  3. scan right from the current position until you find [whatever]

... etc. Follow the chart in eyeball-order.

Once you have a 1D sequence of symbols, you can use a conventional parsing technique on that.

Matt Anderson
Matt, thanks for the pretty neat code, there are a few things I still need to consider like being sensible to optional / mandatory parameter and determination of parameter name / value, you class is a good start :)
Ripley
+1  A: 

The "character grid" idea for accessing single characters seems like a foundation step; another answer shows how to do that just fine. Now you can access the grid randomly and follow horizontal or vertical lines easily.

The real problem is that you want to construct a graph representing what the character grid says. Such a graph will consist of (duh), nodes, arcs, and annotations.

Probably the easiest thing to find are the nodes, which are likely indicated (see other answer) by characters representing branching points in the diagram (e.g. +). Each arc will be a string of characaters leading to a bend in the arc, or to another node. Following such strings of characters should be pretty straighforward (:-) ) and can produce a string representing the arc even if it has bends in it.

You'll likely want to enumerate all nodes (just scan the array). Node annotations must reasonably be nearby and you can simply scan a small radious around the node locations.

You'll want to enumerate each arc leaving a node, and collect the string representing the arc.

I'd feed the arc-string to a lexer to tear it apart; it may have interesting content (e.g., an annotation as in inline sequence of characters).

At this point you have nodes and arcs with associated annotations. Constructing the corresponding graph from these should be pretty easy.

Ira Baxter
Ira, your approach makes perfect sense, while something else seems need to be considered as well. Like some arcs are not topological movable (they imply the optional or mandatory parameters), and due to this the parser have to be sensible if the current node is on the main trunk or a side branch, or how deep it's been nested. My feeling is traversal the nodes from left to right will require fewer efforts, coding to figure these out ...
Ripley
"Some details left as an exercise for the reader" :-}
Ira Baxter