I am writing a Treetop grammar to parse Smart Game Format (SGF) files, and it mostly works so far. However, a few questions have come up.

  1. I am unsure how to actually access the structure Treetop generates after a parse.
  2. Is there a better way to handle capturing all characters than my chars rule?
  3. There is a case for comments that I can't seem to write correctly.

    C[player1 [4k\]: hi player2 [3k\]: hi!]

I can't wrap my head around how to handle a C[] node that has []'s nested inside it.

The following is my current progress.

sgf-grammar.treetop

grammar SgfGrammar
rule node
 '(' chunk* ')' {
  def value
   text_value
  end
 }
end

rule chunk
 ';' property_set* {
  def value
   text_value
  end
 }
end

rule property_set
 property ('[' property_data ']')* / property '[' property_data ']' {
  def value
   text_value
  end
 }
end

rule property_data
 chars '[' (!'\]' . )* '\]' chars / chars / empty {
  def value
   text_value
  end
 }
end

rule property
 [A-Z]+ / [A-Z] {
  def value
   text_value
  end
 }
end

rule chars
 [a-zA-Z0-9_/\-:;|'"\\<>(){}!@#$%^&\*\+\-,\.\?!= \r\n\t]*
end

rule empty
 ''
end
end

And my test case, which currently excludes C[] nodes with the nested bracket problem mentioned above:

example.rb

require 'rubygems'
require 'treetop'
require 'sgf-grammar'

parser = SgfGrammarParser.new
parser.parse("(;GM[1]FF[4]CA[UTF-8]AP[CGoban:3]ST[2]
RU[Japanese]SZ[19]KM[0.50]TM[1800]OT[5x30 byo-yomi]
PW[stoic]PB[bojo]WR[3k]BR[4k]DT[2008-11-30]RE[B+2.50])")
A:
  1. The structure comes back to you as a tree of SyntaxNodes (if the result is nil, check parser.failure_reason). You can walk this tree or (and this is recommended) you can augment it with functions that do what you want and just call your main function on the root.

If what you mean is "how do you access the components from within a node function?", there are several ways. You can get at them by position with the elements[x] notation or by the name of the rule that matched:

rule url_prefix
    protocol "://" host_name {
        def example
            # The same sub-node is reachable by position or by rule name.
            raise "unexpected node" unless elements[0] == protocol
            raise "unexpected node" unless elements[2] == host_name
            unless protocol.text_value == "http"
                print "#{protocol.text_value} not supported"
            end
        end
    }
end

You can also name them like so:

rule phone_number
    "(" area_code:( digit digit digit ) ")" ...

and then refer to them by name.

  2. Your chars rule looks fine if you only want to match those characters. If you want to match any character you can just use a dot (.) like in a regular expression, usually guarded by a negative lookahead as in your (!'\]' . )* so that it stops before the closing delimiter.

  3. I'm not familiar with the language you are trying to parse, but the rule you are looking for may be something like:

rule comment
    "C" balanced_square_bracket_string
end

rule balanced_square_bracket_string
    "[" ( [^\[\]] / balanced_square_bracket_string )* "]"
end

The middle part of the second rule matches, any number of times, either a single character that isn't a square bracket or a nested string with balanced square brackets.
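To see what that recursion does without involving Treetop at all, here is a hand-written matcher in plain Ruby that follows the same shape as balanced_square_bracket_string. It is only an illustration of the idea, not what Treetop generates, and the method name balanced_end is made up.

```ruby
# Returns the index just past a balanced "[...]" that starts at pos,
# or nil if there is no balanced bracket string at that position.
# Mirrors: "[" ( [^\[\]] / balanced_square_bracket_string )* "]"
def balanced_end(text, pos = 0)
  return nil unless text[pos] == '['
  pos += 1
  while pos < text.length
    case text[pos]
    when ']'
      return pos + 1                 # closing bracket of this level
    when '['
      pos = balanced_end(text, pos)  # nested balanced string
      return nil if pos.nil?
    else
      pos += 1                       # any character that is not a bracket
    end
  end
  nil # input ended before the closing bracket
end

# Single quotes keep the backslashes literal, as in the SGF file.
comment = 'C[player1 [4k\]: hi player2 [3k\]: hi!]'
whole_comment_matched =
  comment[0] == 'C' && balanced_end(comment, 1) == comment.length
puts whole_comment_matched
```

Note that the backslash is treated as an ordinary character here, so the sample comment matches only because each \] happens to pair with an earlier [. If a comment can contain an escaped \] with no opening bracket before it, the rule would need an extra alternative for the escape sequence.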

P.S. There is a fairly active Google group, with archives online & searchable.

MarkusQ
Thanks for the tips MarkusQ. I'm going to continue to wrap my head around this, but you've got me going in the right direction.
bojo