views:

238

answers:

1
+3  Q: 

Ruby parse string

I have a string

input = "maybe (this is | that was) some ((nice | ugly) (day |night) | (strange (weather | time)))"

How is the best method in Ruby to parse this string ?

I mean the script should be able to build sententes like this :

maybe this is some ugly night

maybe that was some nice night

maybe this was some strange time

And so on, you got the point...

Should I read the string char by char and bulid a state machine with a stack to store the parenthesis values for later calculation, or is there a better approach ?

Maybe a ready, out of the box library for such purpose ?

+7  A: 

Try Treetop. It is a Ruby-like DSL to describe grammars. Parsing the string you've given should be quite easy, and by using a real parser you'll easily be able to extend your grammar later.

An example grammar for the type of string that you want to parse (save as sentences.treetop):

grammar Sentences
  rule sentence
    # A sentence is a combination of one or more expressions.
    expression* <Sentence>
  end

  rule expression
    # An expression is either a literal or a parenthesised expression.
    parenthesised / literal
  end

  rule parenthesised
    # A parenthesised expression contains one or more sentences.
    "(" (multiple / sentence) ")" <Parenthesised>
  end

  rule multiple
    # Multiple sentences are delimited by a pipe.
    sentence "|" (multiple / sentence) <Multiple>
  end

  rule literal
    # A literal string contains of word characters (a-z) and/or spaces.
    # Expand the character class to allow other characters too.
    [a-zA-Z ]+ <Literal>
  end
end

The grammar above needs an accompanying file that defines the classes that allow us to access the node values (save as sentence_nodes.rb).

class Sentence < Treetop::Runtime::SyntaxNode
  def combine(a, b)
    return b if a.empty?
    a.inject([]) do |values, val_a|
      values + b.collect { |val_b| val_a + val_b }
    end
  end

  def values
    elements.inject([]) do |values, element|
      combine(values, element.values)
    end
  end
end

class Parenthesised < Treetop::Runtime::SyntaxNode
  def values
    elements[1].values
  end
end

class Multiple < Treetop::Runtime::SyntaxNode
  def values
    elements[0].values + elements[2].values
  end
end

class Literal < Treetop::Runtime::SyntaxNode
  def values
    [text_value]
  end
end

The following example program shows that it is quite simple to parse the example sentence that you have given.

require "rubygems"
require "treetop"
require "sentence_nodes"

str = 'maybe (this is|that was) some' +
  ' ((nice|ugly) (day|night)|(strange (weather|time)))'

Treetop.load "sentences"
if sentence = SentencesParser.new.parse(str)
  puts sentence.values
else
  puts "Parse error"
end

The output of this program is:

maybe this is some nice day
maybe this is some nice night
maybe this is some ugly day
maybe this is some ugly night
maybe this is some strange weather
maybe this is some strange time
maybe that was some nice day
maybe that was some nice night
maybe that was some ugly day
maybe that was some ugly night
maybe that was some strange weather
maybe that was some strange time

You can also access the syntax tree:

p sentence

The output is here.

There you have it: a scalable parsing solution that should come quite close to what you want to do in about 50 lines of code. Does that help?

molf
Thanks, I have read the examples on the net, but I don't understand how I can read nested parentheses...
astropanic
Thanks Man ! You're my hero :)
astropanic
http://www.bestechvideos.com/2008/07/18/rubyconf-2007-treetop-syntactic-analysis-with-ruby , nice video
astropanic