I am parsing a text file and want to be able to easily extend the set of tokens that can be recognized. Currently I have the following:

if line =~ /!DOCTYPE/
  puts "token doctype   " + line[0,20]
  @ast[:doctype] << line
elsif line =~ /<html/
  puts "token main HTML start   " + line[0,20]
  html_scanner_off = false
elsif line =~ /<head/ and not html_scanner_off
  puts "token HTML header starts   " + line[0,20]
  html_header_scanner_on = true
elsif line =~ /<title/
  puts "token HTML title   " + line[0,20]
  @ast[:HTML_header_title] << line
end

Is there a way to write this with a yield block, e.g. something like:

scanLine("title", :HTML_header_title, line)

?

+2  A: 

If you intend to parse HTML content, you might want to use one of the HTML parsers such as Nokogiri (http://nokogiri.org/) or Hpricot (http://hpricot.com/), which are really high-quality. A roll-your-own approach will probably take longer to perfect than learning how to use one of these parsers.

On the other hand, if you're dealing with something that's not quite HTML and can't be parsed that way, then you'll need to roll your own somehow. There are a few Ruby parser frameworks out there that may help, but for simple tasks where performance isn't a critical factor, you can get by with a pile of regexps like you have here.
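
If you stay with a pile of regexps, one way to keep it easy to extend is to hold the patterns in a table and loop over it. This is only a rough sketch: TOKEN_RULES, scan_line and the ast hash are made-up names for illustration, and the two rules are copied from the question.

```ruby
# Hypothetical rule table: each entry pairs a pattern with a label and
# the key under which matching lines are stored.
TOKEN_RULES = [
  { pattern: /!DOCTYPE/, label: "doctype",    key: :doctype },
  { pattern: /<title/,   label: "HTML title", key: :HTML_header_title }
].freeze

# Stores the line under the key of the first matching rule.
# Returns that key, or nil if no rule matches.
def scan_line(line, ast)
  rule = TOKEN_RULES.find { |r| line =~ r[:pattern] }
  return nil unless rule

  puts "token #{rule[:label]}   #{line[0, 20]}"
  ast[rule[:key]] << line
  rule[:key]
end

# Missing keys start out as empty arrays.
ast = Hash.new { |hash, key| hash[key] = [] }
scan_line("<!DOCTYPE html>", ast)
scan_line("<title>Example</title>", ast)
```

Adding a token then means adding one line to the table. Rules that need to flip state flags, as the <html> and <head> branches in the question do, would need a callable per rule rather than just a key.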

tadman
+2  A: 

Don't parse HTML with regexes.

That aside, there are several ways to do what you're talking about. One:

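class Parser
  # A named pattern plus the block to run when a line matches it.
  class Token
    attr_reader :name, :pattern, :block

    def initialize(name, pattern, block)
      @name = name
      @pattern = pattern
      @block = block
    end

    def process(line)
      @block.call(self, line)
    end
  end

  def initialize
    @tokens = []
  end

  # Runs the block of the first token whose pattern matches the line.
  # Returns nil when no token matches.
  def scanLine(line)
    token = @tokens.find { |t| line =~ t.pattern }
    token.process(line) if token
  end

  # Registers a token; the block receives the token and the matching line.
  def addToken(name, pattern, &block)
    @tokens << Token.new(name, pattern, block)
  end
end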

p = Parser.new
p.addToken("title", /<title/) {|token, line| puts "token #{token.name}: #{line}"}
p.scanLine('<title>This is the title</title>')

This has some limitations (like not checking for duplicate tokens), but works:

$ ruby parser.rb
token title: <title>This is the title</title>
$
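
The duplicate-token limitation could be handled by keying the tokens on their name. The following is a sketch under assumptions, not a drop-in replacement: CheckedParser, add_token and scan_line are hypothetical names, and unlike the class above the block receives the token name rather than a Token object.

```ruby
# Hypothetical variant: tokens live in a Hash keyed by name, so a second
# registration under the same name can be detected.
class CheckedParser
  def initialize
    @tokens = {} # name => [pattern, block]; insertion order is preserved
  end

  def add_token(name, pattern, &block)
    raise ArgumentError, "duplicate token: #{name}" if @tokens.key?(name)

    @tokens[name] = [pattern, block]
  end

  # Runs the block of the first token whose pattern matches the line.
  # Returns the block's result, or nil when no token matches.
  def scan_line(line)
    entry = @tokens.find { |_name, (pattern, _block)| line =~ pattern }
    return nil unless entry

    name, (_pattern, block) = entry
    block.call(name, line)
  end
end

parser = CheckedParser.new
parser.add_token("title", /<title/) { |name, line| "token #{name}: #{line}" }
puts parser.scan_line("<title>This is the title</title>")
```

Registering "title" a second time raises ArgumentError instead of silently adding a rule that can never be reached.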

sh-beta
Great! Thank you for the entry point to metaprogramming. For others, this presentation also helps explain the idea: http://www.slideshare.net/faro00oq/metaprogramming-with-ruby
poseid