Update: for the record, here's the implementation I ended up using.

Here's a trimmed-down version of the parser I'm working on. There's still a fair amount of code, but the basic concepts should be easy to grasp.

class Markup
  def initialize(markup)
    @markup = markup
  end

  def to_html
    # Non-capturing group for the CRLF alternative: a capturing group in
    # String#split would leak the matched "\r\n" separators back into the
    # results as stray paragraphs.
    @html ||= @markup.split(/(?:\r\n){2,}|\n{2,}/).map {|p| Paragraph.new(p).to_html }.join("\n")
  end

  class Paragraph
    def initialize(paragraph)
      @p = paragraph
    end

    def to_html
      # Inline markup first; bold must run before emphasis, or the
      # two-quote pattern would eat part of the three-quote one.
      @p.gsub!(/'{3}([^']+)'{3}/, "<strong>\\1</strong>")
      @p.gsub!(/'{2}([^']+)'{2}/, "<em>\\1</em>")
      @p.gsub!(/`([^`]+)`/, "<code>\\1</code>")

      case @p
      when /^=/
        # "= Title =" has two "=" chars and becomes h2, "== Title ==" h3, etc.
        level = (@p.count("=") / 2) + 1 # Starting on h2
        @p.gsub!(/^[= ]+|[= ]+$/, "")
        "<h#{level}>" + @p + "</h#{level}>"
      when /^(\*|\#)/
        # I'm parsing lists here. Quite a lot of code, and not relevant, so
        # I'm leaving it out.
      else
        @p.gsub!("\n", "\n<br/>")
        "<p>" + @p + "</p>"
      end
    end
  end
end

p Markup.new("Here is `code` and ''emphasis'' and '''bold'''!

Baz").to_html

# => "<p>Here is <code>code</code> and <em>emphasis</em> and <strong>bold</strong>!</p>\n<p>Baz</p>"

So, as you can see, I'm breaking the text into paragraphs, and each paragraph is either a header, a list or a regular paragraph.

Is it feasible to add support for nowiki tags (where everything between <nowiki></nowiki> is left unparsed) to a parser like this? Feel free to answer "no" and suggest alternative ways of building a parser :)

As a side note, the actual parser code is on GitHub: markup.rb and paragraph.rb

Answer (+3):

This sort of thing is much easier to manage if you use a simple tokenizer. One approach is to build a single regular expression that captures your entire grammar, but that quickly becomes unwieldy. An alternative, and likely the easier approach here, is to split the document into sections that need to be rewritten and sections that should be skipped.

Here's a simple framework you can extend as required:

def wiki_subst(string)
  buffer = string.dup
  result = ''

  # Consume the buffer one <nowiki>...</nowiki> span at a time: transform
  # the text before the span, copy the span through untouched, then
  # continue scanning from just past it.
  while (m = buffer.match(/<\s*nowiki\s*>.*?<\s*\/\s*nowiki\s*>/i))
    result << yield(m.pre_match)
    result << m.to_s
    buffer = m.post_match
  end

  # Transform whatever remains after the last span.
  result << yield(buffer)

  result
end

example = "replace me<nowiki>but not me</nowiki>replace me too<NOWIKI>but not me either</nowiki>and me"

puts wiki_subst(example) { |s| s.upcase }
# => REPLACE ME<nowiki>but not me</nowiki>REPLACE ME TOO<NOWIKI>but not me either</nowiki>AND ME
tadman
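
For illustration, here's one way wiki_subst could be combined with the Markup class from the question; this sketch is mine, not part of the answer. It parses everything outside the nowiki spans and then strips the tags themselves. One caveat: because each chunk is parsed independently, a nowiki span in the middle of a paragraph splits it into two <p> elements, so a real integration would want to protect spans before the paragraph split.

# Sketch only: assumes wiki_subst and the Markup class above are loaded.
source = "Parse ''this''<nowiki> but not ''this''</nowiki> please."

html = wiki_subst(source) { |chunk| Markup.new(chunk).to_html }
html = html.gsub(/<\s*\/?\s*nowiki\s*>/i, "") # drop the literal tags

puts html
# => <p>Parse <em>this</em></p> but not ''this''<p> please.</p>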
Is splitting the text into paragraphs, like my parser does, a form of tokenizing?
August Lilleaas
Using a very loose definition, perhaps. Generally a tokenizer splits an input stream into components that can be operated on individually, at the finest level of granularity required. Splitting into paragraphs, and then later splitting those into other parts, is a kind of two-pass tokenizer. Generally you can only get so far with a roll-your-own approach to parsing; at some point it's more efficient to go with a proper parser framework, but that's another subject.
tadman
Tagged as answer. Thanks!
August Lilleaas
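
Following up on the tokenizer discussion: here's a minimal single-pass sketch using StringScanner from Ruby's standard library, in case the concept is unfamiliar. The token names are invented for illustration, and it is deliberately naive; consecutive :text tokens would normally be coalesced.

require 'strscan'

# Minimal single-pass tokenizer sketch. It turns markup into a flat
# stream of [type, text] tokens; a second pass would assemble HTML.
def tokenize(markup)
  scanner = StringScanner.new(markup)
  tokens = []

  until scanner.eos?
    if scanner.scan(/<\s*nowiki\s*>(.*?)<\s*\/\s*nowiki\s*>/im)
      tokens << [:nowiki, scanner[1]] # protected: emit verbatim later
    elsif scanner.scan(/'{3}([^']+)'{3}/)
      tokens << [:bold, scanner[1]]
    elsif scanner.scan(/'{2}([^']+)'{2}/)
      tokens << [:em, scanner[1]]
    elsif scanner.scan(/`([^`]+)`/)
      tokens << [:code, scanner[1]]
    else
      tokens << [:text, scanner.scan(/./m)] # fall through one char at a time
    end
  end

  tokens
end

p tokenize("''hi'' <nowiki>''raw''</nowiki>")
# => [[:em, "hi"], [:text, " "], [:nowiki, "''raw''"]]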