+5  Q: 

Which Wiki Parser?

Does anyone know of a parser that can take Wiki formatted text as input and produce a tree of entities, in the same way that an XML parser produces an entity tree? To clarify, I'm looking for something that would take text like:

 -Intro-
 Textual stuff in ''italics''
 --Subhead--
 Yet more text

and produce a tree rooted at Intro with three child nodes, one of which (Subhead) itself has a child. It should understand the "simple" wiki format described at http://meta.wikimedia.org/wiki/Help:Wikitext.

I'm aware of several lexers for wiki text, but no tree parsers. I'm looking for something open source and written in C or C++.
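
To make the requested output concrete, here is a minimal C++ sketch of the block-level part only. It assumes the dash-delimited heading style from the example above (`-Intro-`, `--Subhead--`), which is not standard MediaWiki syntax, and the names `Node`, `headingLevel`, `trim` and `parseOutline` are hypothetical. Inline markup such as `''italics''` is left inside the text node, so Intro ends up with two children here rather than three.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical node type: a heading (level >= 1) or a plain text line (level == 0).
struct Node {
    std::string text;
    int level = 0;
    std::vector<std::unique_ptr<Node>> children;
};

// Strip leading and trailing spaces, tabs and carriage returns.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Return n if the line looks like n dashes, a title, n dashes; otherwise 0.
static int headingLevel(const std::string& line) {
    std::size_t n = 0;
    while (n < line.size() && line[n] == '-') ++n;
    if (n == 0 || line.size() <= 2 * n) return 0;  // no dashes, or no room for a title
    for (std::size_t i = line.size() - n; i < line.size(); ++i)
        if (line[i] != '-') return 0;              // closing dashes are missing
    return static_cast<int>(n);
}

// Build a tree under an artificial document root.
std::unique_ptr<Node> parseOutline(const std::string& input) {
    auto root = std::make_unique<Node>();
    root->text = "(document)";
    std::vector<Node*> open{root.get()};  // stack of open headings; open[0] is the root
    std::istringstream in(input);
    std::string raw;
    while (std::getline(in, raw)) {
        std::string line = trim(raw);
        if (line.empty()) continue;
        int level = headingLevel(line);
        auto node = std::make_unique<Node>();
        node->level = level;
        if (level > 0) {
            node->text = line.substr(level, line.size() - 2 * level);
            // A new heading closes every open heading at the same or a deeper level.
            while (open.size() > 1 && open.back()->level >= level) open.pop_back();
        } else {
            node->text = line;
        }
        assert(!open.empty());
        Node* added = node.get();
        open.back()->children.push_back(std::move(node));
        if (level > 0) open.push_back(added);
    }
    return root;
}
```

Calling `parseOutline` on the four example lines yields a document root whose only child is Intro; Intro holds the text line and Subhead, and Subhead holds "Yet more text". The stack of open headings is what turns a flat list of lines into a tree.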

+1  A: 

You can't do it directly from a wiki-formatted page because the wiki format doesn't carry complete structural information. Instead, wiki text is typically translated by a series of regular-expression rules and inserted into a predefined HTML or XHTML page framework.

The easiest way to do what you want is to find an appropriate formatter for some lightweight text format (like textile or creole), pass that through to generate XHTML, and then parse the XHTML using any regular parser.

Charlie Martin
+1  A: 

You may get some ideas out of this Perl module:

http://search.cpan.org/dist/HTML-WikiConverter-MediaWiki/

I understand you're looking for C/C++, but hey, you might get some goodness.

Andy Lester
+2  A: 

What I would do is

  1. Write a BNF grammar for that wiki language. Since the language is simple, the BNF will be simple too.
  2. Use the Boost Spirit framework to create a parser for it. It is really simple to use for grammars this small, and the BNF translates into C++ quite naturally.
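
As a rough illustration of step 1, a grammar for the inline part of the example might look like the BNF in the comment below. The C++ that follows is a hand-written recursive-descent parser for that grammar, not Spirit code; Spirit would let you write the same rules as C++ expressions instead. The grammar itself and the names `Span`, `InlineParser` and `parseInline` are assumptions made for this sketch, and bold (`'''`) and the rest of MediaWiki's inline syntax are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed grammar (a sketch, not the full MediaWiki syntax):
//   line   ::= { italic | text }
//   italic ::= "''" chars "''"
//   text   ::= chars            (runs up to the next "''")

// Hypothetical result type: one run of characters, italic or plain.
struct Span {
    bool italic;
    std::string text;
};

class InlineParser {
public:
    explicit InlineParser(const std::string& s) : src_(s) {}

    // line ::= { italic | text }
    std::vector<Span> parseLine() {
        std::vector<Span> out;
        while (pos_ < src_.size()) {
            Span s;
            if (parseItalic(s) || parseText(s)) out.push_back(s);
        }
        return out;
    }

private:
    bool atQuotes() const {
        assert(pos_ <= src_.size());
        return src_.compare(pos_, 2, "''") == 0;
    }

    // italic ::= "''" chars "''"; fails without consuming input if unterminated.
    bool parseItalic(Span& out) {
        if (!atQuotes()) return false;
        std::size_t close = src_.find("''", pos_ + 2);
        if (close == std::string::npos) return false;
        out = {true, src_.substr(pos_ + 2, close - pos_ - 2)};
        pos_ = close + 2;
        return true;
    }

    // text ::= chars; an unmatched "''" is kept as literal text.
    bool parseText(Span& out) {
        std::size_t start = pos_;
        if (atQuotes()) pos_ += 2;  // only reached when parseItalic has failed
        std::size_t next = src_.find("''", pos_);
        pos_ = (next == std::string::npos) ? src_.size() : next;
        out = {false, src_.substr(start, pos_ - start)};
        return pos_ > start;
    }

    std::string src_;
    std::size_t pos_ = 0;
};

// Convenience wrapper around the parser class.
std::vector<Span> parseInline(const std::string& line) {
    return InlineParser(line).parseLine();
}
```

Each grammar rule maps to one member function, which is the same correspondence Spirit gives you with less boilerplate. For the line from the question, `parseInline` returns a plain span followed by an italic span.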

Diego Sevilla
+2  A: 

I've written a parser in Java that internally creates such a tree: Java Wikipedia API

Maybe you can get some ideas for your C or C++ implementation?

The HTMLConverter class takes the internal node tree and converts it to HTML markup.

axelclk
+2  A: 

You may want to take a look at Mylyn WikiText, which is a parser that uses the Builder design pattern to convert wiki markup to various XML formats. It ships with builders for HTML, Eclipse Help, DITA and DocBook. You can use your own builder to customize the output.
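
Since the question asks for C or C++, here is a hedged C++ sketch of the Builder idea on its own. It is not Mylyn WikiText's API: the `DocumentBuilder` interface, the `HtmlBuilder` class and the `emitHeading` stand-in for a parser are all hypothetical names chosen for illustration. The point is only that the parser emits events and the builder decides what output they become.

```cpp
#include <cassert>
#include <string>

// Hypothetical event interface that a wiki parser would call into.
class DocumentBuilder {
public:
    virtual ~DocumentBuilder() = default;
    virtual void beginHeading(int level) = 0;
    virtual void endHeading(int level) = 0;
    virtual void characters(const std::string& text) = 0;
};

// One possible builder: turns the events into HTML markup.
class HtmlBuilder : public DocumentBuilder {
public:
    void beginHeading(int level) override {
        assert(level >= 1);
        out_ += "<h" + std::to_string(level) + ">";
    }
    void endHeading(int level) override {
        out_ += "</h" + std::to_string(level) + ">";
    }
    void characters(const std::string& text) override { out_ += escape(text); }
    const std::string& str() const { return out_; }

private:
    // Escape the characters that are special in HTML text content.
    static std::string escape(const std::string& s) {
        std::string r;
        for (char c : s) {
            switch (c) {
                case '&': r += "&amp;"; break;
                case '<': r += "&lt;"; break;
                case '>': r += "&gt;"; break;
                default:  r += c;
            }
        }
        return r;
    }
    std::string out_;
};

// Stand-in for a real parser: emits the events for a single heading.
void emitHeading(DocumentBuilder& builder, int level, const std::string& title) {
    builder.beginHeading(level);
    builder.characters(title);
    builder.endHeading(level);
}
```

A builder that allocates tree nodes instead of appending strings would give the entity tree the question asks for, without touching the parsing code.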

The parser can handle Textile, MediaWiki, TracWiki, TWiki and Confluence markup. It's extensible so that you can add new languages if you like.

The library is written in Java.