I just need to parse wikitext into Perl arrays of hashes. I found several modules; Text::MediawikiFormat seems to be what I need, but it returns HTML, and I want a Perl data structure. I also looked at:

+3  A: 

I wrote some code to do this a few years back, but it never got released, because parsing MediaWiki wikitext semantically is basically impossible. The problem is that MediaWiki lets you freely intermingle wikitext constructs with HTML constructs, and the official parser in MediaWiki works by progressively transforming the wikitext into HTML (mostly via a horrifically complex set of regular-expression substitutions).

Basically, it's my opinion that MediaWiki wikitext is unsuitable for any purpose besides being translated into HTML. If you want to parse anything out of it, you're probably best off using a piece of code that translates it to HTML and then parsing that HTML; a sketch of that approach follows.
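
For instance, here's a minimal sketch of the two-step approach, using Text::MediawikiFormat to produce the HTML and HTML::TreeBuilder to walk it. The sample wikitext is made up, and the extraction (pulling out links as an array of hashes) is just one illustration; adapt it to whatever structure you actually need.

    use strict;
    use warnings;
    use Text::MediawikiFormat;
    use HTML::TreeBuilder;

    # Made-up sample input.
    my $wikitext = "== External links ==\n"
                 . "* [http://example.org/ Example site]\n"
                 . "* [http://example.net/ Another site]\n";

    # Step 1: render the wikitext to HTML.
    my $html = Text::MediawikiFormat::format($wikitext);

    # Step 2: parse the HTML and collect the links as an array of hashes.
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my @links = map { { href => $_->attr('href'), text => $_->as_text } }
                $tree->look_down(_tag => 'a');
    $tree->delete;    # HTML::Element trees hold circular refs; free them explicitly

    printf "%s => %s\n", $_->{text}, $_->{href} for @links;

The win here is that once you're in HTML land, robust parsers already exist; getting the same @links directly out of the wikitext would mean reimplementing a large chunk of MediaWiki's rendering pipeline.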

Postscript: Parse::MediaWikiDump is an excellent module by a good friend of mine, but it doesn't actually parse wikitext at all; it reads Wikimedia dump files and extracts things like page text and titles, revision information, and the categories and links databases. It can give you the wikitext for a page, but it doesn't turn that wikitext into anything else; see the sketch below.
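
A minimal sketch of what that looks like in practice (the dump filename is a placeholder, and note that what you get back is still raw wikitext):

    use strict;
    use warnings;
    use Parse::MediaWikiDump;

    # 'pages-articles.xml' is a placeholder for a real dump file.
    my $pages = Parse::MediaWikiDump::Pages->new('pages-articles.xml');

    while (defined(my $page = $pages->next)) {
        next if defined $page->redirect;   # skip redirect pages
        my $title = $page->title;
        my $text  = ${ $page->text };      # text() returns a ref to the raw wikitext
        printf "%s: %d bytes of wikitext\n", $title, length $text;
    }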

hobbs
+1 This format is horrible. – Kinopiko
Wow, popular PHP code that's not well-designed? I'm shocked! – jrockway