I just need to parse wikitext into Perl arrays of hashes. I found several modules; Text::MediawikiFormat seems to be what I need, but it returns HTML, and I want a Perl data structure. I also looked at:

+3  A: 

I wrote some code to do this a few years back, but it never got released, because parsing MediaWiki wikitext semantically is basically impossible. The problem is that MediaWiki lets you freely intermingle wikitext constructs with HTML constructs, and the official parser in MediaWiki works by progressively transforming the wikitext into HTML (mostly via a horrifically complex set of regular-expression substitutions).

Basically, it's my opinion that MediaWiki wikitext is unsuitable for any purpose besides being translated into HTML. If you want to parse anything out of it, you're probably best off using a piece of code that translates it to HTML and then parsing that HTML; a sketch of that approach follows.
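
For instance, here's a minimal sketch of the two-step approach, using Text::MediawikiFormat to produce the HTML and HTML::TreeBuilder to walk it. The sample wikitext is made up, and the extraction (pulling out links as an array of hashes) is just one illustration; adapt it to whatever structure you actually need.

    use strict;
    use warnings;
    use Text::MediawikiFormat;
    use HTML::TreeBuilder;

    # Made-up sample input.
    my $wikitext = "== External links ==\n"
                 . "* [http://example.org/ Example site]\n"
                 . "* [http://example.net/ Another site]\n";

    # Step 1: render the wikitext to HTML.
    my $html = Text::MediawikiFormat::format($wikitext);

    # Step 2: parse the HTML and collect the links as an array of hashes.
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my @links = map { { href => $_->attr('href'), text => $_->as_text } }
                $tree->look_down(_tag => 'a');
    $tree->delete;    # HTML::Element trees hold circular refs; free them explicitly

    printf "%s => %s\n", $_->{text}, $_->{href} for @links;

The win here is that once you're in HTML land, robust parsers already exist; getting the same @links directly out of the wikitext would mean reimplementing a large chunk of MediaWiki's rendering pipeline.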

Postscript: Parse::MediaWikiDump is an excellent module by a good friend of mine, but it doesn't actually parse wikitext at all; it reads Wikimedia dump files and extracts things like page text and titles, revision information, and the categories and links databases. It can give you the wikitext for a page, but it doesn't turn that wikitext into anything else; see the sketch below.
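
A minimal sketch of what that looks like in practice (the dump filename is a placeholder, and note that what you get back is still raw wikitext):

    use strict;
    use warnings;
    use Parse::MediaWikiDump;

    # 'pages-articles.xml' is a placeholder for a real dump file.
    my $pages = Parse::MediaWikiDump::Pages->new('pages-articles.xml');

    while (defined(my $page = $pages->next)) {
        next if defined $page->redirect;   # skip redirect pages
        my $title = $page->title;
        my $text  = ${ $page->text };      # text() returns a ref to the raw wikitext
        printf "%s: %d bytes of wikitext\n", $title, length $text;
    }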

hobbs
+1 This format is horrible. – Kinopiko
Wow, popular PHP code that's not well-designed? I'm shocked! – jrockway