I am attempting to parse (in Java) Wikimedia markup as found on Wikipedia. There are a number of existing packages out there for this task, but I have not found any to fit my needs particularly well. The best package I have worked with is the Mathclipse Bliki parser, which does a decent job on most pages.

This parser is incomplete, however: it fails to parse certain pages and parses others incorrectly. Sadly, the code is rather messy, so fixing problems in the parsing engine is time-consuming and error-prone.
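For reference, this is roughly how I am driving the Bliki engine (a minimal sketch; the class name and constructor arguments are from memory and may differ between versions):

    import info.bliki.wiki.model.WikiModel;

    public class BlikiExample {
        public static void main(String[] args) {
            // Image and link URL patterns for the target wiki.
            WikiModel wikiModel = new WikiModel(
                    "http://en.wikipedia.org/wiki/${image}",
                    "http://en.wikipedia.org/wiki/${title}");

            // Render a fragment of wiki markup to HTML.
            String html = wikiModel.render("This is a '''bold''' [[Hello World]] link.");
            System.out.println(html);
        }
    }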

In the search for a better parsing engine I have investigated using an EBNF-based parser (specifically ANTLR). After some attempts, however, this approach does not seem particularly well suited to the task, as Wikimedia markup is relatively relaxed and cannot easily be forced into a structured grammar.
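As a concrete illustration (a made-up fragment of my own), markup like the following is still perfectly legal: the quotes are unbalanced, the meaning of a run of apostrophes depends on what follows it, and unterminated constructs must simply degrade to plain text rather than cause a parse failure:

    ''italic, or is it '''bold italic'''?'' stray quotes'''
    * a list item with a [[link that never closes
    {| a table
    | containing an {{unterminated template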

My experience with ANTLR and similar parsers is very limited however, so it may be my inexperience that is causing problems rather than such parsers being inherently poorly suited for this task. Can anyone with more experience on these topics weigh in here?

@Stobor: I mentioned that I've looked at various parsing engines, including the ones returned by the Google query. The best I've found so far is the Bliki engine. The problem is that fixing problems with such parsers becomes incredibly tedious, because they are all essentially long chains of conditionals and regular expressions, resulting in spaghetti code. I am looking for something more akin to the EBNF method of parsing, as that method is much clearer and more concise, and thus easier to understand and evolve. I've seen the MediaWiki link you posted, and it seems to confirm my suspicion that EBNF out of the box is poorly suited for this task. Thus I am looking for a parsing engine that is clear and understandable like EBNF, but also capable of handling the messy syntax of wiki markup.

+2  A: 

You are correct: Wikimedia markup does not lend itself to well-defined EBNF grammars.

You will have to look at tools that can backtrack in order to parse wiki markup.

btyacc, which is a backtracking yacc: http://www.siber.com/btyacc/

You could also look at Accent, which is better than yacc: http://accent.compilertools.net/

Or you may have to break down, learn some flavour of Prolog, and roll your own. Whatever you do, you have an interesting learning period ahead of you.
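To make the backtracking idea concrete, here is a rough, hypothetical Java sketch (the class and the rule are invented for illustration, not taken from any real parser). It tentatively commits to a bold rule and rewinds to treat the apostrophes as literal text when the closing quotes never appear, which is exactly the kind of fallback that relaxed wiki markup demands and that a plain LL/LR grammar does not give you for free.

    public class BacktrackDemo {
        private final String input;
        private int pos;

        public BacktrackDemo(String input) {
            this.input = input;
            this.pos = 0;
        }

        // Try to parse '''bold''' at the current position. If the closing
        // quotes never arrive, rewind and treat the apostrophes as literal
        // text instead.
        public String parseBoldOrText() {
            int mark = pos;                    // remember where we started
            if (consume("'''")) {
                int end = input.indexOf("'''", pos);
                if (end >= 0) {
                    String inner = input.substring(pos, end);
                    pos = end + 3;
                    return "<b>" + inner + "</b>";
                }
                pos = mark;                    // backtrack: no closing quotes
            }
            // Fall back: everything up to the end is plain text.
            String text = input.substring(pos);
            pos = input.length();
            return text;
        }

        private boolean consume(String token) {
            if (input.startsWith(token, pos)) {
                pos += token.length();
                return true;
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(new BacktrackDemo("'''bold''' rest").parseBoldOrText());    // <b>bold</b>
            System.out.println(new BacktrackDemo("'''no closing quotes").parseBoldOrText()); // literal text
        }
    }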

Good luck

+1  A: 

I once tried to write a parser for Boost.Quickbook, which is essentially the same as the wiki-text used by Wikipedia.

It was a very tedious process just to get some basics working, but I think it would eventually be possible to write an EBNF grammar for it. If you're interested, my partial parser is available online (the grammar is embedded in doc-strings).

avakar
A: 

This answer is a little out there, but what about rendering the text and then parsing the resulting HTML DOM in order to figure out the different wiki components?
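For what it's worth, a quick sketch of that approach using the jsoup HTML parser as one possible example (the Wikipedia URL and the bodyContent id are just assumptions for illustration):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class RenderedPageExample {
        public static void main(String[] args) throws Exception {
            // Fetch (or otherwise obtain) the rendered HTML of a wiki page.
            Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Java").get();

            // Headings, links and tables are now ordinary DOM nodes, so the
            // messy wiki syntax never has to be parsed directly.
            for (Element heading : doc.select("h1, h2, h3")) {
                System.out.println("Section: " + heading.text());
            }
            for (Element link : doc.select("#bodyContent a[href^=/wiki/]")) {
                System.out.println("Internal link: " + link.attr("href"));
            }
        }
    }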

mP
+3  A: 

Parsing MediaWiki content in any generic sense is pretty much impossible short of using MediaWiki itself. In order to parse it you need to be able to fully parse HTML and CSS (since they can be embedded), and handle full template instantiation and expansion, as well as any parser additions the relevant content may have been using. That template instantiation is equivalent to a preprocessor.
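To illustrate why, a fairly ordinary-looking article body can contain nested template calls and parser extensions like the following (an invented fragment; the template names are only illustrative), none of which can be interpreted without fetching and expanding the templates it refers to:

    {{Infobox person
    | name       = {{PAGENAME}}
    | birth_date = {{Birth date and age|1970|1|1}}
    | known_for  = [[Widget (economics)|Widgets]]
    }}
    The '''widget''' was introduced {{circa|1970}}.<ref>{{cite web
     |url=http://example.com/widgets |title=Widgets}}</ref>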

It is in some senses similar to parsing C++, except that the parser must also handle malformed input and arbitrary syntax additions made by parser extensions. The actual MediaWiki implementation is a lot like Perl 5: the original implementation was not so bad because all the edge cases just fall out of however things are linked together, but getting any subsequent implementation to do the same thing is really complicated, especially since the behaviors are often emergent and undocumented rather than designed.

If you do not need 100% of pages to work, or do not need to extract all of the content, you might be able to cobble something together that works for you, and as you have noted there are some packages that do that. Without knowing your precise needs, I doubt anyone can give you a substantially better answer on how to parse it. If you need to work on every page and correctly parse everything, you had better have a fairly large team and several years to work with, and even then you will still have lots of small edge cases.

So in short: no, an EBNF grammar is not well suited to parsing MediaWiki markup, but nothing really is...

Louis Gerbarg
Perfect, this was the answer I was looking for. Thanks!
toluju