I am attempting to parse MediaWiki markup (in Java) as found on Wikipedia. There are a number of existing packages for this task, but none I have found fits my needs particularly well. The best package I have worked with is the Matheclipse Bliki parser, which does a decent job on most pages.
This parser is incomplete, however, and fails to parse certain pages or parses others incorrectly. Sadly, the code is rather messy, so fixing the problems in this parsing engine is time-consuming and error-prone.
In search of a better parsing engine I have investigated using an EBNF-based parser for this task (specifically ANTLR). After some attempts, however, it seems that this approach isn't particularly well suited to the task: MediaWiki markup is relatively relaxed and so cannot easily be forced into a structured grammar.
My experience with ANTLR and similar parsers is very limited, however, so it may be my inexperience causing problems rather than such parsers being inherently ill-suited to the task. Can anyone with more experience on these topics weigh in?
@Stobor: I've mentioned that I've looked at various parsing engines, including the ones returned by the Google query. The best I've found so far is the Bliki engine. The problem is that fixing bugs in such parsers becomes incredibly tedious, because they are all essentially long chains of conditionals and regular expressions, i.e. spaghetti code. I am looking for something more akin to EBNF-style parsing, since that method is much clearer and more concise, and thus easier to understand and evolve. I've seen the MediaWiki link you posted, and it seems to confirm my suspicion that EBNF out of the box is poorly suited to this task. So I'm looking for a parsing engine that is as clear and understandable as EBNF, but also capable of handling the messy syntax of wiki markup.
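To make the "relaxed markup" problem concrete, here is a minimal, hypothetical sketch (not taken from Bliki or any other library) of tolerant handling for a single wiki construct, `'''bold'''`. The awkward part for a strict EBNF grammar is the fallback: an unmatched `'''` is not a syntax error in wikitext, it simply stays literal text. A hand-rolled scanner expresses that fallback trivially; a conventional grammar production for `bold := "'''" text "'''"` has no natural place for it. The class and method names here are made up for illustration, and the fallback rule is one plausible choice, not necessarily what MediaWiki itself does in every edge case.

```java
// Hypothetical sketch: tolerant rendering of '''bold''' wikitext.
// Unmatched ''' markers fall back to literal text instead of failing,
// which is the kind of behaviour that is awkward to encode in a strict
// EBNF grammar but trivial in an ad-hoc scanner.
public class TolerantBold {
    public static String render(String wikitext) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < wikitext.length()) {
            if (wikitext.startsWith("'''", i)) {
                int close = wikitext.indexOf("'''", i + 3);
                if (close >= 0) {
                    // Matched pair: emit the span as bold.
                    out.append("<b>")
                       .append(wikitext, i + 3, close)
                       .append("</b>");
                    i = close + 3;
                } else {
                    // No closing marker: keep the quotes as literal text
                    // rather than reporting a parse error.
                    out.append("'''");
                    i += 3;
                }
            } else {
                out.append(wikitext.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("a '''b''' c"));    // a <b>b</b> c
        System.out.println(render("unclosed ''' d")); // unclosed ''' d
    }
}
```

Real wikitext is of course far messier (nested italics, apostrophe ambiguity, templates), but every construct needs this same "on failure, emit literally" escape hatch, which is why the regex-and-conditional style keeps winning in practice even though it produces spaghetti.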