views:

408

answers:

3

Hi,

I downloaded wikipedia dump and now want to remove the wikipedia markup in the contents of each page. I tried writing regular expressions but they are too many to handle. I found a python library but I need a java library because, I want to integrate into my code.

Thank you.

+1  A: 

Mylyn WikiText can convert various Wiki syntaxes into HTML and other formats. It also supports MediaWiki syntax, which is what Wikipedia uses. Although Mylyn WikiText is primarily an Eclipse plugin, it is also available as standalone library.

Peter Štibraný
I just need a function which can remove the wiki markup from the content. I am not sure how to use mylyn to remove the markup. Can you tell me how to do it.
Algorist
@Algorist: Mylyn WikiText doesn't remove markup, it converts into other formats. I'm sorry, I have misread your question.
Peter Štibraný
+2  A: 

Do it in two steps:

  1. let some existing tool convert the MediaWiki mark-up into plain HTML;
  2. convert the plain HTML into text.

The following demo:

import net.java.textilej.parser.MarkupParser;
import net.java.textilej.parser.builder.HtmlDocumentBuilder;
import net.java.textilej.parser.markup.mediawiki.MediaWikiDialect;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.StringReader;
import java.io.StringWriter;

public class Test {

    public static void main(String[] args) throws Exception {

        String markup = "This is ''italic'' and '''that''' is bold. \n"+
                "=Header 1=\n"+
                "a list: \n* item A \n* item B \n* item C";

        StringWriter writer = new StringWriter();

        HtmlDocumentBuilder builder = new HtmlDocumentBuilder(writer);
        builder.setEmitAsDocument(false);

        MarkupParser parser = new MarkupParser(new MediaWikiDialect());
        parser.setBuilder(builder);
        parser.parse(markup);

        final String html = writer.toString();
        final StringBuilder cleaned = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                public void handleText(char[] data, int pos) {
                    cleaned.append(new String(data)).append(' ');
                }
        };
        new ParserDelegator().parse(new StringReader(html), callback, false);

        System.out.println(markup);
        System.out.println("---------------------------");
        System.out.println(html);
        System.out.println("---------------------------");
        System.out.println(cleaned);
    }
}

produces:

This is ''italic'' and '''that''' is bold. 
=Header 1=
a list: 
* item A 
* item B 
* item C
---------------------------
<p>This is <i>italic</i> and <b>that</b> is bold. </p><h1 id="Header1">Header 1</h1><p>a list: </p><ul><li>item A </li><li>item B </li><li>item C</li></ul>
---------------------------
This is  italic  and  that  is bold. Header 1 a list: item A item B item C 
Bart Kiers
+1  A: 

Try the Mediawiki text to plain text approach. You probably have to improve the PlainTextConverter class for your needs. Combined with the example for converting Wikipedia texts to HTML you can transclude template contents.

axelclk