ansaurus

Question

Wikipedia: Java library to remove Wikipedia text markup

Answer 1

+1 A:

Mylyn WikiText can convert various wiki syntaxes into HTML and other formats. It also supports MediaWiki syntax, which is what Wikipedia uses. Although Mylyn WikiText is primarily an Eclipse plugin, it is also available as a standalone library.

Peter Štibraný 2010-05-19 06:27:42

I just need a function that removes the wiki markup from the content. I am not sure how to use Mylyn to remove the markup. Can you tell me how to do it?

Algorist 2010-05-19 06:42:32

@Algorist: Mylyn WikiText doesn't remove markup; it converts it into other formats. I'm sorry, I misread your question.

Peter Štibraný 2010-05-19 07:49:35

Answer 2

+2 A:

Do it in two steps:

1. let an existing tool convert the MediaWiki markup into plain HTML;
2. convert the plain HTML into text.

The following demo:

import net.java.textilej.parser.MarkupParser;
import net.java.textilej.parser.builder.HtmlDocumentBuilder;
import net.java.textilej.parser.markup.mediawiki.MediaWikiDialect;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.StringReader;
import java.io.StringWriter;

public class Test {

    public static void main(String[] args) throws Exception {

        String markup = "This is ''italic'' and '''that''' is bold. \n"+
                "=Header 1=\n"+
                "a list: \n* item A \n* item B \n* item C";

        // Step 1: render the MediaWiki markup as an HTML fragment.
        StringWriter writer = new StringWriter();

        HtmlDocumentBuilder builder = new HtmlDocumentBuilder(writer);
        // Emit only the body content, without the <html>/<body> wrapper.
        builder.setEmitAsDocument(false);

        MarkupParser parser = new MarkupParser(new MediaWikiDialect());
        parser.setBuilder(builder);
        parser.parse(markup);

        final String html = writer.toString();
        final StringBuilder cleaned = new StringBuilder();

        // Step 2: walk the HTML with the JDK's parser and keep only the text nodes.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleText(char[] data, int pos) {
                    // Append a space so adjacent text nodes do not run together.
                    cleaned.append(new String(data)).append(' ');
                }
        };
        new ParserDelegator().parse(new StringReader(html), callback, false);

        System.out.println(markup);
        System.out.println("---------------------------");
        System.out.println(html);
        System.out.println("---------------------------");
        System.out.println(cleaned);
    }
}

produces:

This is ''italic'' and '''that''' is bold. 
=Header 1=
a list: 
* item A 
* item B 
* item C
---------------------------
<p>This is <i>italic</i> and <b>that</b> is bold. </p><h1 id="Header1">Header 1</h1><p>a list: </p><ul><li>item A </li><li>item B </li><li>item C</li></ul>
---------------------------
This is  italic  and  that  is bold. Header 1 a list: item A item B item C
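
The second step (HTML to text) can also be tried on its own, since it needs only the JDK. The following is a minimal sketch, not part of the demo above: the class name HtmlToText and the helper toPlainText are hypothetical, and the whitespace collapsing at the end is an added assumption about what "clean" text should look like (it removes the double spaces visible in the output above).

```java
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.StringReader;

public class HtmlToText {

    // Hypothetical helper: collects the text nodes of an HTML fragment
    // and collapses runs of whitespace into single spaces.
    public static String toPlainText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // A trailing space keeps adjacent text nodes from running together.
                text.append(data).append(' ');
            }
        };
        // true = ignore any charset declaration found in the HTML.
        new ParserDelegator().parse(new StringReader(html), callback, true);

        // Assumption: single spaces between words are what the caller wants.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<p>This is <i>italic</i> and <b>that</b> is bold. </p>"
                + "<h1 id=\"Header1\">Header 1</h1>"
                + "<ul><li>item A </li><li>item B </li></ul>";
        System.out.println(toPlainText(html));
    }
}
```

Any HTML produced by step one can be passed to toPlainText; block boundaries such as headings and list items end up separated by a single space, so add explicit line breaks in handleText if the structure matters.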

Bart Kiers 2010-05-19 11:26:43

Answer 3

+1 A:

Try the MediaWiki-text-to-plain-text approach. You will probably have to adapt the PlainTextConverter class to your needs. Combined with the example for converting Wikipedia texts to HTML, you can also transclude template contents.

axelclk 2010-05-19 18:49:32
