Hi,

I downloaded a Wikipedia dump and want to convert it from wiki format to my own object format. Is there a wiki parser available that converts the wiki markup into XML?

Thank you

+1  A: 

This might help: a page with converters from MediaWiki to other formats, including DocBook. DocBook is a standard XML-based format that might fit your needs (an XML representation of MediaWiki content).

Andreas_D
+3  A: 

See java-wikipedia-parser. I have never used it, but according to the docs:

The parser comes with an HTML generator. You can however control the output that is being generated by passing your own implementation of the be.devijver.wikipedia.Visitor interface.

dogbane
+2  A: 

I don't know exactly what the XML format of the Wikipedia dump looks like. But if part of the text is in Wikipedia markup, I suggest looking into http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html. It is one of the classes in a Wikipedia package for Apache Lucene. I haven't used it, but Apache Lucene is quite a mature project, so this package is worth a try, even though it is marked experimental.
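For what it's worth, the dump files are themselves XML: each article is a `<page>` element with a `<title>` and, under `<revision>`, a `<text>` element holding the raw wiki markup. So the first step (getting from the dump to your own objects) needs no wiki parser at all. Below is a minimal sketch using the JDK's built-in StAX reader. The `WikiDumpReader` and `Page` names are made up for illustration, and it only extracts the title and the raw markup; the markup itself would still have to go through one of the parsers mentioned in this thread.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library mentioned in this thread.
final class WikiDumpReader {

    // Hypothetical target object: a page title plus its raw wiki markup.
    static final class Page {
        final String title;
        final String wikiText;

        Page(String title, String wikiText) {
            this.title = title;
            this.wikiText = wikiText;
        }
    }

    // Streams through dump-style XML and collects one Page per <page> element.
    static List<Page> readPages(Reader in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader xml = factory.createXMLStreamReader(in);
        List<Page> pages = new ArrayList<>();
        String title = null;
        String text = null;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                // getLocalName() ignores the namespace declared on <mediawiki>.
                String name = xml.getLocalName();
                if (name.equals("page")) {
                    title = null;
                    text = null;
                } else if (name.equals("title")) {
                    title = xml.getElementText();
                } else if (name.equals("text")) {
                    text = xml.getElementText();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && xml.getLocalName().equals("page")) {
                pages.add(new Page(title, text));
            }
        }
        xml.close();
        return pages;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Tiny hand-written sample in the shape of a dump, not real dump data.
        String sample =
            "<mediawiki><page><title>Example</title>"
            + "<revision><text>'''Hello''' [[world]]</text></revision>"
            + "</page></mediawiki>";
        for (Page p : readPages(new StringReader(sample))) {
            System.out.println(p.title + " -> " + p.wikiText);
        }
    }
}
```

For a full dump you would not want to collect every page in a list; handing each `Page` to a callback as soon as its closing `</page>` is seen keeps memory use flat.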

Skarab