I am trying to parse a huge (25 GB+) Wikipedia XML dump. Any solution that will help would be appreciated, preferably one in Java.
Of course it's possible to parse huge XML files with Java, but you have to use the right kind of XML parser: for example, a SAX parser, which processes the data element by element, rather than a DOM parser, which tries to load the whole document into memory.
It's impossible to give you a complete solution because your question is very general: what exactly do you want to do with the data?
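As a starting point, here is a minimal SAX skeleton that streams the dump and counts &lt;page&gt; elements. The filename and the counting logic are placeholder assumptions; you would replace them with whatever processing you actually need:

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class WikiPageCounter {

        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            PageHandler handler = new PageHandler();
            // The parser streams the file and fires callbacks per element,
            // so memory use stays constant regardless of file size.
            parser.parse(new File("enwiki-latest-pages-articles.xml"), handler);
            System.out.println("Pages seen: " + handler.pages);
        }

        static class PageHandler extends DefaultHandler {
            long pages = 0;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if ("page".equals(qName)) {
                    pages++;
                }
            }
        }
    }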
I would go with StAX, as it provides more flexibility than SAX (which is also a good option).
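A minimal StAX sketch, assuming for illustration that you just want to pull every &lt;title&gt; out of the dump (the filename is a placeholder):

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class WikiTitleReader {

        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            try (FileInputStream in = new FileInputStream("enwiki-latest-pages-articles.xml")) {
                XMLStreamReader reader = factory.createXMLStreamReader(in);
                while (reader.hasNext()) {
                    // Unlike SAX, the application pulls events at its own pace.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "title".equals(reader.getLocalName())) {
                        // Reads the text content up to the matching end tag.
                        System.out.println(reader.getElementText());
                    }
                }
                reader.close();
            }
        }
    }

The pull model also makes it easy to skip ahead or stop early, which matters on a file this size.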
Yep, right, do not use DOM. If you only want to read a small amount of the data and store it in your own POJOs, you can also use an XSLT transformation: transform the data into a smaller XML format, which is then converted to POJOs using Castor or JAXB (XML-to-object binding libraries).
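One way to sketch that idea without the XSLT step is to combine StAX with JAXB: position a stream reader on each fragment and let JAXB unmarshal just that fragment into a POJO. This assumes a simple un-namespaced &lt;page&gt;&lt;title&gt;...&lt;/title&gt;&lt;/page&gt; structure (the real Wikipedia schema is namespaced and richer) and uses javax.xml.bind, which ships with the JDK through Java 8 but is a separate dependency from Java 11 on:

    import java.io.FileInputStream;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Unmarshaller;
    import javax.xml.bind.annotation.XmlRootElement;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class JaxbFragmentDemo {

        // Hypothetical POJO matching a <page><title>...</title></page> fragment.
        @XmlRootElement(name = "page")
        public static class Page {
            public String title;
        }

        public static void main(String[] args) throws Exception {
            Unmarshaller unmarshaller =
                    JAXBContext.newInstance(Page.class).createUnmarshaller();
            XMLInputFactory factory = XMLInputFactory.newInstance();
            try (FileInputStream in = new FileInputStream("dump.xml")) {
                XMLStreamReader reader = factory.createXMLStreamReader(in);
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "page".equals(reader.getLocalName())) {
                        // Unmarshal only this fragment into a POJO; the rest
                        // of the document is never held in memory.
                        Page page = unmarshaller.unmarshal(reader, Page.class).getValue();
                        System.out.println(page.title);
                    }
                }
                reader.close();
            }
        }
    }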
Please share how you solved the problem so others can benefit from the approach.
Thanks.
--- EDIT ---
Check the link below for a better comparison of the different parsers. It seems that StAX is the better fit here because the application keeps control of the parsing and pulls data from the parser only when it needs to.
http://java.sun.com/webservices/docs/1.6/tutorial/doc/SJSXP2.html