Hello, I am trying to parse a Wikipedia XML dump using Parse-MediaWikiDump-1.0.4 together with the wikiprep.pl script. The script seems to work with version 0.3 dumps but not with the latest version 0.4 dumps. I get the following error:

Can't locate object method "page" via package "Parse::MediaWikiDump::Pages" at wikiprep.pl line 390.

Also, the Parse-MediaWikiDump-1.0.4 documentation at http://search.cpan.org/~triddle/Parse-MediaWikiDump-1.0.4/lib/Parse/MediaWikiDump/Pages.pm says, under LIMITATIONS, Version 0.4: "This class was updated to support version 0.4 dump files from a MediaWiki instance but it does not currently support any of the new information available in those files."

Any workarounds would help me get to the next level.

Note: one may wonder why we cannot use a SAX or StAX parser directly. The Wikipedia dump is a single file of more than 25 GB, so stack and memory problems are obvious. The Perl script above gets around this, but I am currently stuck on this version problem.

+1  A: 

Any streaming parser should work just fine (DOM parsers would blow up). Try XML::Twig, just remember to flush (if you want to print out the XML) or purge (if you don't care about the XML) after every major record.

Or just use XML::Parser directly. That is what both XML::Twig and Parse::MediaWikiDump are using under the hood to parse the XML.
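To make the "handle one record, then throw it away" idea concrete, here is a minimal sketch of the same pattern. It is written in Python with the standard library's `xml.etree.ElementTree.iterparse` rather than in Perl, so it is an analogue of what purge does in XML::Twig, not the XML::Twig API itself. The helper names `local` and `iter_pages` and the tiny `SAMPLE` document are made up for illustration; a real dump has many more elements per page.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump file (illustrative only, not real dump content).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/" version="0.4">
  <page>
    <title>Perl</title>
    <revision><text>Perl is a programming language.</text></revision>
  </page>
  <page>
    <title>XML</title>
    <revision><text>XML is a markup language.</text></revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page>, discarding each page afterwards."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local(elem.tag) == "page":
            title, text = None, None
            for node in elem.iter():
                name = local(node.tag)
                if name == "title":
                    title = node.text
                elif name == "text":
                    text = node.text or ""
            yield title, text
            # Equivalent in spirit to purging: drop everything parsed so far
            # so that memory use does not grow with the size of the file.
            root.clear()


if __name__ == "__main__":
    for title, text in iter_pages(io.BytesIO(SAMPLE)):
        print(title, len(text))
```

For a real dump you would pass the file path (or an open binary file) to `iter_pages` instead of the in-memory sample. Because tag names are matched without their namespace, the same loop should not care whether the file declares the 0.3 or the 0.4 export namespace.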

Chas. Owens
Thanks very much... this info will be very helpful indeed, I appreciate it.
syed
A: 

Hello Syed, I am running into the same problem too! Have you figured something out based on Owens's response? Thanks.