For example, using the result of this Wikipedia API query:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=lebron%20james&rvprop=content&redirects=true&format=xmlfm

Is there an existing Python library that I can use to create an array of subject/value mappings?

For example:

{height_ft,6},{nationality, American}
A: 

There's some information on Python and XML libraries here.

If you're asking whether there's an existing library designed specifically to parse Wiki(pedia) XML and match your requirements, that's doubtful. However, you can use one of the existing libraries to traverse the DOM and pull out the data you need.
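
For example, here's a rough, untested sketch using the standard library to fetch the query from the question (with format=xml rather than the human-readable xmlfm) and pull out the raw wikitext; the element names are assumptions about the usual shape of that response:

    import urllib.request
    import xml.etree.ElementTree as ET

    # Same query as in the question, but format=xml instead of xmlfm.
    url = ("http://en.wikipedia.org/w/api.php?action=query&prop=revisions"
           "&titles=lebron%20james&rvprop=content&redirects=true&format=xml")
    doc = ET.fromstring(urllib.request.urlopen(url).read())

    # The revision text normally lives in a <rev> element; adjust if your
    # response looks different.
    wikitext = doc.findtext(".//rev") or ""
    print(wikitext[:500])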

Another option is to write an XSLT stylesheet that does something similar and call it using lxml. This also lets you call Python functions from inside the XSLT, so you get the best of both worlds.
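
A minimal, untested sketch of that approach (the my: namespace and the shout function are invented for the example):

    from lxml import etree

    # Register a Python function so the stylesheet can call it.
    ns = etree.FunctionNamespace("http://example.org/myfunctions")

    def shout(context, text):
        # XPath hands us the evaluation context first, then the arguments.
        return text.upper()

    ns["shout"] = shout

    transform = etree.XSLT(etree.XML("""
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:my="http://example.org/myfunctions">
      <xsl:output method="text"/>
      <xsl:template match="/">
        <xsl:value-of select="my:shout(string(//rev))"/>
      </xsl:template>
    </xsl:stylesheet>
    """))

    # lebron_james.xml: the API response from the question, saved locally.
    print(str(transform(etree.parse("lebron_james.xml"))))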

imoatama
A: 

I would look at using Beautiful Soup and just fetching the Wikipedia page as HTML instead of using the API.

I'll try to post an example.
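
Something along these lines should work (untested; the "infobox" class name is a guess about how the rendered page is laid out):

    import urllib.request
    from bs4 import BeautifulSoup

    url = "http://en.wikipedia.org/wiki/LeBron_James"
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(urllib.request.urlopen(req).read(), "html.parser")

    # Collect the infobox rows as a header -> value mapping.
    data = {}
    infobox = soup.find("table", class_="infobox")
    if infobox is not None:
        for row in infobox.find_all("tr"):
            th, td = row.find("th"), row.find("td")
            if th and td:
                data[th.get_text(" ", strip=True)] = td.get_text(" ", strip=True)

    print(data)  # roughly: {'Nationality': 'American', 'Listed height': '6 ft 9 in', ...}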

Zimm3r
+3  A: 

It looks like you really want to be able to parse MediaWiki markup. There is a Python library designed for this purpose called mwlib. You can use Python's built-in XML packages to extract the page content from the API's response, then pass that content into mwlib's parser to produce an object representation that you can browse and analyse in code to extract the information you want. mwlib is BSD licensed.
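
A small, untested sketch using the simpleparse helper from the mwlib tutorial (method and attribute names may differ between mwlib versions, so inspect the objects you get back):

    from mwlib.uparser import simpleparse

    # lebron_james.wiki: the raw markup pulled out of the API response above.
    wikitext = open("lebron_james.wiki").read()

    article = simpleparse(wikitext)   # mwlib's parse tree (an Article node)
    for node in article.children:     # walk the top level and inspect node types
        print(type(node).__name__)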

chaos95
Thanks for the help. I tried the mwlib tutorial at the link you gave me, but I'm not sure how to manipulate the Article object that simpleparse returns. For example, how would I rebuild all of the data into XML format with the appropriate titles?
tomwu
A: 

Just stumbled over a library on PyPI, wikidump, that claims to provide

Tools to manipulate and extract data from wikipedia dumps

I haven't used it yet, so you're on your own trying it out...

PhilS
A: 

You're probably looking for Pywikipediabot for manipulating the Wikipedia API.
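
A rough sketch with pywikibot, the current incarnation of the Pywikipediabot framework (it needs a small user config file; see its docs):

    import pywikibot

    site = pywikibot.Site("en", "wikipedia")
    page = pywikibot.Page(site, "LeBron James")
    print(page.text[:500])   # the raw wiki markup, ready for further parsing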