I'm working on a Python script that transforms this:

foo
bar

Into this:

[[Component foo]]
[[bar]]

The script checks, per input line, whether the page "Component foo" exists. If it exists, a link to that page is created; if it doesn't, a direct link is created.
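For illustration, the per-line logic is roughly this (page_exists() is a placeholder for whatever existence check I end up using):

def transform(line):
    # "Component foo" exists -> link to it; otherwise link the title directly
    if page_exists("Component " + line):
        return "[[Component %s]]" % line
    else:
        return "[[%s]]" % line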

The problem is that I need a quick and cheap way to check whether a lot of wiki pages exist. I don't want to (try to) download all the 'Component' pages.

I already figured out a fast way to do this by hand: edit a new wiki page, paste all the 'Component' links into it, press Preview, and save the resulting preview HTML page. That HTML file contains a different kind of link for existing pages than for non-existing ones.

So to rephrase my question: How can I save a mediawiki preview page in Python?

(I don't have local access to the database.)

+3  A: 

If you have local access to the wiki database, it might be easiest to do a query against the database to see whether each page exists.

If you only have HTTP access, you might try the mechanize library which lets you programmatically automate tasks that would otherwise require a browser.
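A rough sketch of that approach (untested; "editform", "wpTextbox1" and "wpPreview" are the form and field names of a default MediaWiki install, so verify them against your wiki's edit page):

import mechanize

br = mechanize.Browser()
# open the edit form of some scratch page on the wiki
br.open("http://yourwiki.example.org/index.php?title=Sandbox&action=edit")
br.select_form(name="editform")
br["wpTextbox1"] = "[[Component foo]]\n[[Component bar]]\n"
response = br.submit(name="wpPreview")  # press the "Show preview" button
html = response.read()  # existing and missing pages get different link markup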

Greg Hewgill
A: 

You should be able to use the MediaWiki API. http://www.mediawiki.org/wiki/API (maybe under Queries or Creating/Editing)

I'm not too familiar with it, but you could, for example, compare the output for an existing page with that for a nonexistent one.

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Bill_Gates&rvprop=timestamp

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=NONEXISTENT_PAGE&rvprop=timestamp
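Turned into code, that comparison could look like this (untested sketch; it adds format=xml and uses the missing="" marker as the signal - a plain string check is crude, but enough to show the idea):

import urllib

def page_exists(title):
    url = ("http://en.wikipedia.org/w/api.php?action=query"
           "&prop=revisions&rvprop=timestamp&format=xml&titles=%s"
           % urllib.quote(title))
    return 'missing=""' not in urllib.urlopen(url).read()

print page_exists("Bill_Gates")        # True
print page_exists("NONEXISTENT_PAGE")  # False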

Gabe
+2  A: 

You can definitely use the API to check if a page exists:

# Assuming words is a list of the page titles you wish to query for
import urllib

# replace en.wikipedia.org with the address of the wiki you want to access
query = ("http://en.wikipedia.org/w/api.php?action=query&titles=%s&format=xml"
         % urllib.quote("|".join(words)))
pages = urllib.urlopen(query)

Now pages will contain XML like this:

<?xml version="1.0"?><api><query><pages>
   <page ns="0" title="DOESNOTEXIST" missing="" />
   <page pageid="600799" ns="0" title="FOO" />
   <page pageid="11178" ns="0" title="Foobar" />
</pages></query></api>

Pages which don't exist still appear in the result, but with the missing="" attribute set, as can be seen above. You can also check for the invalid attribute to be on the safe side.

Now you can use your favorite XML parser to check for these attributes and react accordingly.

See also: http://www.mediawiki.org/wiki/API:Query
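For example, with the standard library's ElementTree, feeding it the pages response from above (a sketch; the attribute names match the sample output):

from xml.etree import ElementTree

tree = ElementTree.parse(pages)
existing, missing = [], []
for page in tree.findall(".//page"):
    # missing="" (or invalid="") marks pages that don't exist
    if page.get("missing") is not None or page.get("invalid") is not None:
        missing.append(page.get("title"))
    else:
        existing.append(page.get("title"))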

Garns
A: 

Since the pages are stored in a database, you will have to access it one way or another. As you don't have local access, the API, as suggested, is probably the way to go - but there may be alternatives.

http://www.mwusers.com/forums/forum.php

It seems to be THE place for questions like this. I have seen questions requiring intimate knowledge of MediaWiki's internals answered quickly and comprehensively on this forum.

mickeyf
+2  A: 

Use pywikipedia to interact with the MediaWiki software. It's probably the most powerful bot framework available.
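An existence check in the classic pywikipedia framework might look like this (sketch; it assumes a configured user-config.py, and note that module names changed in later pywikibot releases):

import wikipedia

site = wikipedia.getSite()
for word in words:
    page = wikipedia.Page(site, "Component " + word)
    if page.exists():
        print "[[Component %s]]" % word
    else:
        print "[[%s]]" % word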

poke