Is it possible, using pywikipedia, to get just the text of the page, without any of the internal links or templates & without the pictures etc.?

Cheers!

+1  A: 

If you mean "I want to get the wikitext only", then look at the wikipedia.Page class and its get method.

import wikipedia

site = wikipedia.getSite('en', 'wikipedia')
page = wikipedia.Page(site, 'Test')

print page.get()
# '''Test''', '''TEST''' or '''Tester''' may refer to:
# ==Science and technology==
# * [[Concept inventory]] - an assessment to reveal student thinking on a topic.
# ...

This way you get the complete, raw wikitext from the article.

If you want to strip out the wiki syntax, i.e. transform [[Concept inventory]] into Concept inventory and so on, it is going to be a bit more painful.

The main reason for this trouble is that the MediaWiki wiki syntax has no defined grammar, which makes it really hard to parse and to strip. I currently know of no software that does this accurately. There is the MediaWiki Parser class of course, but it's PHP, a bit hard to grasp, and its purpose is very different.

But if you only want to strip out links or other very simple wiki constructs, you can use regexes:

import re

text = re.sub(r'\[\[([^\]|]*)\]\]', r'\1', 'Lorem ipsum [[dolor]] sit amet, consectetur adipiscing elit.')
print text # Lorem ipsum dolor sit amet, consectetur adipiscing elit.

and then for piped links:

text = re.sub(r'\[\[(?:[^\]|]*)\|([^\]|]*)\]\]', r'\1', 'Lorem ipsum [[dolor|DOLOR]] sit amet, consectetur adipiscing elit.')
print text # Lorem ipsum DOLOR sit amet, consectetur adipiscing elit.

and so on.
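If it helps, here is a minimal sketch that chains a few such regexes into one helper. The name strip_simple_wikitext is made up for illustration (it is not part of pywikipedia), and it only covers piped links, plain links and bold/italic quote markers:

def strip_simple_wikitext(text):
    """Strip a few simple wiki constructs with regexes (hypothetical helper).

    Only handles piped links, plain links and bold/italic quotes;
    nested templates and images are NOT handled here.
    """
    # [[target|label]] -> label (piped links first, so the plain-link
    # pattern does not leave a stray pipe behind)
    text = re.sub(r'\[\[(?:[^\]|]*)\|([^\]|]*)\]\]', r'\1', text)
    # [[target]] -> target
    text = re.sub(r'\[\[([^\]|]*)\]\]', r'\1', text)
    # '''bold''' and ''italic'' markers -> plain text
    text = re.sub(r"'{2,}", '', text)
    return text

print strip_simple_wikitext("'''Lorem''' ipsum [[dolor|DOLOR]] sit [[amet]].")
# Lorem ipsum DOLOR sit amet.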

But, for example, there is no reliable, easy way to strip nested templates from a page, and the same goes for images that have links in their captions. It's quite hard: it involves recursively removing the innermost construct, replacing it with a marker, and starting over. Have a look at the templateWithParams function in wikipedia.py if you want, but it's not pretty.
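Just to illustrate the idea (this is only a rough sketch of the "remove the innermost construct and start over" approach, not what templateWithParams actually does), something like this removes nested {{...}} templates by repeatedly deleting the innermost ones:

def strip_templates(text):
    """Rough sketch: repeatedly remove innermost {{...}} templates.

    Simplified illustration only; it will still mishandle templates
    containing piped wikilinks, <nowiki> sections, etc.
    """
    # An "innermost" template contains no further braces inside it.
    innermost = re.compile(r'\{\{[^{}]*\}\}')
    while innermost.search(text):
        text = innermost.sub('', text)
    return text

print strip_templates('Intro {{outer|{{inner|x}}|y}} outro')
# Intro  outro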

NicDumZ
Clearly I misunderstood the scope of the problem. Just tried my best given that there were no other answers. :-)
cdleary