Is it possible, using pywikipedia, to get just the text of the page, without any of the internal links or templates & without the pictures etc.?
Cheers!
If you mean "I want to get the wikitext only", then look at the wikipedia.Page class and its get method.
import wikipedia
site = wikipedia.getSite('en', 'wikipedia')
page = wikipedia.Page(site, 'Test')
print(page.get())
# '''Test''', '''TEST''' or '''Tester''' may refer to:
# ==Science and technology==
# * [[Concept inventory]] - an assessment to reveal student thinking on a topic.
# ...
This way you get the complete, raw wikitext from the article.
If you want to strip out the wiki syntax, as in transforming [[Concept inventory]] into Concept inventory and so on, it is going to be a bit more painful.
The main reason for this trouble is that the MediaWiki wiki syntax has no defined grammar, which makes it really hard to parse and to strip. I currently know of no software that lets you do this accurately. There's the MediaWiki Parser class of course, but it's PHP, a bit hard to grasp, and its purpose is very different.
But if you only want to strip out links or very simple wiki constructs, use regexes:
import re
text = re.sub(r'\[\[([^\]\|]*)\]\]', r'\1', 'Lorem ipsum [[dolor]] sit amet, consectetur adipiscing elit.')
print(text)  # Lorem ipsum dolor sit amet, consectetur adipiscing elit.
and then for piped links:
text = re.sub(r'\[\[(?:[^\]\|]*)\|([^\]\|]*)\]\]', r'\1', 'Lorem ipsum [[dolor|DOLOR]] sit amet, consectetur adipiscing elit.')
print(text)  # Lorem ipsum DOLOR sit amet, consectetur adipiscing elit.
and so on.
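If it helps, the two substitutions can be folded into a single pass. This is only a sketch: the names WIKILINK and strip_simple_links are made up for illustration and are not part of pywikipedia, and it only handles simple, non-nested links (no images with captions, no templates).

```python
import re

# Matches [[target]] and [[target|label]].
# Group 1 is the link target, group 2 is the optional label after the pipe.
WIKILINK = re.compile(r'\[\[([^\[\]|]*)(?:\|([^\[\]|]*))?\]\]')

def strip_simple_links(text):
    """Replace each simple wikilink with the text a reader would see."""
    def visible(match):
        target, label = match.group(1), match.group(2)
        # Use the label when the link is piped, otherwise the target itself.
        return label if label is not None else target
    return WIKILINK.sub(visible, text)

print(strip_simple_links('Lorem [[ipsum]] dolor [[sit|SIT]] amet.'))
# Lorem ipsum dolor SIT amet.
```

Anything the pattern does not match, such as a link containing another link, is left untouched rather than mangled.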
But, for example, there is no reliable, easy way to strip out nested templates from a page. The same goes for images that have links in their captions. It's quite hard: it involves recursively removing the innermost link, replacing it with a marker, and starting over. Have a look at the templateWithParams function in wikipedia.py if you want, but it's not pretty.
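To give an idea of the "innermost first, then start over" approach, here is a rough sketch for templates. It is not what wikipedia.py does: it simply deletes each innermost template instead of replacing it with a marker, and the names INNERMOST_TEMPLATE, strip_templates and max_passes are invented for this example. It will likely misbehave on stray single braces and on triple-brace parameters such as {{{1}}}.

```python
import re

# An "innermost" template: {{ ... }} with no other braces inside it.
INNERMOST_TEMPLATE = re.compile(r'\{\{[^{}]*\}\}')

def strip_templates(text, max_passes=50):
    """Delete innermost templates repeatedly until a pass changes nothing."""
    for _ in range(max_passes):
        stripped = INNERMOST_TEMPLATE.sub('', text)
        if stripped == text:
            # Nothing left that looks like a complete innermost template.
            break
        text = stripped
    return text

sample = 'Before{{outer|x={{inner|1}}|y=2}} after.'
print(strip_templates(sample))
# Before after.
```

Each pass peels off one level of nesting, so a template nested three deep needs three passes; max_passes is only a safety cap.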