
Hi people, I'm quite new to MediaWiki and I have a problem. I have the title of a wiki page, and I want to get just the text of that page using api.php. Everything I have found in the API returns the wiki content of the page, including the wiki syntax symbols. This is the request I have used:

/api.php?action=query&prop=revisions&rvlimit=1&rvprop=content&format=xml&titles=test

But I need only the text, without any wiki syntax symbols. Is that possible with the wiki API?
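For reference, the request in the question could be issued from Java roughly like this. It is only a sketch: the class name `RevisionQuery` and the `apiBase` parameter are made up for illustration, and it assumes Java 11 or later for `java.net.http`. The response is still raw wikitext wrapped in XML, exactly as described in the question.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds and sends the revisions query from the question.
final class RevisionQuery {

    // apiBase is the full address of api.php on the wiki (an assumption of this sketch).
    static String buildUrl(String apiBase, String title) {
        return apiBase
                + "?action=query&prop=revisions&rvlimit=1&rvprop=content&format=xml&titles="
                + URLEncoder.encode(title, StandardCharsets.UTF_8);
    }

    // Returns the XML response body, which contains the page's wikitext.
    static String fetch(String apiBase, String title) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(buildUrl(apiBase, title)))
                .GET()
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Encoding the title matters as soon as it contains spaces or non-ASCII characters; `URLEncoder` turns a space into `+`, which is read back as a space in a query string.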

A: 

Wiki pages without any formatting symbols wouldn't really make much sense in many cases.

You can strip out the formatting yourself, if you want, but you'll break some stuff in the process.

(Unless you are creating something like a search engine, in which case you only need the text parts and can ignore the formatting symbols completely.)
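A rough sketch of stripping the formatting yourself, in Java since that is what the asker's application is written in. The class name `WikitextStripper` and its rules are illustrative assumptions, not anything provided by MediaWiki. It covers only a few common constructs (references, templates, links, headings, bold and italic) and, as said above, it breaks things: whatever a template would have rendered is simply dropped.

```java
// Hypothetical, deliberately lossy wikitext-to-text converter.
final class WikitextStripper {

    static String strip(String wikitext) {
        String s = wikitext;

        // Drop footnotes: self-closing <ref .../> first, then <ref>...</ref> pairs.
        s = s.replaceAll("(?is)<ref\\b[^>]*/>", "");
        s = s.replaceAll("(?is)<ref\\b[^>]*>.*?</ref>", "");
        // Drop any remaining HTML-like tags but keep their text.
        s = s.replaceAll("<[^>]+>", "");

        // Remove templates from the inside out so nested {{...}} disappear too.
        String previous;
        do {
            previous = s;
            s = s.replaceAll("\\{\\{[^{}]*\\}\\}", "");
        } while (!s.equals(previous));

        // [[Target|label]] becomes "label", [[Target]] becomes "Target".
        s = s.replaceAll("\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]", "$1");
        // [http://example.org label] becomes "label".
        s = s.replaceAll("\\[https?://\\S+\\s+([^\\]]*)\\]", "$1");

        // "== Heading ==" becomes "Heading".
        s = s.replaceAll("(?m)^=+[ \\t]*(.*?)[ \\t]*=+[ \\t]*$", "$1");

        // Remove bold and italic quote runs ('' and ''').
        s = s.replaceAll("'{2,}", "");

        return s.trim();
    }
}
```

For example, `WikitextStripper.strip("'''Test''' is a [[wiki|sample]] page.")` yields `Test is a sample page.`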

Joel L
+1  A: 

I don't think it is possible using the API to get just the text.

What has worked for me was to request the HTML page (using the normal URL that you would use in a browser) and strip out the HTML tags under the content div.

EDIT:

I have had good results using HTML Parser for Java. It has examples of how to strip out HTML tags under a given DIV.
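To illustrate the idea without depending on any particular library, here is a minimal sketch that uses only regular expressions from the Java standard library. It is not the HTML Parser API; the class name `ContentDivText` is made up, and the div id passed in (for example `content`) is an assumption that depends on the wiki's skin. Regex-based HTML handling is fragile, so treat this as a demonstration rather than a replacement for a real parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the plain text under <div id="..."> from a rendered page.
final class ContentDivText {

    private static final Pattern DIV_TAG =
            Pattern.compile("<(/?)div\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the markup between the opening tag of the div with the given id
    // and its matching closing tag, counting nested divs along the way.
    static String innerHtmlOfDiv(String html, String id) {
        Pattern open = Pattern.compile(
                "<div\\b[^>]*\\bid=\"" + Pattern.quote(id) + "\"[^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher start = open.matcher(html);
        if (!start.find()) {
            return "";
        }
        int depth = 1;
        int pos = start.end();
        Matcher tag = DIV_TAG.matcher(html);
        while (tag.find(pos)) {
            depth += tag.group(1).isEmpty() ? 1 : -1;
            if (depth == 0) {
                return html.substring(start.end(), tag.start());
            }
            pos = tag.end();
        }
        // Unbalanced markup: fall back to everything after the opening tag.
        return html.substring(start.end());
    }

    // Removes scripts, styles and tags, decodes a few common entities,
    // and collapses whitespace.
    static String stripTags(String html) {
        String text = html.replaceAll("(?is)<(script|style)\\b.*?</\\1>", " ");
        text = text.replaceAll("(?s)<[^>]*>", " ");
        text = text.replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        return text.replaceAll("\\s+", " ").trim();
    }

    static String extractText(String html, String divId) {
        return stripTags(innerHtmlOfDiv(html, divId));
    }
}
```

Calling `ContentDivText.extractText(pageHtml, "content")` on the downloaded page then returns the rendered text with navigation and footer left out, provided the skin really uses that id.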

Eric Normand
I have done the same thing. I have a Java app that must receive the text content of a wiki page. When I use the API and receive the page in wiki syntax, it works very fast, but I need clean text. I have tried requesting the HTML page and stripping out the HTML tags, but that is slow, which is why I asked about this feature in the wiki API. Or maybe you know a good wiki-syntax-to-plain-text converter for Java, so that I can convert it directly in Java?
Le_Coeur
The real issue with Wikipedia's markup language is that it is Turing complete. If you look closely at the source of a page, you will notice all sorts of custom functions. The definitions of those functions have to be fetched as well and then interpreted, which might expand to yet more functions. That is why I resorted to HTML parsing: the HTML contains the complete, rendered text.
Eric Normand