I'm using cURL to retrieve information from Wikipedia. So far I've been able to retrieve the basic text, but I would really like to retrieve it as HTML.
Here is my code:
$s = curl_init();
$url = 'http://boss.yahooapis.com/ysearch/web/v1/site:en.wikipedia.org+'.$article_name.'?appid=myID';
curl_setopt($s,CURLOPT_URL, $url);
curl_setopt($s,CURLOPT_HEADER,false);
curl_setopt($s,CURLOPT_RETURNTRANSFER,1);
$rs = curl_exec($s);
$rs = Zend_Json::decode($rs);
$rs = ($rs['ysearchresponse']['resultset_web']);
$rs = array_shift($rs);
$article= str_replace('http://en.wikipedia.org/wiki/', '', $rs['url']);
$url = 'http://en.wikipedia.org/w/api.php?';
$url.='format=json';
$url.=sprintf('&action=query&titles=%s&rvprop=content&prop=revisions&redirects=1', $article);
curl_setopt($s,CURLOPT_URL, $url);
curl_setopt($s,CURLOPT_HEADER,false);
curl_setopt($s,CURLOPT_RETURNTRANSFER,1);
$rs = curl_exec($s);
//curl_close( $s );
$rs = Zend_Json::decode($rs);
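// array_pop() takes its argument by reference, so the calls cannot be nested; read the single page entry directly
$pages = $rs['query']['pages'];
$rs = reset($pages);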
$rs = array_shift($rs['revisions']);
$articleText = $rs['*'];
However, the text retrieved this way isn't suitable for display :( It all comes back as wikitext, like this:
'''Aix-les-Bains''' is a [[Communes of France|commune]] in the [[Savoie]] [[Departments of France|department]] in the [[Rhône-Alpes]] [[regions of France|region]] in southeastern [[France]].
It lies near the [[Lac du Bourget]], {{convert|9|km|mi|abbr=on}} by rail north of [[Chambéry]].
==History== ''Aix'' derives from [[Latin]] ''Aquae'' (literally, "waters"; ''cf'' [[Aix-la-Chapelle]] (Aachen) or [[Aix-en-Provence]]), and Aix was a bath during the [[Roman Empire]], even before it was renamed ''Aquae Gratianae'' to commemorate the [[Emperor Gratian]], who was assassinated not far away, in [[Lyon]], in [[383]]. Numerous Roman remains survive. [[Image:IMG 0109 Lake Promenade.jpg|thumb|left|Lac du Bourget Promenade]]
How do I get the HTML of the Wikipedia article?
UPDATE: Thanks, but I'm fairly new to this. Right now I'm trying to run an XPath query (for the first time) and can't seem to get any results. There are a couple of things I need to know:
- How do I request just a part of an article?
- How do I get the HTML of the requested article?
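One approach that may cover both points is the API's `action=parse` module, which returns rendered HTML for a page and accepts a `section` number to limit the output. The sketch below only builds the request URL and reads the HTML out of a decoded response. The helper names `buildParseUrl` and `extractParsedHtml` are hypothetical, and the `https` endpoint is an assumption, since the code above uses `http`.

```php
<?php
// Hypothetical helper (not part of the code above): builds an
// action=parse request URL for the MediaWiki API.
function buildParseUrl($title, $section = null)
{
    $params = array(
        'action' => 'parse',
        'format' => 'json',
        'page'   => $title, // article title, e.g. "Aix-les-Bains"
        'prop'   => 'text', // ask only for the rendered HTML
    );
    if ($section !== null) {
        $params['section'] = $section; // 0 is the lead section
    }
    // Endpoint is an assumption; the original code uses http://
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params, '', '&');
}

// Reads the HTML from a decoded response. In the default JSON format
// the markup sits under parse -> text -> '*'.
function extractParsedHtml(array $response)
{
    return isset($response['parse']['text']['*'])
        ? $response['parse']['text']['*']
        : null;
}
```

The URL from `buildParseUrl($article, 0)` could be passed to the existing `curl_setopt($s, CURLOPT_URL, ...)` call, and the result of `Zend_Json::decode($rs)` handed to `extractParsedHtml()`.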
I went through this url on data mining from Wikipedia. It gave me the idea of making a second request to the Wikipedia API, with the retrieved wikitext as a parameter, to get the HTML back, although that hasn't worked so far :( I don't want to just grab the whole article as a mess of HTML and dump it.
My application shows locations and cities pinpointed on a map. When you click a city marker, it requests details of that city via AJAX and shows them in an adjacent div. I want to get this information from Wikipedia dynamically. I'll worry about cities that have no article later on; for now I just need to make sure the basic flow is working.
Does anyone know of a good working example that does what I'm looking for, i.e. reading and parsing selected portions of a Wikipedia article?
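As a rough illustration of picking selected portions out of article HTML with XPath, the sketch below returns the text of the first few non-empty paragraphs. It assumes the HTML has already been fetched and that PHP's DOM extension is available. The function name `firstParagraphs` is hypothetical, and the `//p` query is only a guess at what "selected portions" might mean here.

```php
<?php
// Hypothetical helper: returns the text of the first $limit non-empty
// <p> elements found in an HTML fragment.
function firstParagraphs($html, $limit = 2)
{
    $doc = new DOMDocument();
    // Real-world article markup is rarely clean; collect parser warnings quietly.
    libxml_use_internal_errors(true);
    // The XML prolog is a common trick to make loadHTML() treat input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $paragraphs = array();
    foreach ($xpath->query('//p') as $node) {
        $text = trim($node->textContent);
        if ($text !== '') {
            $paragraphs[] = $text;
        }
        if (count($paragraphs) >= $limit) {
            break;
        }
    }
    return $paragraphs;
}
```

Infoboxes and image captions are usually not inside `<p>` elements, so a paragraph query like this tends to skip them, though that depends on the article's markup.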
According to the url provided, I should POST the wikitext to the Wikipedia API endpoint for it to return parsed HTML. The issue is that when I POST the wikitext I get no parsed response, only an error saying that I'm denied access. If I send the same wikitext as a GET parameter, it parses with no issue, but that of course fails when I have far too much text to parse.
Is this a problem with the Wikipedia API? I've been hacking at it for two days now with no luck at all :(
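For reference, a POST of raw wikitext to `action=parse` could be sketched roughly as below. Two details are assumptions about what might be behind the denial, not confirmed causes: Wikimedia asks clients to send a descriptive `User-Agent`, and libcurl adds an `Expect: 100-continue` header to larger POST bodies, which the empty `Expect:` header suppresses. The function names and the `https` endpoint are hypothetical choices for this sketch.

```php
<?php
// Hypothetical helper: URL-encodes the POST body for action=parse.
function buildWikitextPostBody($wikitext)
{
    return http_build_query(array(
        'action'       => 'parse',
        'format'       => 'json',
        'prop'         => 'text',
        'contentmodel' => 'wikitext',
        'text'         => $wikitext,
    ), '', '&');
}

// Hypothetical helper: sends the wikitext and returns the rendered HTML,
// or null if the request or the decoding fails.
function postWikitext($wikitext, $userAgent)
{
    $s = curl_init('https://en.wikipedia.org/w/api.php'); // endpoint is an assumption
    curl_setopt($s, CURLOPT_POST, true);
    curl_setopt($s, CURLOPT_POSTFIELDS, buildWikitextPostBody($wikitext));
    curl_setopt($s, CURLOPT_USERAGENT, $userAgent);       // assumption: a missing UA may be rejected
    curl_setopt($s, CURLOPT_HTTPHEADER, array('Expect:')); // assumption: suppress "Expect: 100-continue"
    curl_setopt($s, CURLOPT_RETURNTRANSFER, true);
    $raw = curl_exec($s);
    curl_close($s);
    if ($raw === false) {
        return null;
    }
    $data = json_decode($raw, true);
    return isset($data['parse']['text']['*']) ? $data['parse']['text']['*'] : null;
}
```

A call might look like `postWikitext($articleText, 'CityMapDemo/0.1 (me@example.com)')`, where the user agent string is a placeholder.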