views: 881 · answers: 4
I'm using cURL to retrieve information from Wikipedia. So far I've been successful in retrieving basic text, but I'd really like to retrieve it as HTML.

Here is my code:

$s = curl_init();

// First request: find the matching Wikipedia article via Yahoo BOSS site search.
$url = 'http://boss.yahooapis.com/ysearch/web/v1/site:en.wikipedia.org+'.$article_name.'?appid=myID';
curl_setopt($s, CURLOPT_URL, $url);
curl_setopt($s, CURLOPT_HEADER, false);
curl_setopt($s, CURLOPT_RETURNTRANSFER, true);

$rs = curl_exec($s);
$rs = Zend_Json::decode($rs);
$rs = $rs['ysearchresponse']['resultset_web'];

// Take the first search result and reduce its URL to the article title.
$rs = array_shift($rs);
$article = str_replace('http://en.wikipedia.org/wiki/', '', $rs['url']);

// Second request: fetch the article's wikitext from the MediaWiki API.
$url  = 'http://en.wikipedia.org/w/api.php?';
$url .= 'format=json';
$url .= sprintf('&action=query&titles=%s&rvprop=content&prop=revisions&redirects=1', urlencode($article));

curl_setopt($s, CURLOPT_URL, $url);
curl_setopt($s, CURLOPT_HEADER, false);
curl_setopt($s, CURLOPT_RETURNTRANSFER, true);

$rs = curl_exec($s);
//curl_close($s);
$rs = Zend_Json::decode($rs);

// Unwrap query -> pages -> <pageid>, then take the first revision.
// (array_pop() takes its argument by reference, so it needs a variable
// rather than a function result.)
$query = array_pop($rs);
$pages = array_pop($query);
$page  = array_pop($pages);
$revision = array_shift($page['revisions']);
$articleText = $revision['*'];
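As an aside, interpolating `$article` into the query string by hand breaks for titles containing characters that need percent-encoding. A sketch of building the same request URL with `http_build_query`, which handles the escaping (the article title here is just an example value):

```php
<?php
// Build the MediaWiki API query URL with proper percent-encoding;
// http_build_query() escapes each parameter value for us.
$article = 'Aix-les-Bains';

$params = array(
    'format'    => 'json',
    'action'    => 'query',
    'titles'    => $article,
    'rvprop'    => 'content',
    'prop'      => 'revisions',
    'redirects' => 1,
);
$url = 'http://en.wikipedia.org/w/api.php?' . http_build_query($params);

echo $url;
```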

However, the text retrieved this way isn't fit for display :( it all comes back as raw wiki markup, in this kind of format:

'''Aix-les-Bains''' is a [[Communes of France|commune]] in the [[Savoie]] [[Departments of France|department]] in the [[Rhône-Alpes]] [[regions of France|region]] in southeastern [[France]].

It lies near the [[Lac du Bourget]], {{convert|9|km|mi|abbr=on}} by rail north of [[Chambéry]].

==History== ''Aix'' derives from [[Latin]] ''Aquae'' (literally, "waters"; ''cf'' [[Aix-la-Chapelle]] (Aachen) or [[Aix-en-Provence]]), and Aix was a bath during the [[Roman Empire]], even before it was renamed ''Aquae Gratianae'' to commemorate the [[Emperor Gratian]], who was assassinated not far away, in [[Lyon]], in [[383]]. Numerous Roman remains survive. [[Image:IMG 0109 Lake Promenade.jpg|thumb|left|Lac du Bourget Promenade]]

How do I get the HTML of the wikipedia article?
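For reference, the MediaWiki API can also return rendered HTML directly via `action=parse`; with `format=json` the HTML sits under `parse.text.*`. A sketch of pulling it out of the response body (the network call is shown commented, since it follows the same cURL pattern as above):

```php
<?php
// action=parse returns rendered HTML under parse.text.*
// Extract it from the JSON body; return null if the shape is unexpected.
function extractParsedHtml($json) {
    $data = json_decode($json, true);
    if (isset($data['parse']['text']['*'])) {
        return $data['parse']['text']['*'];
    }
    return null;
}

// Network part, same cURL pattern as in the question:
// $s = curl_init('http://en.wikipedia.org/w/api.php?action=parse&page=Aix-les-Bains&format=json');
// curl_setopt($s, CURLOPT_RETURNTRANSFER, true);
// $html = extractParsedHtml(curl_exec($s));
```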


UPDATE: Thanks, but I'm kind of new to this, and right now I'm trying to run an XPath query [albeit for the first time] and can't seem to get any results. I actually need to know a couple of things here:

  1. How do I request just a part of an article?
  2. How do I get the HTML of the requested article?

I went through a URL on data mining from Wikipedia - it suggested making a second request to the Wikipedia API with the retrieved wikitext as a parameter, which would return the HTML - although that hasn't worked so far :( - and I don't want to just grab the whole article as a mess of HTML and dump it. Basically, what my application does is this: you have some locations and cities pinpointed on a map - you click a city marker and it requests, via Ajax, details of that city to show in an adjacent div. I want to get this information from Wikipedia dynamically. I'll worry about cities that don't have articles later on; right now I just need to make sure it's working.

Does anyone know of a nice working example that does what I'm looking for, i.e. reads and parses through selected portions of a Wikipedia article?


According to the URL provided, I should POST the wikitext to the Wikipedia API location for it to return parsed HTML. The issue is that if I POST the information I get no response, just an error saying I'm denied access - yet if I include the wikitext as GET it parses with no issue. But of course it fails when I have waaaaay too much text to parse.
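On the GET length problem: cURL can carry the wikitext in the POST body instead, where its length is not limited by the URL. A sketch of the POST setup; note the User-Agent line is an assumption worth testing, since some APIs reject requests that send no User-Agent at all, which could explain the "denied access" error:

```php
<?php
// Send the wikitext as POST fields rather than on the URL, so its
// length is not limited by the query string.
$wikitext = "'''Aix-les-Bains''' is a [[Communes of France|commune]]...";

$post = http_build_query(array(
    'action' => 'parse',
    'format' => 'json',
    'text'   => $wikitext,
));

// $s = curl_init('http://en.wikipedia.org/w/api.php');
// curl_setopt($s, CURLOPT_POST, true);
// curl_setopt($s, CURLOPT_POSTFIELDS, $post);
// curl_setopt($s, CURLOPT_RETURNTRANSFER, true);
// curl_setopt($s, CURLOPT_USERAGENT, 'MyCityApp/0.1'); // assumption: UA-less requests may be refused
// $response = curl_exec($s);
```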

Is this a problem with the wikipedia api? Because I've been hacking at it for two days now with no luck at all :(

A: 

As far as I understand it, the Wikipedia software converts the Wiki markup into HTML when the page is requested. So using your current method, you'll need to deal with the results.

A good place to start is the MediaWiki API. You can also use http://pear.php.net/package/Text_Wiki to format the results retrieved via cURL.
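To give a feel for what such a filter does under the hood, here is a toy transformation of two of the constructs in the sample above (bold and internal links). This is only an illustration of the idea, not the Text_Wiki API; the real library's rule set is far more complete:

```php
<?php
// Toy conversion of two wiki constructs: '''bold''' and [[Target|label]] links.
// Illustration only -- use a real filter such as Text_Wiki for actual content.
function toyWikiToHtml($wikitext) {
    // '''bold''' -> <b>bold</b>
    $html = preg_replace("/'''(.+?)'''/", '<b>$1</b>', $wikitext);
    // [[Target|label]] -> label, and [[Target]] -> Target (dropping the link)
    $html = preg_replace('/\[\[[^\]|]+\|([^\]]+)\]\]/', '$1', $html);
    $html = preg_replace('/\[\[([^\]]+)\]\]/', '$1', $html);
    return $html;
}
```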

Robert S.
That link to Text_Wiki isn't working for me, something weird with the underscore?
Matt G
I fixed it. :) Hope that works better.
Robert S.
A: 

Try looking at the printable version of the Wikipedia article in question.

In other words, change this line of your source code:

$url.=sprintf('&action=query&titles=%s&rvprop=content&prop=revisions&redirects=1', $article);

to something like:

$url.=sprintf('&action=query&titles=%s&printable=yes&redirects=1', $article);

Disclaimer: Have not tested, and this is just a guess at how your API might work.

HanClinto
A: 

There is a PEAR Wiki Filter that I have used and it does a very decent job.

Text Wiki

Phil

Phil Carter
It probably won't render Wikipedia's myriad templates correctly, will it? (to do so, you'd either have to have copies of the templates locally, or it would have to fetch them from wikipedia)
Frank Farmer
I know it will do the standard wiki markup; it's handled all the content I've ever put through it, so I couldn't say with authority whether it can do the templates or not. What the OP pasted was wiki markup, and that will be converted.
Phil Carter
What the OP pasted included "{{convert|9|km|mi|abbr=on}}", which is a template call.
Matt G
+5  A: 

The simplest solution would probably be to grab the page itself (e.g. http://en.wikipedia.org/wiki/Combination ) and then extract the content of <div id="content">, potentially with an xpath query.
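A sketch of that extraction with PHP's DOM extension, run here against a stub document (against the live page you would feed it the cURL result instead; the `@` suppresses the warnings that real-world Wikipedia markup tends to trigger):

```php
<?php
// Extract <div id="content"> from a fetched page using DOMXPath.
function extractContentDiv($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);     // @ : real pages rarely validate cleanly
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[@id="content"]');
    if ($nodes->length === 0) {
        return null;
    }
    // Serialize the matched node back to an HTML string.
    return $doc->saveHTML($nodes->item(0));
}
```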

Frank Farmer
Nice idea - but how would I do this? I mean, should I open a socket to the page? Also, I need to get portions of a page and sections, as opposed to a full HTML dump of the content.
Ali