views: 1226
answers: 4

Hi guys, I'm still stuck on my problem of trying to parse articles from Wikipedia. Specifically, I want to parse the infobox section of Wikipedia articles: my application has references to countries, and on each country page I would like to show the infobox from the corresponding Wikipedia article for that country. I'm using PHP here - I would greatly appreciate any code snippets or advice on what I should be doing.

Thanks again.


EDIT

Well, I have a DB table with the names of countries, and I have a script that takes a country and shows its details. I would like to grab the infobox - the blue box with all the country details, images, etc. - exactly as it appears on the corresponding Wikipedia article and show it on my page. I would like a really simple and easy way to do that, or a script that just downloads the infobox information to a local or remote system that I could access myself later on. I mean, I'm open to ideas here - the end result I want is to see the infobox on my page, of course with a little "Content by Wikipedia" link at the bottom :)


EDIT

I think I found what I was looking for on http://infochimps.org - they have loads of datasets, I think in YAML format. I could use this information straight up as it is, but I would need a way to update it from Wikipedia every now and then, although I believe infoboxes rarely change, especially for countries, unless some nation decides to change its capital city or the like.

A: 

I suggest performing a WebRequest against Wikipedia. From there you will have the page, and you can simply parse or query out the data that you need using a regex, a character crawl, or some other technique you are familiar with. Essentially a screen scrape!
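For a rough idea of what such a scrape could look like in PHP, here is a minimal sketch that fetches a country's article HTML and pulls out the first table whose class contains "infobox". The class name, user agent, and lack of caching are illustrative assumptions - in practice you would store the result locally, as discussed below.

    <?php
    // Sketch: fetch the rendered article HTML and extract the infobox table.
    // Assumes the infobox is a <table> whose class attribute contains "infobox".
    function fetch_infobox_html($country)
    {
        $url = 'https://en.wikipedia.org/wiki/' . rawurlencode($country);
        $context = stream_context_create([
            'http' => ['header' => "User-Agent: MyCountryApp/1.0\r\n"],  // placeholder UA
        ]);
        $html = file_get_contents($url, false, $context);
        if ($html === false) {
            return null;
        }

        $doc = new DOMDocument();
        libxml_use_internal_errors(true);   // Wikipedia's HTML is not strict XML
        $doc->loadHTML($html);
        libxml_clear_errors();

        $xpath = new DOMXPath($doc);
        $nodes = $xpath->query('//table[contains(@class, "infobox")]');
        if ($nodes->length === 0) {
            return null;
        }

        // Return the infobox table as an HTML fragment for embedding in your page.
        return $doc->saveHTML($nodes->item(0));
    }

    echo fetch_infobox_html('France');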

Andrew Siemer
This is a waste of resources.
Matthew Flaschen
Sorry - I can see what you mean by a huge waste of resources. I didn't mean to scrape the page every time someone on your site needed to look at it. I would think that you would scrape it offline (if you chose to do so) and store that in a local DB on your application's end (way more efficient for all parties involved). Didn't mean to attract flames! :P
Andrew Siemer
@Andrew - I'm open to all possibilities, however I'm not sure where to begin. Is there any kind of working code I can look at to get started on this?
Ali
@Ali - I searched for "C# webrequest screen scrape" on Google, which found loads of examples. This one should show the basics: http://www.eggheadcafe.com/community/aspnet/2/2297/screen-scraping-using-htt.aspx
Andrew Siemer
+2  A: 

I suggest you use DBpedia instead, which has already done the work of turning the data in Wikipedia into usable, linkable, open forms.
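To give a feel for what "using DBpedia" can mean in practice, here is a minimal PHP sketch that asks DBpedia's public SPARQL endpoint for a couple of infobox-derived facts about a country. The dbo: property names and the resource URI convention (spaces become underscores) are assumptions about DBpedia's ontology, so adjust them to the fields you actually need.

    <?php
    // Sketch: query DBpedia's SPARQL endpoint for infobox-derived country facts.
    function dbpedia_country_facts($country)
    {
        // DBpedia resource names mirror Wikipedia titles, with underscores for spaces.
        $resource = 'http://dbpedia.org/resource/' . str_replace(' ', '_', $country);

        $query =
            'PREFIX dbo: <http://dbpedia.org/ontology/> ' .
            'SELECT ?capital ?population WHERE { ' .
            '  <' . $resource . '> dbo:capital ?capital ; dbo:populationTotal ?population . ' .
            '} LIMIT 1';

        $url = 'https://dbpedia.org/sparql?' . http_build_query([
            'query'  => $query,
            'format' => 'application/sparql-results+json',
        ]);

        $json = file_get_contents($url);
        if ($json === false) {
            return null;
        }
        $data = json_decode($json, true);
        return isset($data['results']['bindings'][0]) ? $data['results']['bindings'][0] : null;
    }

    print_r(dbpedia_country_facts('France'));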

dajobe
This seems very promising - how do I actually use this though?
Ali
Probably start at http://linkeddata.org/tools for pointers to linked data tools. There are demos nearby too. If you just want the data, it's in the DBpedia download area: http://wiki.dbpedia.org/Downloads32
dajobe
+2  A: 

It depends which route you want to go down. Here are some possibilities:

  1. Install MediaWiki with appropriate modifications. It is, after all, a PHP app designed precisely to parse wikitext...
  2. Download the static HTML version, and parse out the parts you want.
  3. Use the Wikipedia API with appropriate caching (see the sketch below).

DO NOT just hit the latest version of the live page and redo the parsing every time your app wants the box. This is a huge waste of resources for both you and Wikimedia.
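As a concrete illustration of option 3, here is a minimal PHP sketch that fetches an article's wikitext through the MediaWiki API and caches it on disk, so Wikipedia is hit at most once per day per country. The cache directory, TTL, and user agent are placeholders, not part of the API.

    <?php
    // Sketch: get an article's wikitext via the MediaWiki API, with a simple file cache.
    function get_country_wikitext($title, $cacheDir = '/tmp/wiki-cache', $ttl = 86400)
    {
        $cacheFile = $cacheDir . '/' . md5($title) . '.txt';
        if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
            return file_get_contents($cacheFile);          // serve from the cache
        }

        $url = 'https://en.wikipedia.org/w/api.php?' . http_build_query([
            'action' => 'query',
            'prop'   => 'revisions',
            'rvprop' => 'content',
            'format' => 'json',
            'titles' => $title,
        ]);
        $context = stream_context_create([
            'http' => ['header' => "User-Agent: MyCountryApp/1.0\r\n"],  // placeholder UA
        ]);
        $json = file_get_contents($url, false, $context);
        if ($json === false) {
            return null;
        }

        $data = json_decode($json, true);
        $page = current($data['query']['pages']);
        $wikitext = isset($page['revisions'][0]['*']) ? $page['revisions'][0]['*'] : null;

        if ($wikitext !== null) {
            if (!is_dir($cacheDir)) {
                mkdir($cacheDir, 0777, true);
            }
            file_put_contents($cacheFile, $wikitext);      // refresh the cache
        }
        return $wikitext;
    }

    // The wikitext contains the {{Infobox country ...}} template call, which you
    // can then extract and map to your own fields or render.
    $text = get_country_wikitext('France');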

Matthew Flaschen
+2  A: 

If you want to parse all the articles in one go, Wikipedia makes all of its articles available as XML dumps:

http://en.wikipedia.org/wiki/Wikipedia_database
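If you take the dump route, something along these lines could stream the (decompressed) pages-articles dump with XMLReader and keep only the wikitext of the countries you care about; the dump filename, country list, and output files are placeholders.

    <?php
    // Sketch: stream a decompressed pages-articles dump and save the wikitext
    // of just the pages whose titles match your country table.
    $countries = ['France', 'Germany', 'Japan'];   // e.g. loaded from your DB
    $wanted    = array_flip($countries);

    $reader = new XMLReader();
    $reader->open('enwiki-latest-pages-articles.xml');   // placeholder filename

    $title = null;
    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue;
        }
        if ($reader->name === 'title') {
            $title = $reader->readString();              // remember the current page title
        } elseif ($reader->name === 'text' && isset($wanted[$title])) {
            // Wikitext of a wanted page; the {{Infobox country ...}} block is inside it.
            file_put_contents('infobox-' . str_replace(' ', '_', $title) . '.txt',
                              $reader->readString());
        }
    }
    $reader->close();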

Otherwise, you can screen scrape individual articles, e.g.

Actually I would like to grab just the infoboxes from a select list of articles.
Ali