views:

391

answers:

4

Sorry guys, I've been running amok asking questions about how to integrate Wikipedia data into my application, and frankly I don't think I've had any success on my end: I've been trying all the ideas and kind of giving up whenever I hit a dead end or obstacle. I'll try to explain exactly what I am trying to do here.

I have a simple directory of locations like cities and countries. My application is a simple PHP/AJAX application with a search and browse facility. People sign up and associate themselves with a city, and when a user browses cities, he/she can see the people and companies in that city, i.e. whoever is part of our system.

That part was easy enough to set up on its own and is working fine. The issue is the format of my search results. Say someone searches for Beijing; the results would come back in a three-tabbed interface box:

  1. The first tab would have an infobox containing city information for Beijing
  2. The second would be a country tab holding an infobox with country information for China
  3. The third tab would list all the contacts in Beijing

The content for the first two tabs should come from Wikipedia. Now I'm totally lost as to the best way to get this done, and furthermore, once I decide on a methodology, how to implement it so that it's reasonably robust.

A few ideas, good and bad, that I've been able to digest so far:

  1. Run a cURL request directly to Wikipedia and parse the returned data every time a search is made. There is no need to maintain a local copy of the Wikipedia data in this case. The downside is that it's wholly reliant on data from a remote third party, and I doubt it's feasible to hit Wikipedia on every request just to retrieve basic information. Plus, considering that the Wikipedia data has to be parsed on every request, that's going to amount to heavy server load... or am I just speculating here?

  2. Download the Wikipedia dump and query that. Well, I've downloaded the entire database, but it's going to take forever to import all the tables from the XML dump. Plus, considering that I just want to extract a list of countries and cities and their infoboxes, a lot of the information in the dump is of no use to me.

  3. Make my own local tables and create a cron script (I'll explain why a cron job below) that would somehow parse all the country and city pages on Wikipedia and convert them to a format I can use in my tables. Honestly speaking, I don't need all of the information in the infoboxes; in fact, if I could just get the basic markup of the infoboxes as-is, that would be more than enough for me. Like:

Title of Country | Infobox Raw text

I can personally extract stuff like coordinates and other details myself if I want. A sketch of this fetch-and-extract step follows this list.
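
To make idea 3 concrete, here is a minimal sketch of that fetch-and-extract step. It assumes the standard MediaWiki query API on en.wikipedia.org; the brace-counting infobox extractor is deliberately naive (it only looks for the first {{Infobox ...}} template and will miss edge cases), so treat it as a starting point rather than a finished parser:

    <?php
    // Fetch the raw wikitext of one article through the MediaWiki API.
    function fetch_wikitext($title) {
        $url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions'
             . '&rvprop=content&format=json&titles=' . urlencode($title);
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        // Wikipedia asks for a descriptive User-Agent; this one is hypothetical.
        curl_setopt($ch, CURLOPT_USERAGENT, 'CityDirectoryBot/1.0 (admin@example.com)');
        $json = curl_exec($ch);
        curl_close($ch);
        $data = json_decode($json, true);
        $page = current($data['query']['pages']);
        return isset($page['revisions'][0]['*']) ? $page['revisions'][0]['*'] : null;
    }

    // Cut out the first {{Infobox ...}} block by counting brace pairs.
    function extract_infobox($wikitext) {
        $start = stripos($wikitext, '{{Infobox');
        if ($start === false) return null;
        $depth = 0;
        for ($i = $start, $len = strlen($wikitext); $i < $len - 1; $i++) {
            $pair = substr($wikitext, $i, 2);
            if ($pair === '{{') { $depth++; $i++; }
            elseif ($pair === '}}') {
                $depth--; $i++;
                if ($depth === 0) return substr($wikitext, $start, $i - $start + 1);
            }
        }
        return null; // unbalanced braces
    }

    // "Title of Country | Infobox raw text", as described above:
    echo 'Beijing | ' . extract_infobox(fetch_wikitext('Beijing'));
    ?>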

I even tried downloading third-party datasets from Infochimps and DBpedia, but the Infochimps dataset is incomplete and didn't contain all the information I wanted to display. As for DBpedia, I have absolutely no idea what to do with the CSV file of infoboxes I downloaded, and I'm afraid it might not be complete either.

But that is just part of the issue here. I want a way to show the Wikipedia information: I'll have all the links point back to Wikipedia, with the Wikipedia content displayed nicely throughout. BUT I also need a way to periodically update the information I hold from Wikipedia, so that at least I don't end up with totally outdated data. For example, a system that can check whether a new country or location has appeared and, if so, parse and retrieve its information. I'm relying on Wikipedia's categories of countries and cities for this, but frankly all these ideas are on paper or partially coded, and it's a huge mess.
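
For that category-based discovery step, a sketch of walking a category through the same API with list=categorymembers; the category name below is only an example, and the continuation parameter format has changed between MediaWiki versions, so verify it against the live API:

    <?php
    // Page through a Wikipedia category to build the list of titles to fetch.
    function category_members($category) {
        $titles = array();
        $cont = null;
        $ctx = stream_context_create(array('http' => array(
            'user_agent' => 'CityDirectoryBot/1.0 (admin@example.com)'))); // hypothetical UA
        do {
            $url = 'https://en.wikipedia.org/w/api.php?action=query&format=json'
                 . '&list=categorymembers&cmlimit=500'
                 . '&cmtitle=' . urlencode($category)
                 . ($cont ? '&cmcontinue=' . urlencode($cont) : '');
            $data = json_decode(file_get_contents($url, false, $ctx), true);
            foreach ($data['query']['categorymembers'] as $m) {
                $titles[] = $m['title'];
            }
            // Continuation token, present while more results remain.
            $cont = isset($data['continue']['cmcontinue'])
                  ? $data['continue']['cmcontinue'] : null;
            sleep(1); // throttle, out of politeness
        } while ($cont);
        return $titles;
    }

    print_r(category_members('Category:Capitals in Asia')); // example category
    ?>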

I'm programming in PHP and MySQL, and my deadline is fast approaching. Given the above situation and requirements, what is the best and most practical method to follow and implement? I am totally open to ideas, and if anyone has done something similar, I would love to hear practical examples :D

+2  A: 

A couple things I can think of:

  1. Just display the Wikipedia data in an iframe on your site.

  2. Use cURL to get the HTML from Wikipedia, then use a custom stylesheet to style it and/or hide the parts you don't want displayed.

Trying to actually parse the HTML and pull out the pieces you want is going to be a giant pain, and will most likely have to be customized for each city. You're better off getting something simple working now and then going back and improving it later if you decide you really need to.
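
As an illustration of suggestion 2, a minimal sketch assuming the MediaWiki action=parse API (which returns the rendered article body as HTML) plus a wrapper div that your own stylesheet can target:

    <?php
    $url = 'https://en.wikipedia.org/w/api.php?action=parse&format=json&page='
         . urlencode('Beijing');
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'CityDirectoryBot/1.0 (admin@example.com)'); // hypothetical UA
    $data = json_decode(curl_exec($ch), true);
    curl_close($ch);

    // Serve the rendered article inside your own wrapper; a stylesheet of
    // yours can then hide everything except the infobox, e.g.:
    //   .wiki-embed > * { display: none; }
    //   .wiki-embed .infobox { display: table; }
    echo '<div class="wiki-embed">' . $data['parse']['text']['*'] . '</div>';
    ?>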

Eric Petroelje
You could propose your two solutions as two answers so it is easier to vote them up.
blntechie
I doubt the iframe idea would work, as I need to display just the infobox part of an article in the search results for that city or location. I've heard that Wikipedia restricts external requests: how true is that, and what kind of restrictions does Wikipedia place on grabbing information in this manner?
Ali
+2  A: 

How about using one of the Wikipedia Geocoding Webservices?

There are several available where you can pass in, e.g., a postal code and a country, and get back a short article summary and a link to the Wikipedia article.

If that would be enough.
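
As a sketch of how such a call might look, assuming the GeoNames Wikipedia web service is the kind of service meant ('demo' is only a placeholder username; GeoNames requires a free account):

    <?php
    // One call returns a short article summary plus the Wikipedia link.
    $url = 'http://api.geonames.org/wikipediaSearchJSON?maxRows=1&username=demo'
         . '&q=' . urlencode('Beijing');
    $data = json_decode(file_get_contents($url), true);
    if (!empty($data['geonames'])) {
        $hit = $data['geonames'][0];
        echo $hit['title'] . ': ' . $hit['summary'] . "\n"
           . $hit['wikipediaUrl'];
    }
    ?>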

jitter
+1  A: 

I'd suggest the following:

  • Query the city from Wikipedia when it (the city) is created in your DB
  • Parse the data and store a local copy along with the timestamp of the last update
  • On access, update the data if necessary. You can display the old copy with a watermark saying it is ... days old and now updating, then switch to the freshly acquired one when the update is done. You've said you are using AJAX, so it won't be a problem

This would minimize the queries to Wikipedia, and your service won't show empty pages even when Wikipedia is unreachable.
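
A rough sketch of that cache-and-refresh flow; the wiki_cache table and the fetch_infobox_from_wikipedia() helper are assumptions (the helper could be, e.g., the API fetch sketched in the question above):

    <?php
    // Assumed table:
    //   CREATE TABLE wiki_cache (title VARCHAR(255) PRIMARY KEY,
    //                            infobox TEXT, updated_at DATETIME);
    define('MAX_AGE_DAYS', 30); // refresh threshold, tune to taste

    function get_infobox(PDO $db, $title) {
        $stmt = $db->prepare('SELECT infobox, updated_at FROM wiki_cache WHERE title = ?');
        $stmt->execute(array($title));
        $row = $stmt->fetch(PDO::FETCH_ASSOC);

        $stale = !$row
              || strtotime($row['updated_at']) < strtotime('-' . MAX_AGE_DAYS . ' days');
        if ($stale) {
            $fresh = fetch_infobox_from_wikipedia($title); // hypothetical network helper
            if ($fresh !== null) {
                $db->prepare('REPLACE INTO wiki_cache (title, infobox, updated_at)
                              VALUES (?, ?, NOW())')
                   ->execute(array($title, $fresh));
                return $fresh;
            }
        }
        // Serve the stored copy (possibly stale) when Wikipedia is unreachable.
        return $row ? $row['infobox'] : null;
    }
    ?>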

Csaba Kétszeri
Actually, the thing is that I would need a list of all the cities available on Wikipedia to begin with; I'm counting on using the categories that group cities on one page. But the question is how I can extract selected information from Wikipedia. When you say "query the city", how do I do that, considering we're talking about thousands of entries?
Ali
All the answers have been great. I'm using your concept: I'm retrieving XML exports of the selected articles I want and running a script that parses the infobox out of the exported files (roughly along the lines of the sketch below). So far it's working like a charm. Thanks, everybody!
Ali
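
For anyone following the same route, a rough sketch of that export-parsing step; the filename and the extract_infobox() helper (sketched earlier in the question) are assumptions. DOMDocument's getElementsByTagName sidesteps the export schema's default namespace:

    <?php
    $dom = new DOMDocument();
    $dom->load('export.xml'); // e.g. a file saved from Special:Export

    foreach ($dom->getElementsByTagName('page') as $page) {
        $title = $page->getElementsByTagName('title')->item(0)->nodeValue;
        $text  = $page->getElementsByTagName('text')->item(0)->nodeValue;
        if (($infobox = extract_infobox($text)) !== null) {
            echo $title . ' | ' . $infobox . "\n"; // "Title | Infobox raw text"
        }
    }
    ?>
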
A: 

Have a look at DBPedia; it contains nice extractions of Wikipedia data in CSV format.
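
Since the question mentions being unsure what to do with the DBpedia CSV, a hedged sketch of reading such a file in PHP; the filename and the subject/property/value column layout are illustrative only and should be checked against the actual release:

    <?php
    // Stream the CSV with fgetcsv and keep only the rows you need.
    $wanted = array('Beijing' => 1, 'China' => 1); // example filter
    $fh = fopen('infoboxes.csv', 'r');             // example filename
    while (($row = fgetcsv($fh)) !== false) {
        list($subject, $property, $value) = $row;  // assumed column order
        if (isset($wanted[$subject])) {
            echo "$subject | $property = $value\n";
        }
    }
    fclose($fh);
    ?>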

WeShallOvercome