Sorry guys, I've been running around asking questions about how to integrate Wikipedia data into my application, and frankly I don't think I've had much success. I've been trying all the ideas and kind of giving up whenever I hit a dead end or obstacle. I'll try to explain exactly what I'm trying to do here.
I have a simple directory of locations like cities and countries. My application is a simple PHP/AJAX application with a search and browse facility. People sign up and associate themselves with a city, and when a user browses cities, he/she can see the people and companies in that city, i.e. whoever is part of our system.
That part was easy enough to set up on its own and is working fine. The thing is the format my search results need to take. Say someone searches for Beijing; it should return a three-tabbed interface box:
- First tab would have an infobox containing city information for Beijing
- Second would be a country tab holding an infobox of country information for China
- Third tab would have listings of all contacts in Beijing.
The content for the first two tabs should come from Wikipedia. Now I'm totally lost as to the best way to get this done, and furthermore, once I decide on a methodology, how do I implement it so that it's reasonably robust?
A couple of ideas, good and bad, that I have been able to digest so far:
Run a cURL request directly to Wikipedia and parse the returned data every time a search is made. There is no need to maintain a local copy of the Wikipedia data in this case. The issue is that it's wholly reliant on data from a remote third party, and I doubt it's feasible to make a request to Wikipedia every time just to retrieve basic information. Plus, considering that the data from Wikipedia has to be parsed on every request, that's going to add up to heavy server load... or am I just speculating here? A rough sketch of what I mean is below.
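This is roughly what I had in mind for the per-search request (untested sketch; I'm assuming the standard MediaWiki api.php endpoint, and the exact JSON layout is a guess I'd still need to verify against a real response):

```php
<?php
// Rough sketch of the per-search cURL approach: fetch the raw wikitext for one title.
function fetch_wikitext($title)
{
    $url = 'https://en.wikipedia.org/w/api.php?' . http_build_query(array(
        'action'    => 'query',
        'prop'      => 'revisions',
        'rvprop'    => 'content',
        'redirects' => 1,
        'format'    => 'json',
        'titles'    => $title,
    ));

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'MyCityDirectory/0.1 (testing)');
    $json = curl_exec($ch);
    curl_close($ch);

    if ($json === false) {
        return null; // remote call failed -- exactly the reliability worry above
    }

    $data = json_decode($json, true);
    if (empty($data['query']['pages'])) {
        return null;
    }
    foreach ($data['query']['pages'] as $page) {
        if (isset($page['revisions'][0]['*'])) {
            return $page['revisions'][0]['*']; // raw wikitext of the latest revision
        }
    }
    return null;
}

// e.g. $wikitext = fetch_wikitext('Beijing');
```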
Download the Wikipedia dump and query that. Well, I've downloaded the entire database, but it's going to take forever to import all the tables from the XML dump. Plus, considering that I just want to extract a list of countries and cities and their infoboxes, a lot of the information in the dump is of no use to me. A sketch of how I'd filter the dump instead of importing it all is below.
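If I go the dump route, my thinking was to stream through the pages-articles XML with XMLReader and keep only the pages I actually need (again an untested sketch; the file name and the $wanted list are placeholders):

```php
<?php
// Rough sketch: filter the dump rather than importing everything into MySQL.
$wanted = array('Beijing' => true, 'China' => true); // would come from my own tables

$reader = new XMLReader();
$reader->open('enwiki-latest-pages-articles.xml');

$dom = new DOMDocument();
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'page') {
        // Expand just this <page> node into a DOM fragment.
        $page  = $dom->importNode($reader->expand(), true);
        $title = $page->getElementsByTagName('title')->item(0)->nodeValue;

        if (isset($wanted[$title])) {
            $textNode = $page->getElementsByTagName('text')->item(0);
            $wikitext = $textNode ? $textNode->nodeValue : '';
            // ...store $title / $wikitext in my own table here...
        }
    }
}
$reader->close();
```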
Make my own local tables and create a cron script [I'll explain why a cron job below] that would somehow parse all the country and city pages on Wikipedia and convert them into a format I can use in my tables. Honestly, I don't need all of the information in the infoboxes; in fact, if I could just get the basic markup of the infoboxes as-is, that would be more than enough for me (see the extraction sketch after the example below). Like:
Title of Country | Infobox Raw text
I can personally extract stuff like coordinates and other details if I want.
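For pulling just the raw infobox out of a page's wikitext, a crude brace-counting approach seems like enough (rough sketch; the `wiki_infoboxes` table and the $pdo connection are placeholders for whatever I end up using):

```php
<?php
// Rough sketch: grab the raw {{Infobox ...}} block out of a page's wikitext.
function extract_infobox($wikitext)
{
    $start = stripos($wikitext, '{{Infobox');
    if ($start === false) {
        return null;
    }
    // Walk forward counting {{ and }} so nested templates inside the infobox
    // don't cut the match short.
    $depth = 0;
    $len   = strlen($wikitext);
    for ($i = $start; $i < $len - 1; $i++) {
        $pair = substr($wikitext, $i, 2);
        if ($pair === '{{') {
            $depth++;
            $i++;
        } elseif ($pair === '}}') {
            $depth--;
            $i++;
            if ($depth === 0) {
                return substr($wikitext, $start, $i - $start + 1);
            }
        }
    }
    return null; // unbalanced braces -- better to store nothing than junk
}

// Storing "Title of Country | Infobox Raw text" exactly as described above:
// $stmt = $pdo->prepare('REPLACE INTO wiki_infoboxes (title, infobox_raw) VALUES (?, ?)');
// $stmt->execute(array($title, extract_infobox($wikitext)));
```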
I even tried downloading third-party datasets from Infochimps and DBpedia, but the Infochimps dataset is incomplete and doesn't contain all the information I want to display. As for DBpedia, I have absolutely no idea what to do with the CSV file of infoboxes I downloaded, and I'm afraid it might not be complete either.
But that is just part of the issue here. I want a way to show the Wikipedia information, with all the links pointing back to Wikipedia and a nice summary from Wikipedia displayed properly throughout. BUT I also need a way to periodically update the information I have from Wikipedia, so that at least I don't end up with totally outdated data. Say, a system that can check whether there is a new country or a new location, parse its page, and somehow retrieve the information. I'm relying on Wikipedia's categories of countries and cities for this, but frankly all of these ideas are on paper or only partially coded, and it's a huge mess. A sketch of the category check I had in mind is below.
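This is the cron script I had sketched out: ask the API for every page in a category, diff that against what my table already knows, and queue anything new for fetching (untested; the category name is just an example and the continuation handling is simplified, so both would need checking against the real API):

```php
<?php
// Rough sketch of the periodic category check.
function list_category_members($category)
{
    $titles = array();
    $params = array(
        'action'   => 'query',
        'list'     => 'categorymembers',
        'cmtitle'  => $category,
        'cmlimit'  => 500,
        'format'   => 'json',
        'continue' => '',
    );

    do {
        $url  = 'https://en.wikipedia.org/w/api.php?' . http_build_query($params);
        $data = json_decode(file_get_contents($url), true);

        if (!empty($data['query']['categorymembers'])) {
            foreach ($data['query']['categorymembers'] as $member) {
                $titles[] = $member['title'];
            }
        }

        // The API returns a 'continue' element telling us what to send back
        // to get the next batch; merge it into the next request.
        $more = isset($data['continue']);
        if ($more) {
            $params = array_merge($params, $data['continue']);
        }
    } while ($more);

    return $titles;
}

// In the cron job:
// $known   = titles already in my local tables
// $current = list_category_members('Category:Capitals in Asia'); // example category
// $new     = array_diff($current, $known); // fetch + parse only these
```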
I'm programming in PHP and MySQL, and my deadline is fast approaching. Given the above situation and requirements, what is the best and most practical method to follow and implement? I'm totally open to ideas, and if anyone has done something similar and has practical examples, I would love to hear them :D