views: 110
answers: 2

Hi,

I have a system that connects to 2 popular APIs. I need to aggregate the data from each into a unified result that can then be paginated. The scope of the project means that the system could end up supporting tens of APIs.

Each API imposes a max limit of 50 results per request.

What is the best way of aggregating this data so that it is reliable, i.e. ordered, with no duplicates, etc.?

I am using the CakePHP framework in a LAMP environment; however, I think this question applies to any programming language.

My approach so far is to query the search API of each provider and then populate a MySQL table. From there the results are ordered, paginated, etc. However, my concern is performance: API communication, parsing, inserting and then reading, all in one execution.
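
To make that concrete, here is roughly what I mean (the provider URLs, JSON field names and the `aggregated_results` table below are just placeholders, not real endpoints or my actual schema):

    <?php
    // Query each provider's search API (max 50 results per request),
    // normalise the rows into one MySQL table, then read back ordered/paginated.
    $providers = [
        'provider_a' => 'https://api.provider-a.example/search?q=%s&limit=50',
        'provider_b' => 'https://api.provider-b.example/search?q=%s&limit=50',
    ];

    $pdo    = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');
    $insert = $pdo->prepare(
        'INSERT INTO aggregated_results (provider, title, price) VALUES (?, ?, ?)'
    );

    foreach ($providers as $name => $template) {
        // Requires allow_url_fopen; curl would work just as well here.
        $json = file_get_contents(sprintf($template, urlencode('search term')));
        foreach (json_decode($json, true)['results'] as $row) {
            $insert->execute([$name, $row['title'], $row['price']]);
        }
    }

    // Ordering and pagination are then plain SQL, independent of how many
    // providers were queried.
    $perPage = 25;
    $offset  = 0;
    $results = $pdo->query(
        "SELECT * FROM aggregated_results ORDER BY price ASC LIMIT $perPage OFFSET $offset"
    )->fetchAll(PDO::FETCH_ASSOC);

The issue is that all of the above happens inside a single request.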

Am I missing something, does anyone have any other ideas? I'm sure this is a common problem with many alternative solutions.

Any help would be greatly appreciated.

Thanks, Paul

A: 

Yes, this is a common problem.

Search SO for questions like http://stackoverflow.com/search?q=%5Bphp%5D+background+processing

Everyone who tries this realizes that calling other sites for data is slow. The first one or two seem quick, but then other sites break (and your app breaks with them), and other sites are slow (and your app is slow too).

You have to disconnect the front-end from the back-end.

Choice 1 - pre-query the data with a background process that simply fetches it and loads the database (a sketch of this follows below).

Choice 2 - start a long-running background process and check back from a JavaScript function to see if it's done yet.

Choice 3 - the user's initial request spawns the background process -- you then email them a link so they can return when the job is done.
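
For choice 1, a minimal sketch, assuming a cron entry like `*/10 * * * * php fetch_results.php` and a `results` table with a unique key on the source URL (all names here are placeholders):

    <?php
    // fetch_results.php -- hypothetical background loader run from cron, so the
    // slow API calls never happen during a page request.
    $pdo  = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');
    $apis = [
        'https://api.provider-a.example/search?q=widgets&limit=50',
        'https://api.provider-b.example/search?q=widgets&limit=50',
    ];

    foreach ($apis as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);   // a slow provider must not stall the run
        $body = curl_exec($ch);
        curl_close($ch);

        if ($body === false) {
            continue;                             // a broken provider must not break the run
        }

        // REPLACE relies on the unique key on source_url to avoid duplicates.
        $stmt = $pdo->prepare(
            'REPLACE INTO results (source_url, title, fetched_at) VALUES (?, ?, NOW())'
        );
        foreach (json_decode($body, true)['results'] as $row) {
            $stmt->execute([$row['url'], $row['title']]);
        }
    }

The page request then only ever reads from the `results` table, so the front-end stays fast no matter how many providers you add or how flaky they are.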

S.Lott
A: 

I have a site doing just that with over 100 RSS/Atom feeds; this is what I do:

  1. I have a list of feeds and a cron job that iterates over them, about 5 feeds a minute, which means I cycle through all the feeds every 20 minutes or so.
  2. I fetch the feed and try to insert each entry into the database, using the URL as a unique field; if the URL already exists, I do not insert (see the sketch after this list). The entry date is my current system clock and is set by my application, as date fields in RSS cannot be trusted and, in some cases, can't even be parsed.
  3. For some feeds, and only experience can tell you which, I also check for duplicate titles, since some websites change their URLs for their own reasons.
  4. The items are now all in the same database table, ready to be queried.
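
For example, one way to express step 2 is a unique index on the URL plus INSERT IGNORE, so the database itself skips the duplicates (the table layout below is just an illustration, not my actual schema):

    <?php
    // Assumes: CREATE TABLE entries (
    //   id      INT AUTO_INCREMENT PRIMARY KEY,
    //   url     VARCHAR(255) NOT NULL UNIQUE,  -- URL is the de-duplication key
    //   title   VARCHAR(255) NOT NULL,
    //   created DATETIME NOT NULL              -- set from the system clock, not the feed
    // );
    $pdo = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');

    // Sample parsed feed entries; the second row is silently skipped.
    $feedItems = [
        ['url' => 'https://example.com/a', 'title' => 'First entry'],
        ['url' => 'https://example.com/a', 'title' => 'First entry (duplicate)'],
    ];

    // INSERT IGNORE skips rows whose URL already exists, so re-reading the same
    // feed every 20 minutes never creates duplicates.
    $stmt = $pdo->prepare(
        'INSERT IGNORE INTO entries (url, title, created) VALUES (?, ?, NOW())'
    );
    foreach ($feedItems as $item) {
        $stmt->execute([$item['url'], $item['title']]);
    }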

One last thought: if your application is likely to have new feeds added while in production, you really should also check whether a feed is "new" (i.e. has no previous entries in the database). If it is, you should mark all currently available links as inactive; otherwise, when you add a feed, there will be a block of articles from that feed, all with the same date and time. (Simply put: the method I described is for future additions to the feed only; past articles will not be available.)
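
Continuing the illustration above, that check might look something like this (the `feed_id` and `is_active` columns are assumed additions to the table):

    <?php
    // A feed with no rows yet delivers its whole backlog in one batch, all with
    // the same timestamp; flag that first batch inactive so it stays out of the
    // "latest items" listing.
    $pdo       = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');
    $feedId    = 42;   // hypothetical id of the feed being fetched
    $feedItems = [['url' => 'https://example.com/a', 'title' => 'First entry']];

    $count = $pdo->prepare('SELECT COUNT(*) FROM entries WHERE feed_id = ?');
    $count->execute([$feedId]);
    $isNewFeed = ((int) $count->fetchColumn() === 0);

    $insert = $pdo->prepare(
        'INSERT IGNORE INTO entries (feed_id, url, title, created, is_active)
         VALUES (?, ?, ?, NOW(), ?)'
    );
    foreach ($feedItems as $item) {
        $insert->execute([$feedId, $item['url'], $item['title'], $isNewFeed ? 0 : 1]);
    }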

Hope this helps.

Nir Gavish
One other thought: if you are unable to use cron for any reason, you can always use web-based pinging services, such as http://www.watchour.com/ (which, I feel I have to divulge, is run by a friend of mine). Please feel free to contact me directly if you think I can help any further.
Nir Gavish
Thanks for the great comments and advice. I should have mentioned that the app I am building is a search tool. I use each provider's search API and need to display the results in real time; think car insurance comparison websites. I'm thinking a page that the user waits on while the calculations are performed, backed by a retrieval system, may be the best approach. What do you think?
Mindblip
If, as you said, this could amount to dozens of separate APIs being aggregated, you cannot do this on page load, except with the most captive of audiences; no "normal" internet user will be willing to wait more than 6-7 seconds.
Nir Gavish