Quick intro

I've built a system which requests stats from social network APIs for 1000s of different subjects every 20 minutes. I make one call to each social network for each subject, which means I'm making 1000s of HTTP requests in each 20-minute slot. The results are then processed in a separate task.

Current solution

I'm running PHP from the command line, invoked periodically by supervisor. Data is then saved to MySQL.

Lots of issues!

As PHP can't multi-thread or utilise asynchronous HTTP requests, the API scripts take a long time to fetch the data from the social networks one connection at a time.

As my data model for the 'subjects' gets more complicated, I may need to start joining tables, and I may also need multiple servers.

Future

More and more subjects will be added, along with analysis tools that do lots of number crunching.

I would be really interested to hear what other people are using in this kind of domain, e.g. platform / language / libraries / database / daemon tools etc.

John

A: 

I've built a system which requests stats from social network APIs for 1000s of different subjects every 20 minutes. I make one call to each social network for each subject, which means I'm making 1000s of HTTP requests in each 20-minute slot. The results are then processed in a separate task.

First problem is here - you are polling based on a subject regardless of whether that subject has been updated in the interval. You may find it significantly more efficient to poll the new articles since the last poll and filter out the stuff you're interested in.

As PHP can't multi-thread

Why do you think you need multi-threading to run more than one instance of a PHP script? Define a common datastore containing details of what work needs to be done and a way of partitioning the requests over your preferred number of instances, then write a script which starts up that number of instances, passing a partition identifier to each one.
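As a rough sketch of that partitioning idea (not the only way to do it): each worker could be started with a partition number and the total number of workers, and pick up only the subjects whose ID falls into its partition. The function name `subjectsForPartition`, the integer subject IDs, and the `range(1, 20)` stand-in for a datastore query are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: select the subjects that belong to one worker.
// Assumes integer subject IDs; partition k of n takes IDs where id % n == k.
function subjectsForPartition(array $subjectIds, int $partition, int $total): array
{
    return array_values(array_filter(
        $subjectIds,
        fn (int $id): bool => $id % $total === $partition
    ));
}

// Worker entry point, e.g.: php worker.php <partition> <total>
if (PHP_SAPI === 'cli' && isset($argv[2])) {
    $partition = (int) $argv[1];
    $total = max(1, (int) $argv[2]); // guard against modulo by zero
    $allIds = range(1, 20);          // stand-in for a query against the shared datastore
    foreach (subjectsForPartition($allIds, $partition, $total) as $id) {
        echo "worker $partition would fetch subject $id\n";
    }
}
```

With four workers you would start the script four times (for example as four supervisor programs) with partitions 0 to 3 and a total of 4, so every subject is handled by exactly one instance.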

or utilise asynchronous HTTP requests

The cURL extension can.
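For illustration, here is a minimal sketch of concurrent requests using the cURL extension's `curl_multi_*` functions. The wrapper name `fetchAll`, the default timeout, and the choice to return `null` for failed transfers are assumptions for this example, not part of any API.

```php
<?php
// Fetch several URLs concurrently. Returns an array with the same keys as
// $urls; each value is the response body, or null if the transfer failed.
function fetchAll(array $urls, int $timeoutSeconds = 10): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(1000); // select failed; back off briefly rather than spin
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Per-transfer results are reported through curl_multi_info_read().
    $failed = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        $key = array_search($info['handle'], $handles, true);
        if ($key !== false && $info['result'] !== CURLE_OK) {
            $failed[$key] = true;
        }
    }

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = isset($failed[$key]) ? null : curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    return $results;
}
```

With thousands of URLs you would probably not add them all at once; calling the function on batches (for example via `array_chunk($urls, 50, true)`) keeps the number of simultaneous connections bounded.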

I may need to start joining tables

! OMG ! You must be some kind of computer genius! Can I buy shares in your company!

Seriously - "joining tables" has nothing at all to do with any solution to the problems you've described. "Multiple servers" will do nothing to solve your data complexity issues (but would help with real performance issues).

symcbean
OK, OK, OK, I waffled with the joining tables bit. All I'm saying is that perhaps MongoDB is a better option for scaling, as I can use embedded documents rather than joining massive tables. I'm not an idiot - I promise :) I need to poll the subjects every time, as I'm measuring the difference in activity since the last time. I didn't know that cURL could do that - thanks!
John