
I have a cron job that for the time being runs once every 20 minutes, but will ultimately run once a minute. This job processes potentially hundreds of functions, each of which grabs an XML file remotely, processes it, and performs its tasks. The problem is that, due to the speed of the remote sites, this script can sometimes take a while to run.

Is there a safe way to do this without [a] the script timing out, [b] overloading the server, or [c] one run overlapping the next and not completing its task for that minute before it runs again (would that error out?)

Unfortunately caching isn't an option, as the data changes in near real-time and comes from a variety of sources.

+2  A: 

Have a stack that you keep all the jobs on, and a handful of worker threads whose job it is to do the following (a rough sketch follows the list):

  • Pop a job off the stack
  • Check whether you need to refresh the XML file at all (check ETags, Expires headers, etc.)
  • Grab the XML if need be (this is the bit that could take the time, hence spreading the load over threads). This should time out if it takes too long and flag the failure to someone, as you might have a site down, a dodgy RSS generator, or whatever.
  • Process it

This way you'll be able to grab lots of data each time.
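A minimal sketch of that worker-pool idea in Python (hypothetical names: feed_urls for your list of feeds and process() for your existing per-feed logic; the 10-second timeout and 5 threads are arbitrary choices):

    import queue
    import threading
    import urllib.request

    jobs = queue.Queue()                   # the stack/queue of jobs
    for url in feed_urls:                  # assumed: your list of feed URLs
        jobs.put(url)

    def worker():
        while True:
            try:
                url = jobs.get_nowait()    # pop a job off
            except queue.Empty:
                return                     # nothing left to do
            try:
                # Time out slow hosts so one bad site can't stall the run.
                with urllib.request.urlopen(url, timeout=10) as resp:
                    xml = resp.read()
            except Exception as exc:
                # Site down, dodgy generator, etc. -- flag it for a human.
                print(f"fetch failed for {url}: {exc}")
                continue
            process(xml)                   # assumed: your per-feed processing

    threads = [threading.Thread(target=worker) for _ in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()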

It could be that you don't need to grab the file at all (it would help if you could store the last ETag for each file, etc.)
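For the ETag part, a rough sketch of a conditional GET (assuming the remote server actually supports ETags; you'd persist the returned tag per feed between runs):

    import urllib.error
    import urllib.request

    def fetch_if_changed(url, etag=None):
        # Returns (xml, new_etag), or (None, etag) on HTTP 304 Not Modified.
        req = urllib.request.Request(url)
        if etag:
            req.add_header("If-None-Match", etag)
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read(), resp.headers.get("ETag")
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return None, etag  # unchanged; skip the download entirely
            raise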

One tip: don't expect any of them to be in a valid format. I suggest you have a look at Mark Pilgrim's RSS RegExp reader, which does a damn fine job of reading most RSS feeds.

Addition: I would say hitting the same sites every minute is not really playing nice with those servers, and it creates a lot of work for your own. Do you really need to hit them that often?

Pete Duncanson
+2  A: 

I think a slight design change would benefit this process quite a bit. Given that a remote server could time out, or a connection could be slow, you'll definitely run into concurrency issues if one slow job is still writing files when another one starts up.

I would break it into two separate scripts: one that is only used for fetching the latest XML data, and another for processing it. The fetch script can take its sweet time if it needs to, while the processing script continually looks for the newest file available to process.

This way they can operate independently, and the processing script can always work with the latest data, regardless of how long either script takes to run.
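A minimal sketch of that handoff (assuming both scripts share a spool directory on a POSIX system): the fetch script writes each file via a temp file and a rename, so the processing script can never pick up a half-written download:

    import os
    import tempfile

    def save_feed_atomically(xml_bytes, dest_path):
        # Write to a temp file in the same directory, then swap it into
        # place. os.replace() is atomic on POSIX, so readers see either
        # the old complete file or the new one, never a partial write.
        dirname = os.path.dirname(dest_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=dirname)
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(xml_bytes)
        os.replace(tmp_path, dest_path)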

zombat
Good plan; that way nothing even happens until the data is retrieved successfully. Thanks for the tip!
Patrick
Now that's a nice addition. Good thinking, zombat.
Pete Duncanson
A: 

You should make sure to read the <ttl> tag of the feeds you are grabbing to ensure you are not unnecessarily fetching them before they change. <ttl> holds the update period in minutes, so if a feed has <ttl>60</ttl>, it should only be updated every 60 minutes.
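A rough sketch of honoring <ttl> (assuming RSS 2.0, where <ttl> lives under <channel> and is in minutes, and that you track each feed's last fetch time yourself):

    import time
    import xml.etree.ElementTree as ET

    def still_fresh(xml_bytes, last_fetch_time):
        # True if the feed's <ttl> says the last fetch is still current.
        root = ET.fromstring(xml_bytes)
        ttl = root.findtext("channel/ttl")  # RSS 2.0: <rss><channel><ttl>
        if ttl is None:
            return False                    # no <ttl>; treat as stale
        return (time.time() - last_fetch_time) < int(ttl) * 60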

Matt McCormick