I want to run a program continuously on App Engine. This program will automatically and continuously crawl a website and store the data in its database. Is it possible for the program to keep doing this on App Engine, or will App Engine kill the process?

Note: The website to be crawled is not hosted on App Engine.

+5  A: 

I want to run a program continuously on App Engine.

Can't.

The closest you can get is background-running scheduled tasks that last no more than 30 seconds:

Notably, this means that the lifetime of a single task's execution is limited to 30 seconds. If your task's execution nears the 30 second limit, App Engine will raise an exception which you may catch and then quickly save your work or log progress.

Ben S
+1 to this. You can schedule tasks to run periodically on a site, which should be enough, but you can't have continuous crawling of a page. If the page supports PubSubHubbub (http://code.google.com/p/pubsubhubbub/) or some other push technology, you could have your app subscribe to updates and crawl only when the page actually changes.
Jason Hall
For crawling, I would argue that a task queue is _better_ than a single daemon. It deals with parallelism for you, and the task model suits crawling extremely well.
Nick Johnson
A: 

You cannot literally run one continuous process for more than 30 seconds. However, you can use the Task Queue to have one process call another in a continuous chain. Alternatively you can schedule jobs to run with the Cron service.

Peter Recore
A: 

Use a cron job to periodically check for pages which have not been scraped in the past n hours/days/whatever, and put scraping tasks for some subset of these pages onto a task queue. This way your processes don't get killed for taking too long, and you don't hammer the server you're scraping with excessive bursts of traffic.

I've done this, and it works pretty well. Watch out for task timeouts; if things take too long, split them into multiple phases and be sure to use memcached liberally.
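Here is a minimal sketch of the selection step a cron-triggered handler might run. It is plain Python with an in-memory `deque` standing in for the task queue, so it does not use the App Engine API; the names `select_stale_pages`, `cron_tick`, `STALE_AFTER_SECONDS` and `BATCH_SIZE` are all made up for illustration.

```python
from collections import deque

STALE_AFTER_SECONDS = 6 * 60 * 60  # hypothetical "n hours" threshold
BATCH_SIZE = 2                     # hypothetical cap on tasks per cron run


def select_stale_pages(last_scraped, now,
                       stale_after=STALE_AFTER_SECONDS,
                       batch_size=BATCH_SIZE):
    """Return up to batch_size URLs not scraped recently, oldest first.

    last_scraped maps URL -> timestamp (seconds) of the last scrape.
    """
    stale = [(ts, url) for url, ts in last_scraped.items()
             if now - ts >= stale_after]
    stale.sort()  # oldest timestamp first
    return [url for _, url in stale[:batch_size]]


def cron_tick(last_scraped, queue, now):
    """Enqueue one scraping task per selected page; return the queue length."""
    for url in select_stale_pages(last_scraped, now):
        queue.append({"url": url})  # stand-in for adding a real task
    return len(queue)
```

Capping the batch size is what keeps each cron run short and spreads the load on the site being scraped; pages left over are simply picked up on a later run.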

Peter Scott
+1  A: 

A friend of mine suggested the following:

  • Create a task queue.
  • Start the queue by passing it some data.
  • Use an exception handler to catch DeadlineExceededException.
  • In your handler, enqueue a new task for the same purpose.

You can keep your job running indefinitely this way. You only need to keep an eye on CPU time and storage used.
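The steps above can be sketched roughly as follows. This is a local simulation, not App Engine code: `DeadlineExceeded` is a stand-in class defined here, the 30 second limit is modelled as a fixed number of steps per task, and `make_deadline`, `crawl_task` and `run_chain` are hypothetical names. On the real platform the exception is raised by the runtime, not by a checker you call yourself.

```python
from collections import deque


class DeadlineExceeded(Exception):
    """Local stand-in for the platform's deadline exception."""


def make_deadline(max_steps):
    """Return a checker that raises DeadlineExceeded after max_steps calls."""
    state = {"steps": 0}

    def check():
        state["steps"] += 1
        if state["steps"] > max_steps:
            raise DeadlineExceeded()

    return check


def crawl_task(pending, done, queue, check_deadline):
    """Run one task; on deadline, hand the unfinished URLs to a new task."""
    remaining = list(pending)
    try:
        while remaining:
            check_deadline()            # simulated time limit
            done.append(remaining[0])   # placeholder for fetch + store
            remaining.pop(0)
    except DeadlineExceeded:
        queue.append(remaining)         # enqueue a follow-up task


def run_chain(urls, max_steps_per_task):
    """Drain the queue, giving every task a fresh deadline.

    max_steps_per_task must be at least 1, otherwise no task makes progress.
    """
    queue = deque([list(urls)])
    done, tasks_run = [], 0
    while queue:
        crawl_task(queue.popleft(), done, queue,
                   make_deadline(max_steps_per_task))
        tasks_run += 1
    return done, tasks_run
```

For example, `run_chain(["u1", "u2", "u3", "u4", "u5"], 2)` processes all five URLs across three chained tasks. The important detail is that each task passes on only the work it has not finished, so nothing is crawled twice and nothing is lost when the deadline hits.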

Manjoor
A: 

Try this:

Run your program on App Engine and drive it from the browser. Open the page and click to start with the first URL, which is sent in an Ajax call. The server handles the call, downloads some data from the internet, and returns the next URL to your browser. This is not one long request: each URL is a separate request. You only need to work out in JavaScript how to make the Ajax calls in a loop.

Dingo