I'm writing an aggregation application which scrapes data from a couple of web sources and displays that data with a novel interface. The sites from which I'm scraping update every couple of minutes, and I want to make sure the data on my aggregator is up-to-date.

What's the best way to periodically submit fresh data to my App Engine application from an automated script?

Constraints:

  1. The application is written in Python.

  2. The scraping process for each site takes longer than one second, thus I cannot process the data in an App Engine handler.

  3. The host on which the updater script would run is shared, so I'd rather not store my password on disk.

  4. I'd like to check the code for the application into our codebase. While my associates aren't malicious, they're pranksters, and I'd like to prevent them from inserting fake data into my app.

  5. I'm aware that App Engine supports some remote_api thingey, but I'd have to put that entry point behind authentication (see constraint 3) or hide the URL (see constraint 4).

Suggestions?

A: 

The only way to get data into AppEngine is to call up a Web app of yours (as a Web app) and feed it data through the usual HTTP-ish means, i.e. as parameters to a GET request (for short data) or to a POST (if long or binary).

In other words, you'll have to craft your own little dataloader, which you will access as a Web app and which will in turn stash the data into the database behind AppEngine.

You'll probably want at least password protection on that app so nobody loads bogus data into your app.
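
As a rough illustration of such a dataloader, here is a minimal sketch written as a plain WSGI callable so it runs without the App Engine SDK. The names `dataloader_app`, `STORE`, `UPLOAD_PASSWORD` and the form fields `password`, `source` and `payload` are all assumptions made for the example; in a real app the list would be replaced by datastore writes.

```python
import hmac
from urllib.parse import parse_qs

STORE = []  # stand-in for the datastore (assumption for this sketch)
UPLOAD_PASSWORD = "change-me"  # hypothetical value; keep the real one out of the repo


def _reply(start_response, status, text):
    """Send a small plain-text response."""
    start_response(status, [("Content-Type", "text/plain")])
    return [text.encode("utf-8")]


def dataloader_app(environ, start_response):
    """Accept POSTed form data and stash it, rejecting wrong passwords."""
    if environ.get("REQUEST_METHOD") != "POST":
        return _reply(start_response, "405 Method Not Allowed", "POST only\n")

    # Read exactly the number of bytes the client announced.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    fields = parse_qs(environ["wsgi.input"].read(length).decode("utf-8"))

    # Compare as bytes in constant time to avoid leaking the password by timing.
    supplied = fields.get("password", [""])[0]
    if not hmac.compare_digest(supplied.encode("utf-8"),
                               UPLOAD_PASSWORD.encode("utf-8")):
        return _reply(start_response, "403 Forbidden", "bad password\n")

    STORE.append({
        "source": fields.get("source", [""])[0],
        "payload": fields.get("payload", [""])[0],
    })
    return _reply(start_response, "200 OK", "stored\n")
```

The updater script would then POST `password`, `source` and `payload` as ordinary form fields to whatever URL this handler is mapped to.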

Carl Smotricz
+3  A: 

Write a Task Queue task or an App Engine cron job to handle this. I'm not sure where you heard that there's a limit of 1 second on any sort of App Engine operation: requests are limited to 30 seconds, and URL fetches have a maximum deadline of 10 seconds.

Nick Johnson
Ah! I thought the limit was 1 second. Thanks!
a paid nerd
A: 

Can you break up the scraping process into independent chunks that can each finish within the time limit of an App Engine request (which can run longer than one second, btw)? Then you can just spawn a bunch of tasks using the task API that, when combined, accomplish the full scrape, and use the cron API to spawn off those tasks every N minutes.
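
The chunking step could look something like the sketch below. It is plain Python and only plans the work; the helper names `chunk` and `plan_tasks` and the payload layout are made up for illustration. On App Engine each payload would be handed to the task API, and a cron entry would trigger the planning step every N minutes (neither is shown here).

```python
def chunk(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_tasks(urls, urls_per_task):
    """Return one task payload per chunk; each would be enqueued separately."""
    return [
        {"task": number, "urls": batch}
        for number, batch in enumerate(chunk(urls, urls_per_task))
    ]
```

Picking `urls_per_task` small enough that one batch comfortably fits inside a single request deadline is the main tuning knob.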

Peter Recore
A: 

I asked around and some friends came up with two solutions:

  • Upload a file with a shared secret token along with the application, but when committing to the codebase, change the token.

  • Create a small datastore model with one row, a secret token.

In both cases the token can be used to authenticate POST requests used to upload new data.
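
One possible way to use the token, sketched below, is to sign each POST body with HMAC-SHA256 so that the token itself never travels over the wire. This is an assumption on my part rather than something my friends specified; the function names `sign` and `is_authentic` are hypothetical, and how the signature is transported (a header or form field, say) is left open.

```python
import hashlib
import hmac


def sign(token: bytes, body: bytes) -> str:
    """Return a hex HMAC-SHA256 signature of body under the shared token."""
    return hmac.new(token, body, hashlib.sha256).hexdigest()


def is_authentic(token: bytes, body: bytes, signature: str) -> bool:
    """Check a client-supplied signature in constant time."""
    expected = sign(token, body).encode("ascii")
    return hmac.compare_digest(expected, signature.encode("utf-8"))
```

The updater script would call `sign` before uploading, and the handler would call `is_authentic` with the token it loaded from the uploaded file or the one-row datastore model.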

a paid nerd
A: 

App Engine has tools for uploading data. Refer to http://code.google.com/appengine/docs/python/tools/uploadingdata.html