I have the following script that I am using to scrape data from my uni website and insert it into a GAE datastore:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

__author__ = "Nash Rafeeq"

url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
viewurl = "http://localhost:8000/timekeeper/intake/checkintake/"
inserturl = "http://localhost:8000/timekeeper/intake/addintake/"
print url
mech = Browser()
try:
    page = mech.open(url)
    html = page.read()
except Exception, err:
    print str(err)

soup = BeautifulSoup(html)
# the first <select> on the page holds one <option> per intake code
tables = soup.find('select')
for options in tables:
    intake = options.string
    try:
        # first HTTP request: ask the web app whether this intake exists
        page = mech.open(viewurl + intake)
        html = page.read()
        print html
        if html == "Exist in database":
            print intake, "exists in the database, skipping"
        else:
            # second HTTP request: insert the intake
            page = mech.open(inserturl + intake)
            html = page.read()
            print html
            if html == "Ok":
                print intake, "added to the database"
            else:
                print "Error adding", intake, "to database"
    except Exception, err:
        print str(err)

I am wondering what would be the best way to optimize this script so I can run it on the App Engine servers. As it is, it scrapes over 300 entries and takes well over 10 minutes to insert all the data on my local machine.

The model that is being used to store the data is:

class Intake(db.Model):
    intake=db.StringProperty(multiline=False, required=True)
    #@permalink
    def get_absolute_url(self):
        return "/timekeeper/%s/" % self.intake
    class Meta:
        db_table = "Intake"
        verbose_name_plural = "Intakes"
        ordering = ['intake']
+4  A: 

Divide and conquer.

  1. Make a list of tasks (e.g. URLs to scrape/parse)
  2. Add your tasks into a queue (appengine taskqueue api, amazon sqs, …); see the sketch after this list
  3. Process your queue
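
A minimal sketch of step 2 with the App Engine task queue API, assuming a worker handler mapped to /tasks/process_intake (that path and the handler name are illustrative, not from the thread; on SDKs before 1.4 the import lives at google.appengine.api.labs.taskqueue):

from google.appengine.api import taskqueue
from google.appengine.ext import webapp

class ProcessIntake(webapp.RequestHandler):
    def post(self):
        # each task runs in its own request, so one slow option
        # cannot blow the time limit for the whole batch
        intake = self.request.get('intake')
        # ... check/insert this single intake here ...

# enqueue one task per scraped option
# (intake_code stands in for one scraped option string):
taskqueue.add(url='/tasks/process_intake', params={'intake': intake_code})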
tosh
+1 for the Divide and Conquer strategy: can't miss!
jldupont
It is only scraping from one URL; what is being scraped is a list of options that is over 300 entries. So if I divide it into tasks it is going to be two tasks, I am guessing: one for scraping and one for insertion. Would this work without hitting the quota limits, or is there another way to optimize? Total noob here, so please bear with me.
nashr rafeeg
In the code you pasted, you're doing repeated requests inside a for loop - so it's not just one fetch.
Nick Johnson
Wait... you're making those HTTP requests to update the data. Removing those will cut a _lot_ off your runtime.
Nick Johnson
Basically you have to divide your problem into tasks that are small enough to handle in one request. It seems like you are trying to do 300+ requests (one per option) inside a single request; I would not be surprised if that takes too long. So I'd do one request for the list of options, build the URLs to parse, and queue the parsing tasks up. Then do the scraping one URL/option at a time.
tosh
Of course, as Nick Johnson already mentioned, if you are hosting the script on App Engine you will be able to access the datastore without the need for HTTP round trips :) I thought these requests were accessing external applications.
tosh
+2  A: 

The first thing you should do is rewrite your script to use the App Engine datastore directly. A large part of the time you're spending is undoubtedly because you're using HTTP requests (two per entry!) to insert data into your datastore. Using the datastore directly with batch puts ought to cut a couple of orders of magnitude off your runtime.
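
As a sketch of that batch put (intake_codes stands in for whatever list of strings the parser collects; it is not a name from the thread):

from google.appengine.ext import db
from timekeeper.models import Intake

# build every new entity in memory, then write them all
# in a single datastore round trip
entities = [Intake(intake=code) for code in intake_codes]
db.put(entities)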

If your parsing code is still too slow, you can cut the work up into chunks and use the task queue API to do the work in multiple requests.
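
A hedged sketch of that chunking, assuming a worker at /tasks/insert and a chunk size of 50 (both invented for illustration; the import is google.appengine.api.labs.taskqueue on pre-1.4 SDKs):

from google.appengine.api import taskqueue

CHUNK_SIZE = 50  # assumed; pick whatever each task can finish comfortably

# queue one task per chunk of scraped intake codes
for i in range(0, len(intake_codes), CHUNK_SIZE):
    chunk = intake_codes[i:i + CHUNK_SIZE]
    taskqueue.add(url='/tasks/insert', params={'intakes': ','.join(chunk)})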

Nick Johnson
+1  A: 

Hi, following tosh's and Nick's advice I have modified the script as below:

from google.appengine.api import urlfetch
from BeautifulSoup import BeautifulSoup
from timekeeper.models import Intake
from google.appengine.ext import db

__author__ = "Nash Rafeeq"

url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
try:
    page = urlfetch.fetch(url)
    soup = BeautifulSoup(page.content)
    # the first <select> on the page holds one <option> per intake code
    tables = soup.find('select')
    models = []
    for options in tables:
        intake_code = options.string
        # skip codes that are already stored (note: one query per option)
        if Intake.all().filter('intake =', intake_code).count() < 1:
            models.append(Intake(intake=intake_code))
    if models:
        db.put(models)  # one batch put for everything new
except Exception, err:
    print str(err)
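
One caveat with the version above, sketched as a possible tweak rather than something from the thread: each count() call is its own datastore query, so 300 options still cost up to 300 queries. Assuming the Intake table stays small, the existing codes could be fetched once up front (intake_codes here stands in for the option strings scraped above):

# fetch all stored intake codes in one query, then filter in memory
existing = set(row.intake for row in Intake.all())
models = [Intake(intake=code) for code in intake_codes
          if code and code not in existing]
if models:
    db.put(models)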

Am I on the right track? Also, I am not really sure how to get this to run on a schedule (once a week); what would be the best way to do it?

And thanks for the prompt answers.

nashr rafeeg
You might want to look into App Engine's cron service: http://code.google.com/appengine/docs/python/config/cron.html
tosh
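
For reference, a minimal cron.yaml along those lines; the handler URL and the weekly schedule are assumptions, and the linked docs cover the full schedule syntax:

cron:
- description: weekly timetable scrape
  url: /timekeeper/scrape/
  schedule: every monday 09:00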