I have the following script that I am using to scrape data from my uni website and insert it into a GAE datastore:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

__author__ = "Nash Rafeeq"

url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
viewurl = "http://localhost:8000/timekeeper/intake/checkintake/"
inserturl = "http://localhost:8000/timekeeper/intake/addintake/"
print url
mech = Browser()
try:
    page = mech.open(url)
    html = page.read()
except Exception, err:
    print str(err)

soup = BeautifulSoup(html)
# the first <select> on the page holds one <option> per intake code
tables = soup.find('select')
for options in tables:
    intake = options.string
    try:
        # first HTTP request: ask the web app whether this intake exists
        page = mech.open(viewurl + intake)
        html = page.read()
        print html
        if html == "Exist in database":
            print intake, "exists in the database, skipping"
        else:
            # second HTTP request: insert the intake
            page = mech.open(inserturl + intake)
            html = page.read()
            print html
            if html == "Ok":
                print intake, "added to the database"
            else:
                print "Error adding", intake, "to database"
    except Exception, err:
        print str(err)

I am wondering what would be the best way to optimize this script so I can run it on the App Engine servers. As it is, it scrapes over 300 entries and takes well over 10 minutes to insert all the data on my local machine.

The model that is being used to store the data is:

class Intake(db.Model):
    intake=db.StringProperty(multiline=False, required=True)
    #@permalink
    def get_absolute_url(self):
        return "/timekeeper/%s/" % self.intake
    class Meta:
        db_table = "Intake"
        verbose_name_plural = "Intakes"
        ordering = ['intake']
+4  A: 

Divide and conquer.

  1. Make a list of tasks (e.g. URLs to scrape/parse)
  2. Add your tasks into a queue (appengine taskqueue api, amazon sqs, …); see the sketch after this list
  3. Process your queue
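
A minimal sketch of step 2 with the App Engine task queue API, assuming a worker handler mapped to /tasks/process_intake (that path and the handler name are illustrative, not from the thread; on SDKs before 1.4 the import lives at google.appengine.api.labs.taskqueue):

from google.appengine.api import taskqueue
from google.appengine.ext import webapp

class ProcessIntake(webapp.RequestHandler):
    def post(self):
        # each task runs in its own request, so one slow option
        # cannot blow the time limit for the whole batch
        intake = self.request.get('intake')
        # ... check/insert this single intake here ...

# enqueue one task per scraped option
# (intake_code stands in for one scraped option string):
taskqueue.add(url='/tasks/process_intake', params={'intake': intake_code})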
tosh
+1 for the Divide and Conquer strategy: can't miss!
jldupont
It is only scraping from one URL; what is being scraped is a list of options that is over 300 entries. So if I divide it into tasks it is going to be two tasks, I am guessing: one for scraping and one for insertion. Would this work without hitting the quota limits, or is there another way to optimize? Total noob here, so please bear with me.
nashr rafeeg
In the code you pasted, you're doing repeated requests inside a for loop - so it's not just one fetch.
Nick Johnson
Wait... you're making those HTTP requests to update the data. Removing those will cut a _lot_ off your runtime.
Nick Johnson
Basically you have to divide your problem into tasks that are small enough to handle in one request. It seems like you are trying to do 300+ requests (one per option) inside a single request; I would not be surprised if that takes too long. So I'd do one request for the list of options, build the URLs to parse, and queue the parsing tasks up. Then do the scraping one URL/option at a time.
tosh
Of course, as Nick Johnson already mentioned, if you are hosting the script on App Engine you will be able to access the datastore without the need for HTTP round trips :) I thought these requests were accessing external applications.
tosh
+2  A: 

The first thing you should do is rewrite your script to use the App Engine datastore directly. A large part of the time you're spending is undoubtedly because you're using HTTP requests (two per entry!) to insert data into your datastore. Using the datastore directly with batch puts ought to cut a couple of orders of magnitude off your runtime.
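
As a sketch of that batch put (intake_codes stands in for whatever list of strings the parser collects; it is not a name from the thread):

from google.appengine.ext import db
from timekeeper.models import Intake

# build every new entity in memory, then write them all
# in a single datastore round trip
entities = [Intake(intake=code) for code in intake_codes]
db.put(entities)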

If your parsing code is still too slow, you can cut the work up into chunks and use the task queue API to do the work in multiple requests.
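
A hedged sketch of that chunking, assuming a worker at /tasks/insert and a chunk size of 50 (both invented for illustration; the import is google.appengine.api.labs.taskqueue on pre-1.4 SDKs):

from google.appengine.api import taskqueue

CHUNK_SIZE = 50  # assumed; pick whatever each task can finish comfortably

# queue one task per chunk of scraped intake codes
for i in range(0, len(intake_codes), CHUNK_SIZE):
    chunk = intake_codes[i:i + CHUNK_SIZE]
    taskqueue.add(url='/tasks/insert', params={'intakes': ','.join(chunk)})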

Nick Johnson
+1  A: 

Hi, following tosh's and Nick's advice I have modified the script as below:

from google.appengine.api import urlfetch
from BeautifulSoup import BeautifulSoup
from timekeeper.models import Intake
from google.appengine.ext import db

__author__ = "Nash Rafeeq"

url = "http://webspace.apiit.edu.my/schedule/timetable.jsp"
try:
    page = urlfetch.fetch(url)
    soup = BeautifulSoup(page.content)
    # the first <select> on the page holds one <option> per intake code
    tables = soup.find('select')
    models = []
    for options in tables:
        intake_code = options.string
        # skip codes that are already stored (note: one query per option)
        if Intake.all().filter('intake =', intake_code).count() < 1:
            models.append(Intake(intake=intake_code))
    if models:
        db.put(models)  # one batch put for everything new
except Exception, err:
    print str(err)
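
One caveat with the version above, sketched as a possible tweak rather than something from the thread: each count() call is its own datastore query, so 300 options still cost up to 300 queries. Assuming the Intake table stays small, the existing codes could be fetched once up front (intake_codes here stands in for the option strings scraped above):

# fetch all stored intake codes in one query, then filter in memory
existing = set(row.intake for row in Intake.all())
models = [Intake(intake=code) for code in intake_codes
          if code and code not in existing]
if models:
    db.put(models)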

Am I on the right track? Also, I am not really sure how to get this to run on a schedule (once a week); what would be the best way to do it?

And thanks for the prompt answers.

nashr rafeeg
You might want to look into App Engine's cron service: http://code.google.com/appengine/docs/python/config/cron.html
tosh
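
For reference, a minimal cron.yaml along those lines; the handler URL and the weekly schedule are assumptions, and the linked docs cover the full schedule syntax:

cron:
- description: weekly timetable scrape
  url: /timekeeper/scrape/
  schedule: every monday 09:00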