views: 157

answers: 2

Ok guys, I am having tons of problems getting my working dev server code onto a working production server :). I have a task that goes through and requests urls, collecting and updating data. It takes 30 minutes to run.

I uploaded it to the production server, and when I go to the url for its corresponding .py script, appname.appspot.com/tasks/rrs, after 30 seconds I get google.appengine.runtime.DeadlineExceededError. Is there any way to get around this? Is this a 30 second deadline per page? The script works fine on the development server: I go to the url and the associated .py script runs until completion.

import time
import random
import string
import cPickle
from StringIO import StringIO
try:
    import json
except ImportError:
    import simplejson as json 
import urllib
import pprint
import datetime
import sys
sys.path.append("C:\Program Files (x86)\Google\google_appengine")
sys.path.append("C:\Program Files (x86)\Google\google_appengine\lib\yaml\lib")
sys.path.append("C:\Program Files (x86)\Google\google_appengine\lib\webob")
from google.appengine.api import users
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
from google.appengine.ext import db
class SR(db.Model):
    name = db.StringProperty()
    title = db.StringProperty()
    url = db.StringProperty()

##request url and returns JSON_data
def overview(page):
    u = urllib.urlopen(page)
    bytes = StringIO(u.read())
    ##print bytes
    u.close()
    try:
        JSON_data = json.load(bytes)
        return JSON_data
    except ValueError, e:
        print e, " Couldn't get .json for %s" % page
        return None

##specific code to parse particular JSON data and append new SR objects to the given url list
def parse_json(JSON_data,lists):
    sr = SR()
    sr.name = None   ##data gathered from JSON_data (placeholder)
    sr.title = None  ##data gathered from JSON_data (placeholder)
    sr.url = None    ##data gathered from JSON_data (placeholder)
    lists.append(sr)
    return lists

## I want to be able to request, let's say, 500 pages without timing out
page = 'someurlpage.com'##starting url
url_list = []
for z in range(0,500):
    page = 'someurlpage.com/%s'%z
    JSON_data = overview(page)##get json data for a given url page
    url_list = parse_json(JSON_data,url_list)##parse the json data and append class objects to a given list
db.put(url_list)##finally add object to gae database
+3  A: 

Yes, App Engine imposes a 30-second deadline. One way around it might be to try/except the DeadlineExceededError and put the rest in a taskqueue.

But you can't make your requests run for a longer period.
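For example, a rough sketch of that try/except approach, reusing the overview() and parse_json() functions from the question (the /tasks/rrs handler, the start parameter, and the page count are assumptions; on older SDK versions the import is google.appengine.api.labs.taskqueue):

from google.appengine.api import taskqueue
from google.appengine.ext import webapp
from google.appengine.runtime import DeadlineExceededError

class RRSHandler(webapp.RequestHandler):
    def get(self):
        ## resume from wherever the previous task stopped
        start = int(self.request.get('start', 0))
        try:
            for z in range(start, 500):
                JSON_data = overview('someurlpage.com/%s' % z)
                db.put(parse_json(JSON_data, []))
                start = z + 1
        except DeadlineExceededError:
            ## out of time: hand the remaining pages to a new task and return
            taskqueue.add(url='/tasks/rrs', params={'start': start})

Each queued task gets its own deadline, so the work is split into chunks that can each finish in time.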

You can also try Bulkupdate

Example:

import bulkupdate

class Todo(db.Model):
    page = db.StringProperty()

class BulkPageParser(bulkupdate.BulkUpdater):
    def get_query(self):
        return Todo.all()

    def handle_entity(self, entity):
        JSON_data = overview(entity.page)
        db.put(parse_json(JSON_data, []))
        entity.delete()

# Put this in your view code:
for i in range(500):
    Todo(page='someurlpage.com/%s' % i).put()

job = BulkPageParser()
job.start()
WoLpH
wtf, so let's assume I have a large list of urls I want to request in a for loop, collect data from, and turn into db.Model instances in a list before putting. What would be the best method to do so?
Put the list of urls in a model and execute the queue with bulkupdate. At least... I think that would be the easiest solution ;)
WoLpH
yes, but I need to repeatedly request urls and update information in a for loop, not just upload the urls. I am looking for example code of someone requesting a lot of urls in a for loop that avoids the timeout error.
Can you edit your question and add some example code to it? Then I'll try to create an example for you.
WoLpH
thx, I have converted my existing code to pseudocode as best as I can; I think it will give a better idea of what I am trying to do
@user291071: I've added an example for you. Let's hope it works like that ;)
WoLpH
awesome, quick question though: what if I don't know all the urls ahead of time? For instance, let's say that as I start I am collecting and adding urls to be visited, and there is no way to get them ahead of time.
Just add them to the `Todo` model. Calling `Todo(page=...).put()` should be enough. After that you can just run the `BulkPageParser()` again.
WoLpH
A: 

Ok, so if I am dynamically adding links as I am parsing the pages, I would add to the Todo queue like so, I believe.

def handle_entity(self, entity):
    JSON_data = overview(entity.page)
    data_gathered, new_links = parse_json(JSON_data, [])  ##as before, returns a list of sr objects, plus now a list of new links/pages to visit
    db.put(data_gathered)
    for link in new_links:
        Todo(page=link).put()
    entity.delete()
@user291071: Correct :)
WoLpH
hey WoLpH, another simple follow up: I have the code implemented so far, but the batch is executing very quickly. How do I change the above code so that only 1 request/batch is executed at a time? I put a 1 second delay in handle_entity, and I want only one url request per second, so I need to limit my batch to 1 request. My current code seems to be doing nothing with my PUT_BATCH_SIZE options.