I think we all know this page, but the benchmarks provided date from more than two years ago. So, I would like to know if you could point out the best XML parser around. As I need just an XML parser, the most important thing to me is speed over everything else. My objective is to process some XML feeds (about 25k) that are 4kb in size (this will be a daily task). As you probably know, I'm restricted by the 30-second request timeout. So, what's the best parser (Python only) that I can use?
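For reference, what I mean by "fastest" is raw parse time on feeds of this size. A minimal sketch of the kind of comparison I have in mind, using xml.etree.cElementTree and xml.dom.minidom from the standard library (the feed structure below is made up, since my real feeds aren't shown here):

import timeit
import xml.etree.cElementTree as cET
import xml.dom.minidom as minidom

# Roughly 4kb of made-up feed data, just to have something comparable in size.
sampleXml = ('<feed>'
             + '<entry><name>foo</name><date_of_birth>1990-01-01</date_of_birth></entry>' * 50
             + '</feed>')

def parseWithCElementTree():
  root = cET.fromstring(sampleXml)
  return [entry.findtext('name') for entry in root.findall('entry')]

def parseWithMinidom():
  dom = minidom.parseString(sampleXml)
  return [node.firstChild.data for node in dom.getElementsByTagName('name')]

if __name__ == '__main__':
  # cElementTree is usually the fastest XML parser in the standard library.
  print 'cElementTree:', timeit.timeit(parseWithCElementTree, number=1000)
  print 'minidom:     ', timeit.timeit(parseWithMinidom, number=1000)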

Thanks for your answers.

Edit 01:

@Peter Recore I will. I'm writing some code now and plan to run some profiling in the near future. Regarding your question, the answer is no: processing takes very little time compared with downloading the actual XML feed. But I can't increase Google's bandwidth, so parsing speed is all I can focus on right now.
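Just to be clear about what I mean by profiling: it is little more than splitting download time from parse time, roughly like this (the URL suffix and feed structure are placeholders, not my real code):

import cProfile
import pstats
import urllib2
import xml.etree.cElementTree as cET

feedUrl = 'http://url.that.generates.the.feeds/1.xml'  # placeholder URL

def download(url):
  return urllib2.urlopen(url).read()

def process(xmlData):
  # Placeholder processing; the real feeds have a different structure.
  root = cET.fromstring(xmlData)
  return root.findtext('name')

def fetchOne():
  process(download(feedUrl))

if __name__ == '__main__':
  cProfile.run('fetchOne()', 'feed_profile')
  # The cumulative column shows how the time splits between urlopen and fromstring.
  pstats.Stats('feed_profile').sort_stats('cumulative').print_stats(10)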

My only problem is that I need to do this as fast as possible, because my objective is to get a snapshot of a website's status. As the internet is live and people keep adding and changing its data, any data insertion during the "downloading and processing" time span will mess with my statistical analysis, so I need the fastest method I can get.

I used to do it from my own computer and the process took 24 minutes back then, but now the website has 12 times more information.

+1  A: 

I know that this doesn't answer my question directly, but it does what I needed.

I remembered that XML is not the only format I could use, so instead of using an XML parser I chose to use JSON, which is about 2.5 times smaller in size and therefore quicker to download. I used simplejson as my JSON library.

I used from google.appengine.api import urlfetch to get the JSON feeds in parallel:

from google.appengine.api import urlfetch
from google.appengine.ext import webapp
import simplejson

class GetEntityJSON(webapp.RequestHandler):
  def post(self):
    url = 'http://url.that.generates.the.feeds/'
    if self.request.get('idList'):
      idList = self.request.get('idList').split(',')

      try:
        asyncRequests = self._asyncFetch([url + id + '.json' for id in idList])
      except urlfetch.DownloadError:
        # Dealt with timeout errors (#5) here, as these were very frequent
        asyncRequests = []

      for result in asyncRequests:
        if result.status_code == 200:
          entityJSON = simplejson.loads(result.content)
          # Filled a database entity with some of the JSON info. It goes like this:
          # entity = Entity(
          #   name = entityJSON['name'],
          #   dateOfBirth = entityJSON['date_of_birth']
          # ).put()

    self.redirect('/')

  def _asyncFetch(self, urlList):
    # Fire off all fetches asynchronously, then block for the results.
    rpcs = []
    for url in urlList:
      rpc = urlfetch.create_rpc(deadline=10)
      urlfetch.make_fetch_call(rpc, url)
      rpcs.append(rpc)
    return [rpc.get_result() for rpc in rpcs]

I tried getting 10 feeds at a time, but most of the time an individual feed raised DownloadError #5 (timed out). Then I increased the deadline to 10 seconds and started getting 5 feeds at a time.

But still, fetching 25k feeds 5 at a time results in 5k calls. With a queue that can spawn 5 tasks a second, the total task time should come to about 17 minutes in the end.
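For completeness, this is roughly how the batches get enqueued (the handler URL and queue name below are just illustrative; the queue itself would be configured in queue.yaml with rate: 5/s):

from google.appengine.api import taskqueue

def enqueueBatches(idList, batchSize=5):
  # Each task POSTs a comma-separated batch of ids to the GetEntityJSON handler.
  for i in range(0, len(idList), batchSize):
    batch = idList[i:i + batchSize]
    taskqueue.add(
        url='/getentityjson',                # assumed URL mapping for GetEntityJSON
        params={'idList': ','.join(batch)},
        queue_name='feeds')                  # queue assumed to run at rate: 5/s
  # 25000 feeds / 5 per task = 5000 tasks; at 5 tasks/s that is ~1000s, about 17 minutes.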

Guilherme Coutinho