I think we all know this page, but the benchmarks provided date from more than two years ago. So, I would like to know if you could point out the best XML parser around. As I need just an XML parser, the most important thing to me is speed over everything else. My objective is to process some XML feeds (about 25k) that are 4kb in size (this will be a daily task). As you probably know, I'm restricted by the 30-second request timeout. So, what's the best parser (Python only) that I can use?
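For reference, what I mean by "fastest" is raw parse time on feeds of this size. A minimal sketch of the kind of comparison I have in mind, using xml.etree.cElementTree and xml.dom.minidom from the standard library (the feed structure below is made up, since my real feeds aren't shown here):

import timeit
import xml.etree.cElementTree as cET
import xml.dom.minidom as minidom

# Roughly 4kb of made-up feed data, just to have something comparable in size.
sampleXml = ('<feed>'
             + '<entry><name>foo</name><date_of_birth>1990-01-01</date_of_birth></entry>' * 50
             + '</feed>')

def parseWithCElementTree():
  root = cET.fromstring(sampleXml)
  return [entry.findtext('name') for entry in root.findall('entry')]

def parseWithMinidom():
  dom = minidom.parseString(sampleXml)
  return [node.firstChild.data for node in dom.getElementsByTagName('name')]

if __name__ == '__main__':
  # cElementTree is usually the fastest XML parser in the standard library.
  print 'cElementTree:', timeit.timeit(parseWithCElementTree, number=1000)
  print 'minidom:     ', timeit.timeit(parseWithMinidom, number=1000)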

Thanks for your answers.

Edit 01:

@Peter Recore I will. I'm writing some code now and plan to run some profiling in the near future. Regarding your question, the answer is no: processing takes very little time compared with downloading the actual XML feed. But I can't increase Google's bandwidth, so parsing speed is all I can focus on right now.
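Just to be clear about what I mean by profiling: it is little more than splitting download time from parse time, roughly like this (the URL suffix and feed structure are placeholders, not my real code):

import cProfile
import pstats
import urllib2
import xml.etree.cElementTree as cET

feedUrl = 'http://url.that.generates.the.feeds/1.xml'  # placeholder URL

def download(url):
  return urllib2.urlopen(url).read()

def process(xmlData):
  # Placeholder processing; the real feeds have a different structure.
  root = cET.fromstring(xmlData)
  return root.findtext('name')

def fetchOne():
  process(download(feedUrl))

if __name__ == '__main__':
  cProfile.run('fetchOne()', 'feed_profile')
  # The cumulative column shows how the time splits between urlopen and fromstring.
  pstats.Stats('feed_profile').sort_stats('cumulative').print_stats(10)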

My only problem is that I need to do this as fast as possible, because my objective is to get a snapshot of a website's status. As the internet is live and people keep adding and changing its data, any data insertion during the "downloading and processing" time span will mess with my statistical analysis, so I need the fastest method I can get.

I used to do it from my own computer and the process took 24 minutes back then, but now the website has 12 times more information.

+1  A: 

I know that this doesn't answer my question directly, but it does what I needed.

I remembered that XML is not the only format I could use, so instead of using an XML parser I chose to use JSON, which is about 2.5 times smaller in size and therefore quicker to download. I used simplejson as my JSON library.

I used from google.appengine.api import urlfetch to get the JSON feeds in parallel:

from google.appengine.api import urlfetch
from google.appengine.ext import webapp
import simplejson

class GetEntityJSON(webapp.RequestHandler):
  def post(self):
    url = 'http://url.that.generates.the.feeds/'
    if self.request.get('idList'):
      idList = self.request.get('idList').split(',')

      try:
        asyncRequests = self._asyncFetch([url + id + '.json' for id in idList])
      except urlfetch.DownloadError:
        # Dealt with timeout errors (#5) here, as these were very frequent
        asyncRequests = []

      for result in asyncRequests:
        if result.status_code == 200:
          entityJSON = simplejson.loads(result.content)
          # Filled a database entity with some of the JSON info. It goes like this:
          # entity = Entity(
          #   name = entityJSON['name'],
          #   dateOfBirth = entityJSON['date_of_birth']
          # ).put()

    self.redirect('/')

  def _asyncFetch(self, urlList):
    # Fire off all fetches asynchronously, then block for the results.
    rpcs = []
    for url in urlList:
      rpc = urlfetch.create_rpc(deadline=10)
      urlfetch.make_fetch_call(rpc, url)
      rpcs.append(rpc)
    return [rpc.get_result() for rpc in rpcs]

I tried getting 10 feeds at a time, but most of the time an individual feed raised DownloadError #5 (timed out). Then I increased the deadline to 10 seconds and started getting 5 feeds at a time.

But still, fetching 25k feeds 5 at a time results in 5k calls. With a queue that can spawn 5 tasks a second, the total task time should come to about 17 minutes in the end.
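For completeness, this is roughly how the batches get enqueued (the handler URL and queue name below are just illustrative; the queue itself would be configured in queue.yaml with rate: 5/s):

from google.appengine.api import taskqueue

def enqueueBatches(idList, batchSize=5):
  # Each task POSTs a comma-separated batch of ids to the GetEntityJSON handler.
  for i in range(0, len(idList), batchSize):
    batch = idList[i:i + batchSize]
    taskqueue.add(
        url='/getentityjson',                # assumed URL mapping for GetEntityJSON
        params={'idList': ','.join(batch)},
        queue_name='feeds')                  # queue assumed to run at rate: 5/s
  # 25000 feeds / 5 per task = 5000 tasks; at 5 tasks/s that is ~1000s, about 17 minutes.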

Guilherme Coutinho