I have a web.py server that responds to various user requests. One of these requests involves downloading and analyzing a series of web pages.

Is there a simple way to set up an async / callback-based URL download mechanism in web.py? Low resource usage is particularly important, as each user-initiated request could result in the download of multiple pages.

The flow would look like:

User request -> web.py -> Download 10 pages in parallel or asynchronously -> Analyze contents, return results

I recognize that Twisted would be a nice way to do this, but I'm already in web.py, so I'm particularly interested in something that can fit within web.py.

A: 

I'm not sure I'm understanding your question, so I'll give multiple partial answers to start with.

  • If your concern is that web.py is having to download data from somewhere and analyze the results before responding, and you fear the request may time out before the results are ready, you could use ajax to split the work up. Return immediately with a container page (to hold the results) and a bit of javascript to poll the server for the results until the client has them all (a rough web.py sketch of this follows the list). Thus the client never waits for the server, though the user still has to wait for the results.
  • If your concern is tying up the server waiting for the client to get the results, I doubt that will actually be a problem. Your networking layers should not require you to wait-on-write.
  • If you are worrying about the server waiting while the client downloads static content from elsewhere, either ajax or clever use of redirects should solve your problem
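
Here's a minimal sketch of the first (ajax polling) option on the web.py side, just to make it concrete. The /start and /results routes, the PAGES list, and the job-id bookkeeping are all made up for illustration; a real version would also need the javascript that does the polling:

from __future__ import with_statement
import threading
import urllib
import uuid
import web

urls = (
    '/start', 'Start',
    '/results/(.+)', 'Results',
)

PAGES = ['http://www.example.com/'] * 3   # placeholder list of pages to fetch

jobs = {}                 # job id -> results list, or None while still running
jobs_lock = threading.Lock()

def fetch_all(job_id, page_urls):
    # runs in a background thread; downloads every page, then stores the results
    results = []
    for u in page_urls:
        f = urllib.urlopen(u)
        results.append((u, f.read()))
        f.close()
    with jobs_lock:
        jobs[job_id] = results

class Start:
    def GET(self):
        job_id = uuid.uuid4().hex
        with jobs_lock:
            jobs[job_id] = None
        t = threading.Thread(target=fetch_all, args=(job_id, PAGES))
        t.setDaemon(True)
        t.start()
        # return the container page immediately; its javascript would
        # poll /results/<job_id> until the data shows up
        return "<html><body data-job='%s'>working...</body></html>" % job_id

class Results:
    def GET(self, job_id):
        with jobs_lock:
            results = jobs.get(job_id)
        if results is None:
            return 'pending'
        return repr([(u, len(page)) for u, page in results])

app = web.application(urls, globals())

if __name__ == '__main__':
    app.run()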
MarkusQ
The issue with the ajax solution is cross-domain restrictions - I can't grab content from pages not coming from the originating server. Btw, I'm not worried about waiting on write in this case, but that actually is an issue not taken care of by the networking layer.
Parand
@Parand -- No, but you can set up a cheap passthru proxy in your domain and have them pull through that.
MarkusQ
A: 

I don't know if this will exactly work, but it looks like it might: EvServer: Python Asynchronous WSGI Server has a web.py interface and can do comet style push to the browser client.

If that isn't right, maybe you can use the Concurrence HTTP client for async download of the pages and figure out how to serve them to the browser via ajax or comet.

Van Gale
A: 

Along the lines of MarkusQ's answer, MochiKit is a nice JavaScript library, with robust async methods inspired by Twisted.

goldenratio
A: 

Actually you can integrate Twisted with web.py. I'm not really sure how, as I've only used Twisted with Django.

Vasil
+2  A: 

One option would be to post the work onto a queue of some sort. You could use something Enterprisey like ActiveMQ (with pyactivemq or STOMP as a connector), or something lightweight like Kestrel, which is written in Scala and speaks the same protocol as memcache, so you can just use the Python memcache client to talk to it.

Once you have the queueing mechanism set up, you can create as many or as few worker tasks as you want; they subscribe to the queue and do the actual downloading work. You can even have them live on other machines so they don't interfere with the speed of serving your website at all. When the workers are done, they post the results back to the database or another queue where the webserver can pick them up.

If you don't want to have to manage external worker processes, then you could make the workers threads in the same Python process that is running the webserver, but then obviously they will have greater potential to impact your web page serving performance.
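
If you went the Kestrel route, the worker side could be roughly this small (a sketch only: the queue names and the Kestrel address are made up, and it relies on Kestrel treating memcache set/get as enqueue/dequeue):

import time
import urllib
import memcache   # python-memcached client; Kestrel speaks the same protocol

mc = memcache.Client(['127.0.0.1:22133'])   # assumed Kestrel host/port

def worker():
    while True:
        url = mc.get('pages_to_fetch')        # dequeue the next URL to download
        if url is None:
            time.sleep(0.5)                   # queue empty, back off briefly
            continue
        try:
            data = urllib.urlopen(url).read()
            mc.set('fetched_pages', (url, data))   # post the result to another queue
        except Exception, e:
            mc.set('fetched_pages', (url, str(e)))

if __name__ == '__main__':
    worker()

The web.py side would push URLs with mc.set('pages_to_fetch', url) and later drain 'fetched_pages' (or read from whatever database the workers write to).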

John
+1  A: 

I'd just build a service in Twisted that does the concurrent fetch and analysis, and access it from web.py as a simple HTTP request.
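
Something like this, perhaps (a rough sketch of that idea using twisted.web's getPage and DeferredList; the port, the hard-coded URL list, and the trivial "analysis" are placeholders):

from twisted.internet import reactor
from twisted.internet.defer import DeferredList
from twisted.web import server, resource
from twisted.web.client import getPage

URLS = ['http://www.example.com/'] * 3   # placeholder list of pages to fetch

class Fetch(resource.Resource):
    isLeaf = True

    def render_GET(self, request):
        # kick off all downloads at once; _done fires when every one has finished
        d = DeferredList([getPage(url) for url in URLS], consumeErrors=True)
        d.addCallback(self._done, request)
        return server.NOT_DONE_YET

    def _done(self, results, request):
        # results is a list of (success, page-or-failure) pairs
        pages = [page for ok, page in results if ok]
        request.write('fetched %d pages, %d bytes total\n'
                      % (len(pages), sum(len(p) for p in pages)))
        request.finish()

reactor.listenTCP(8081, server.Site(Fetch()))
reactor.run()

web.py then just does urllib.urlopen('http://localhost:8081/').read() and gets the aggregated result back as an ordinary HTTP response.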

Dustin
A: 

Use the async HTTP client, which uses asynchat and asyncore: http://sourceforge.net/projects/asynchttp/files/asynchttp-production/asynchttp.py-1.0/asynchttp.py/download

dhruvbird
I have a few bug-fixes to the asynchttpclient code. I tried mailing the author, but he doesn't seem to be around. If you want those fixes, you can email me. I have also enabled HTTP request pipelining, which should give an additional speed boost for many smallish requests.
dhruvbird
You can find the bug-fixes and extensions to the asynchttp client here: http://code.google.com/p/asynhttp/
dhruvbird
+1  A: 

You might be able to use urllib to download the files and the Queue module to manage a number of worker threads, e.g.:

from __future__ import with_statement
import urllib
from threading import Thread
from Queue import Queue

NUM_WORKERS = 20

class Dnld:
    def __init__(self):
        self.Q = Queue()
        for i in xrange(NUM_WORKERS):
            t = Thread(target=self.worker)
            t.setDaemon(True)
            t.start()

    def worker(self):
        while 1:
            url, Q = self.Q.get()
            try:
                f = urllib.urlopen(url)
                data = f.read()
                f.close()
                Q.put(('ok', url, data))
            except Exception, e:
                # report the failure back on the per-request queue
                Q.put(('error', url, e))

    def download_urls(self, L):
        Q = Queue() # Create a second queue so the worker 
                    # threads can send the data back again
        for url in L:
            # Add the URLs in `L` to be downloaded asynchronously
            self.Q.put((url, Q))

        rtn = []
        for i in xrange(len(L)):
            # Get the data as it arrives, raising 
            # any exceptions if they occur
            status, url, data = Q.get()
            if status == 'ok':
                rtn.append((url, data))
            else:
                raise data
        return rtn

inst = Dnld()
for url, data in inst.download_urls(['http://www.google.com']*2):
    print url, data
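
Wired into web.py, that might look something like this (a sketch; the /analyze route and the byte-count "analysis" are placeholders):

import web

urls = ('/analyze', 'Analyze')
downloader = Dnld()   # one shared pool of worker threads for all requests

class Analyze:
    def GET(self):
        pages = downloader.download_urls(['http://www.example.com/'] * 10)
        # do whatever analysis you need here; this just reports sizes
        return '\n'.join('%s: %d bytes' % (url, len(data)) for url, data in pages)

app = web.application(urls, globals())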
David Morrissey