I want a fast way to fetch a URL and parse the response as it streams in. Ideally this should be very fast. My language of choice is Python. I have an intuition that Twisted can do this, but I can't find an example.

A: 

You only need to parse a single URL? Then don't worry. Use urllib2 to open the connection and pass the file handle into ElementTree.
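A minimal sketch of that approach, assuming Python 3, where urllib.request.urlopen plays the role urllib2.urlopen had in Python 2. The helper names (parse_handle, fetch_and_parse) and the sample document are made up for illustration, and an io.BytesIO stands in for the HTTP response so the snippet runs without a network connection.

```python
import io
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def parse_handle(handle):
    """Parse XML from any file-like object that has a read() method."""
    return ET.parse(handle).getroot()


def fetch_and_parse(url):
    """Open the URL and hand the response object straight to ElementTree."""
    with urlopen(url) as response:
        return parse_handle(response)


# Stand-in for a network response (hypothetical sample data).
sample = io.BytesIO(b"<items><item id='1'>a</item><item id='2'>b</item></items>")
root = parse_handle(sample)
ids = [item.get("id") for item in root.iter("item")]
print(ids)
```

With a real feed you would call fetch_and_parse(url) instead of building the BytesIO by hand.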

Variations you can try would be to use ElementTree's incremental parser or to use iterparse, but that depends on what your real requirements are. There's "super fast" but there's also "fast enough."
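As a rough illustration of the iterparse variation, the sketch below handles each element as soon as its end tag has been read and then clears it, so the whole tree never has to sit in memory. The iter_records helper, the tag name "entry" and the sample data are assumptions for the example, not part of the question.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(handle, tag):
    """Yield the text of each completed <tag> element, then release it."""
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == tag:
            yield elem.text
            elem.clear()  # reset the element so its contents can be freed


sample = io.BytesIO(b"<feed><entry>one</entry><entry>two</entry><entry>three</entry></feed>")
entries = list(iter_records(sample, "entry"))
print(entries)
```

Any file-like object works in place of the BytesIO, including an open HTTP response.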

It's only when you have multiple simultaneous connections that you should look at Twisted or multithreading.

Andrew Dalke
I'm updating a SQLite database every 60 seconds from an 80 megabyte XML stream. If I could stream the XML, parse it, and update the database before the whole download completes, it would be awesome! Maybe I'm being a bit optimistic, but it seems like Twisted should be able to help me with this.
Influx
Like I said, you've a single input stream. Twisted won't make one bit of difference. If you only need a bit of data from the XML stream you might write a SAX handler directly, which is going to be tedious but about as fast as you can get in Python code. Try it out. If it works, you're done! Looking at http://codespeak.net/lxml/performance.html, you should be able to read at least 3MB/second, so you should be able to parse that file in about 30 seconds.
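A sketch of what such a handler could look like, assuming Python 3 and a made-up document layout: the <price> element, the prices table and the RowHandler class are all hypothetical. Rows are inserted as each element closes rather than after the whole document has been read.

```python
import io
import sqlite3
import xml.sax


class RowHandler(xml.sax.ContentHandler):
    """Insert the text of every <price> element into SQLite as it completes."""

    def __init__(self, conn):
        super().__init__()
        self._conn = conn
        self._buf = None  # collects character data while inside <price>

    def startElement(self, name, attrs):
        if name == "price":
            self._buf = []

    def characters(self, content):
        # characters() may be called several times for one text node.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "price" and self._buf is not None:
            value = float("".join(self._buf))
            self._conn.execute("INSERT INTO prices (value) VALUES (?)", (value,))
            self._buf = None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (value REAL)")

# Stand-in for the HTTP response; any binary file-like object will do.
stream = io.BytesIO(b"<quotes><price>1.5</price><price>2.25</price></quotes>")
xml.sax.parse(stream, RowHandler(conn))
conn.commit()

rows = [r[0] for r in conn.execute("SELECT value FROM prices ORDER BY rowid")]
print(rows)
```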
Andrew Dalke
I think I read the timing information wrong. It looks like it takes 0.14s on a modern machine to parse a 3MB file, for one test case, so 80MB should take under 5 seconds. Like I said, time it for yourself.
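One way to time it, sketched under the assumption that a synthetic document is a fair stand-in for the real feed. The build_sample and time_parse helpers and the <rec> layout are invented for the test, and the numbers will vary by machine and by document shape.

```python
import io
import time
import xml.etree.ElementTree as ET


def build_sample(n_records):
    """Build a synthetic XML document with n_records <rec> elements."""
    body = "".join("<rec id='%d'>value %d</rec>" % (i, i) for i in range(n_records))
    return ("<data>%s</data>" % body).encode("utf-8")


def time_parse(data):
    """Return (record_count, elapsed_seconds) for one iterparse pass over data."""
    start = time.perf_counter()
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "rec":
            count += 1
            elem.clear()
    return count, time.perf_counter() - start


data = build_sample(50000)
count, elapsed = time_parse(data)
print("%d records, %.1f MB, %.3f s" % (count, len(data) / 1e6, elapsed))
```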
Andrew Dalke
+2  A: 

If you need to handle HTTP responses in a streaming fashion, there are a few options.

You can do it via downloadPage:

from xml.sax import make_parser
from twisted.web.client import downloadPage

class StreamingXMLParser:
    def __init__(self):
        self._parser = make_parser()
        # Attach your own handler here: self._parser.setContentHandler(...)

    def write(self, data):
        # Called by downloadPage with each chunk of the response body.
        self._parser.feed(data)

    def close(self):
        # Tell the parser that the document is complete.
        self._parser.close()

parser = StreamingXMLParser()
d = downloadPage(url, parser)
# d fires when the response is completely received

This works because downloadPage writes the response body to the file-like object passed to it. Here, an object with write and close methods satisfies that requirement, but it parses the data incrementally as XML instead of writing it to disk.
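To see the write/close idea in isolation, here is a sketch that leaves Twisted out and feeds the object by hand. It assumes Python 3. StreamingXMLSink and TagCounter are illustrative names, and the loop simply simulates chunks arriving from the network, including one that splits a tag in half.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Count start tags by name as the parser encounters them."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


class StreamingXMLSink:
    """File-like object: write() feeds the parser, close() finishes the document."""

    def __init__(self, handler):
        self._parser = xml.sax.make_parser()
        self._parser.setContentHandler(handler)

    def write(self, data):
        self._parser.feed(data)

    def close(self):
        self._parser.close()


handler = TagCounter()
sink = StreamingXMLSink(handler)
# Simulate the body arriving in pieces; the first piece ends in the middle of a tag.
for piece in (b"<log><ev", b"ent/><event/>", b"<event/></log>"):
    sink.write(piece)
sink.close()
print(handler.counts)
```

The handler sees each element as soon as enough bytes have arrived to complete its start tag, which is the property the downloadPage example relies on.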

Another approach is to hook into things at the HTTPPageGetter level. HTTPPageGetter is the protocol used internally by getPage.

from twisted.internet import reactor
from twisted.web.client import HTTPClientFactory, HTTPPageGetter

class StreamingXMLParsingHTTPClient(HTTPPageGetter):
    def connectionMade(self):
        HTTPPageGetter.connectionMade(self)
        self._parser = make_parser()

    def handleResponsePart(self, data):
        # Called with each chunk of the response body as it arrives.
        self._parser.feed(data)

    def handleResponseEnd(self):
        self._parser.close()
        # Whatever you pass to handleResponse will be the result of the Deferred below.
        self.handleResponse(None)

factory = HTTPClientFactory(url)
factory.protocol = StreamingXMLParsingHTTPClient
# host and port are the server's address, taken from the URL.
reactor.connectTCP(host, port, factory)
d = factory.deferred
# d fires when the response is completely received

Finally, there will be a new HTTP client API soon. Since this isn't part of any release yet, it's not as directly useful as the previous two approaches, but it's somewhat nicer, so I'll include it to give you an idea of what the future will bring. :) The new API lets you specify a protocol to receive the response body. So you'd do something like this:

from twisted.internet import reactor
from twisted.internet.defer import Deferred
from twisted.internet.protocol import Protocol
from twisted.web.client import Agent

class StreamingXMLParser(Protocol):
    def __init__(self):
        self.done = Deferred()

    def connectionMade(self):
        self._parser = make_parser()

    def dataReceived(self, data):
        self._parser.feed(data)

    def connectionLost(self, reason):
        # Called when delivery of the response body ends.
        self._parser.close()
        self.done.callback(None)

agent = Agent(reactor)
# headers is a twisted.web.http_headers.Headers instance, or None.
d = agent.request('GET', url, headers, None)

def cbRequest(response):
    # You can look at the response headers here if you like.
    protocol = StreamingXMLParser()
    response.deliverBody(protocol)
    return protocol.done

d.addCallback(cbRequest) # d fires when the response is fully received and parsed
Jean-Paul Calderone