I want a fast way to grab a URL and parse it while streaming. Ideally this should be super fast. My language of choice is Python. I have an intuition that Twisted can do this, but I can't find an example.
You only need to parse a single URL? Then don't worry. Use urllib2 to open the connection and pass the file handle into ElementTree.
Variations you can try would be to use ElementTree's incremental parser or to use iterparse, but that depends on what your real requirements are. There's "super fast" but there's also "fast enough."
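To give an idea of what the iterparse variation looks like, here is a minimal sketch. It assumes Python 3, where urllib2's functionality lives in urllib.request, and it parses an in-memory stream so that it runs without a network connection. The function name and the "item" tag are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def collect_item_text(stream, tag="item"):
    """Incrementally parse `stream` and return the text of each <tag> element."""
    texts = []
    # "end" events fire as soon as an element's closing tag has been parsed,
    # so elements can be handled before the whole document has arrived.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            texts.append(elem.text)
            elem.clear()  # drop the element's text and children to save memory
    return texts

# In-memory stand-in for a network response. With a real URL you could pass
# the object returned by urllib.request.urlopen(url) instead (an assumption,
# not something tested here).
sample = io.BytesIO(b"<feed><item>a</item><item>b</item></feed>")
items = collect_item_text(sample)
```

Whether this is "fast enough" depends on the document size and on what you do with each element.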
It's only when you start having multiple simultaneous connections that you should look at Twisted or multithreading.
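For the multithreading route, a rough sketch might look like the following. It assumes Python 3 and uses concurrent.futures from the standard library. The in-memory documents are hypothetical stand-ins for several open HTTP responses, and count_elements is an illustrative helper rather than anything from a library.

```python
import io
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

def count_elements(stream):
    """Stream-parse one document and return how many elements it contains."""
    return sum(1 for _event, _elem in ET.iterparse(stream, events=("end",)))

# Hypothetical stand-ins for several simultaneous responses.
documents = [
    io.BytesIO(b"<a><b/><b/></a>"),
    io.BytesIO(b"<a><b/></a>"),
    io.BytesIO(b"<a/>"),
]

# Each document is parsed in its own worker thread; map() returns the
# results in the same order as the inputs.
with ThreadPoolExecutor(max_workers=3) as pool:
    counts = list(pool.map(count_elements, documents))
```

With real connections the threads would spend most of their time waiting on the network, which is where running them concurrently would be expected to help.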
If you need to handle HTTP responses in a streaming fashion, there are a few options.
You can do it via downloadPage:
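from xml.sax import make_parser
from twisted.web.client import downloadPage

class StreamingXMLParser:
    """File-like object that feeds each written chunk to a SAX parser."""

    def __init__(self):
        # Call self._parser.setContentHandler(...) to act on the parsed events.
        self._parser = make_parser()

    def write(self, bytes):
        self._parser.feed(bytes)

    def close(self):
        # Signal end of input so the parser can finish the document.
        self._parser.feed('', True)

parser = StreamingXMLParser()
d = downloadPage(url, parser)
# d fires when the response is completely received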
This works because downloadPage writes the response body to the file-like object passed to it. Here, an object with write and close methods satisfies that requirement, but it parses the data incrementally as XML instead of writing it to disk.
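To see the write/close idea in isolation, here is a small sketch that runs without Twisted. It assumes Python 3 and feeds the parser by hand, with chunk boundaries chosen to fall in the middle of tags. TitleCollector and the "title" element are hypothetical, and close() here calls the parser's own close() method rather than feed('', True).

```python
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

class TitleCollector(ContentHandler):
    """Hypothetical handler: collects the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # list of text pieces while inside a <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._buffer = None

class StreamingXMLParser:
    """File-like sink: every write() is parsed immediately."""

    def __init__(self, handler):
        self._parser = make_parser()
        self._parser.setContentHandler(handler)

    def write(self, data):
        self._parser.feed(data)

    def close(self):
        self._parser.close()  # tells the parser the input is complete

handler = TitleCollector()
sink = StreamingXMLParser(handler)
# Simulate a response body arriving in arbitrary chunks.
for chunk in (b"<feed><ti", b"tle>first</title><title>sec", b"ond</title></feed>"):
    sink.write(chunk)
sink.close()
titles = handler.titles
```

The handler sees each element as soon as enough bytes have arrived, no matter where the chunk boundaries fall.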
Another approach is to hook into things at the HTTPPageGetter level. HTTPPageGetter is the protocol used internally by getPage.
from xml.sax import make_parser
from twisted.internet import reactor
from twisted.web.client import HTTPClientFactory, HTTPPageGetter

class StreamingXMLParsingHTTPClient(HTTPPageGetter):
    def connectionMade(self):
        HTTPPageGetter.connectionMade(self)
        self._parser = make_parser()

    def handleResponsePart(self, bytes):
        # Called with each chunk of the response body as it arrives.
        self._parser.feed(bytes)

    def handleResponseEnd(self):
        self._parser.feed('', True)
        # Whatever you pass to handleResponse will be the result of the Deferred below.
        self.handleResponse(None)

factory = HTTPClientFactory(url)
factory.protocol = StreamingXMLParsingHTTPClient
reactor.connectTCP(host, port, factory)
d = factory.deferred
# d fires when the response is completely received
Finally, there will be a new HTTP client API soon. Since this isn't part of any release yet, it's not as directly useful as the previous two approaches, but it's somewhat nicer, so I'll include it to give you an idea of what the future will bring. :) The new API lets you specify a protocol to receive the response body. So you'd do something like this:
from xml.sax import make_parser
from twisted.internet import reactor
from twisted.internet.defer import Deferred
from twisted.internet.protocol import Protocol
from twisted.web.client import Agent

class StreamingXMLParser(Protocol):
    def __init__(self):
        self.done = Deferred()

    def connectionMade(self):
        self._parser = make_parser()

    def dataReceived(self, bytes):
        self._parser.feed(bytes)

    def connectionLost(self, reason):
        # Called when delivery of the response body ends.
        self._parser.feed('', True)
        self.done.callback(None)

agent = Agent(reactor)
d = agent.request('GET', url, headers, None)

def cbRequest(response):
    # You can look at the response headers here if you like.
    protocol = StreamingXMLParser()
    response.deliverBody(protocol)
    return protocol.done

d.addCallback(cbRequest)  # d fires when the response is fully received and parsed