I have an XML document coming in over a socket that I need to parse and react to on the fly (i.e. parsing a partial tree). What I'd like is a non-blocking way of doing so, so that I can do other things while waiting for more data to come in (without threading).

Something like iterparse would be ideal if it finished iterating when the read buffer was empty, e.g.:

context = iterparse(imaginary_socket_file_wrapper)
while 1:
    for event, elem in context:
        process_elem(elem)
    # iteration of context finishes when socket has no more data
    do_other_stuff()
    time.sleep(0.1)

I guess SAX would also be an option, but iterparse just seems simpler for my needs. Any ideas?

Update:

Using threads is fine, but introduces a level of complexity that I was hoping to sidestep. I thought that non-blocking calls would be a good way to do so, but I'm finding that it increases the complexity of parsing the XML.

+1  A: 

If you won't use threads, you can use an event loop and poll non-blocking sockets.

asyncore is the standard library module for this kind of thing. Twisted is the best-known async library for Python, but it is complex and probably a bit heavyweight for your needs.

Alternatively, multiprocessing gives you a process-based alternative to threads, but it requires Python 2.6, which I assume you aren't running.

One way or the other, I think you're going to have to use threads, extra processes or weave some equally complex async magic.
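
As a rough illustration of polling a non-blocking socket, here is a minimal sketch using only select and socket from the standard library. The names poll_socket and on_data are made up for this example, and socket.socketpair() merely stands in for a real network connection; treat it as one possible shape for the loop, not a complete solution.

```python
import select
import socket


def poll_socket(sock, on_data, timeout=0.0):
    """Hand any available bytes to on_data without blocking.

    Returns False once the peer has closed the connection, True otherwise.
    (poll_socket and on_data are hypothetical names for this sketch.)
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return True          # nothing to read yet; the caller can do other work
    try:
        chunk = sock.recv(4096)
    except BlockingIOError:
        return True          # data vanished between select() and recv()
    if not chunk:
        return False         # an empty read means the peer closed the socket
    on_data(chunk)
    return True


# socketpair() stands in for a real connection in this demo.
sender, receiver = socket.socketpair()
receiver.setblocking(False)
chunks = []

open_before_data = poll_socket(receiver, chunks.append)   # returns immediately
sender.sendall(b"<root><item/>")
open_after_data = poll_socket(receiver, chunks.append, timeout=1.0)
sender.close()
open_after_close = poll_socket(receiver, chunks.append, timeout=1.0)
receiver.close()
```

Between calls to poll_socket the program is free to do other work, and on_data is where the bytes would be handed to an incremental XML parser.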

wbg
asyncore looks great, but I would still need a way of parsing the XML that doesn't exceed the complexity of just putting iterparse into another thread (say)
Peter Gibson
Right. Wasn't sure why threads were out. Eventlet looks good, but not significantly simpler than a thread in this case. The way I see it, if you want concurrent/async/threaded behaviour, you'll have to pay the price for that complexity one way or another. An XML library (or almost any other library) isn't going to come with that kind of functionality baked in, because it's been done better at a different abstraction level.
wbg
+3  A: 

I think there are two components to this: the non-blocking network I/O and a stream-oriented XML parser.

For the former, you'd have to pick a non-blocking network framework, or roll your own solution for this. Twisted certainly would work, but I personally find inversion of control frameworks difficult to wrap my brain around. You would likely have to keep track of a lot of state in your callbacks to feed the parser. For this reason I tend to find Eventlet a bit easier to program to, and I think it would fit well in this situation.

Essentially it allows you to write your code as if you were using a blocking socket call (using an ordinary loop or a generator or whatever you like), except that you can spawn it into a separate coroutine (a "greenlet") that will automatically perform a cooperative yield when I/O operations would block, thus allowing other coroutines to run.

This makes using any stream-oriented parser trivial again, because the code is structured like an ordinary blocking call. It also means that many libraries that don't directly deal with sockets or other I/O (like the parser for instance) don't have to be specially modified to be non-blocking: if they block, Eventlet yields the coroutine.

Admittedly Eventlet is slightly magic, but I find it has a much easier learning curve than Twisted, and results in more straightforward code because you don't have to turn your logic "inside out" to fit the framework.
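
To make the cooperative-yield idea concrete, here is a toy sketch built from plain Python generators. This is not Eventlet's API: the reader, other_work and run names are invented for the illustration, and the explicit yield statements stand in for the points where Eventlet would switch coroutines implicitly when I/O would block.

```python
from collections import deque


def reader(chunks, log):
    """Pretend to read from a socket, yielding where a real read would block."""
    for chunk in chunks:
        yield                        # stand-in for "recv() would block"
        log.append(("read", chunk))


def other_work(steps, log):
    """Some unrelated task that also gives up control between steps."""
    for i in range(steps):
        yield
        log.append(("work", i))


def run(tasks):
    """Round-robin scheduler: resume each task in turn until all finish."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)               # run the task up to its next yield
        except StopIteration:
            continue                 # task finished; drop it from the queue
        queue.append(task)


log = []
run([reader(["<a>", "</a>"], log), other_work(2, log)])
```

Each task reads like ordinary sequential code, yet the two interleave. With greenlets the same interleaving happens without the visible yield statements.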

edarc
Twisted is indeed pretty hard to get your head around, though once you understand `Deferred`s, it's stunningly powerful. I've had a brief look at Eventlet, and it's based on threads, which the OP ruled out. Didn't say why threads weren't an option, though. If it's a matter of complexity, Eventlet looks full of win.
wbg
Actually Eventlet does not require threads, but it is orthogonal to and compatible with them. It uses a C extension called "greenlet" which implements cooperative coroutines, which can be thought of as a generalized version of Python's generators: a greenlet can yield control and be resumed later right where it left off. Eventlet uses this capability to automatically yield any greenlet when it performs an I/O operation that would block, later resuming it when the I/O completes. In fact, it uses a Twisted-like reactor under the hood, but to schedule greenlets instead of exposing it directly.
edarc
Twisted seems to be focused towards HTML and HTTP. Are there any examples of using Twisted with plain ol' XML and TCP sockets?
Peter Gibson
edarc: Oh right, got it. What's the advantage of that vs using threads?
wbg
Greenlets are *much* lighter than OS threads. Think of them like Erlang processes--they are a fabrication of the virtual machine, not real OS threads with the attendant memory and context switching overhead. Thus, you get the concurrency benefits of the non-blocking event-driven model, with the straightforward control flow of a threaded model.
edarc
A: 

Diving into the iterparse source provided the solution for me. Here's a simple example of building an XML tree on the fly and processing elements after their close tags:

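import xml.etree.ElementTree as etree

# XMLTreeBuilder and the underscore-prefixed attributes used below are
# internals of the old ElementTree implementation, so this may break elsewhere.
parser = etree.XMLTreeBuilder()

def end_tag_event(tag):
    # Let the tree builder close the element, then handle the finished node.
    node = parser._end(tag)
    print(node)

# Replace expat's end-element callback with our own.
parser._parser.EndElementHandler = end_tag_event

def data_received(data):
    # Feed each chunk as it arrives; the callback fires as tags close.
    parser.feed(data)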

In my case I ended up feeding it data from twisted, but it should work with a non-blocking socket also.
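
For readers on Python 3.4 or later, the standard library offers the same feed-as-you-go pattern through xml.etree.ElementTree.XMLPullParser, without reaching into private attributes. The sketch below is an alternative to the code above, not what I originally used. The closed_tags list and drain_events helper are names invented for the example. Depending on the underlying Expat version, an event may show up a feed later than you expect, so the example drains once more after close().

```python
import xml.etree.ElementTree as ET

# Only report elements once their closing tag has been seen.
parser = ET.XMLPullParser(events=("end",))
closed_tags = []     # hypothetical sink; replace with real element handling


def drain_events():
    """Collect every event the parser has produced so far (never blocks)."""
    for event, elem in parser.read_events():
        closed_tags.append(elem.tag)


def data_received(data):
    """Call this with each chunk as it arrives from the socket."""
    parser.feed(data)
    drain_events()


# Chunks deliberately split in the middle of text and tags.
for chunk in (b"<root><item>on", b"e</item><item>two</it", b"em></root>"):
    data_received(chunk)

parser.close()       # signal end of stream
drain_events()       # pick up anything reported late
```

Because read_events() only returns what has already been parsed, data_received can be called from a select loop or a Twisted dataReceived callback alike.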

Peter Gibson