ansaurus

Question

Delayed execution in python for big data

Answer 1

A:

I don't know anything about Cassandra or NumPy, but if you adapt your second approach (using NumPy) to process the data in reasonably sized chunks, you might benefit from the CPU and/or filesystem cache. That could avoid the slowdown caused by looping over the data twice, without giving up the benefit of optimized processing functions.
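A minimal sketch of the chunking idea, assuming the data is (or can be viewed as) a NumPy array of incomes. The helper name process_in_chunks, the chunk size and the two processing steps are illustrative assumptions, not something taken from the question:

```python
import numpy as np


def process_in_chunks(income, chunk_size=100000):
    """Hypothetical helper: run several vectorized steps chunk by chunk."""
    rich = np.empty(income.shape[0], dtype=bool)
    total = 0.0
    for start in range(0, income.shape[0], chunk_size):
        chunk = income[start:start + chunk_size]  # slicing gives a view, not a copy
        # Both steps touch the same chunk back to back, so the second one
        # has a chance of finding the data still in cache.
        rich[start:start + chunk_size] = chunk > 100000
        total += chunk.sum()
    return rich, total


income = np.array([50000.0, 150000.0, 99999.0, 250000.0])
rich, total = process_in_chunks(income, chunk_size=2)
```

Each step is still a vectorized NumPy call; only the outer loop over chunks runs in Python. Whether this actually helps depends on the chunk size and the hardware, so it is worth measuring.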

Antoine P. 2010-01-05 23:42:47

Answer 2

A:

I don't have a perfect answer, just a rough idea, but maybe it is worthwhile. It centers around Python generators, in sort of a producer-consumer style combination.

For one, as you don't want to loop twice, I think there is no way around an explicit loop for the rows, like this:

for row in rows(data):
    # do stuff with row

Now, feed the row to (an arbitrary number of) consumers that are - don't choke - generators again. But this time you would be using the send method of the generator. As an example of such a consumer, here is a sketch of riches:

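def riches():
    rich_data = []
    while True:
        row = (yield)  # receive the next row passed in via send()
        if row is None:  # None signals the end of the data
            break
        rich_data.append("Rich" if row.income > 100000 else "Poor")
    yield rich_data  # hand the collected result back to the caller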

The first yield (an expression) just feeds the individual rows into riches, which does its thing, here building up a result list. After the while loop, the second yield (a statement) is used to actually provide the result data to the caller.
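To see the two kinds of yield in action, here is a small trace, assuming Python 3. The consumer averager is a hypothetical stand-in with the same shape as riches, chosen so that it runs without any row objects:

```python
def averager():
    """Hypothetical consumer: collects numbers, then yields their mean."""
    values = []
    while True:
        value = (yield)              # yield expression: receives what send() passes in
        if value is None:            # None is the agreed end-of-data signal
            break
        values.append(value)
    yield sum(values) / len(values)  # yield statement: hands the result back


consumer = averager()
primed = next(consumer)       # runs to the first yield, which produces None
step = consumer.send(10)      # resumes, loops back to the same yield: None again
consumer.send(20)
result = consumer.send(None)  # leaves the loop and stops at the final yield
```

Note that the result comes back as the return value of the send(None) call that ends the loop. Calling next() on the generator after that would run it to completion and raise StopIteration.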

Going back to the caller loop, it could look something like this:

richConsumer = riches()
next(richConsumer)  # advance to the first yield
for row in rows(data):
    richConsumer.send(row)
    # other consumers.send(row) here
# Sending None makes the consumer exit its inner loop; send() then
# returns the value of the final yield, i.e. the result data.
data.rich = richConsumer.send(None)

I haven't tested that code, but that's how I think about it. It doesn't have the nice compact syntax of the vector-based functions. But it makes the main loop very simple and encapsulates all processing in separate consumers. Additional consumers can be nicely stacked after each other. And the API could be further polished by pushing generator managing code behind e.g. object boundaries. HTH
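As a rough illustration of hiding the generator handling behind an object, here is one possible sketch, assuming Python 3. The Consumer class, the two consumer functions and the plain list of incomes are all hypothetical names made up for this example:

```python
class Consumer:
    """Hypothetical wrapper that hides priming and result collection."""

    def __init__(self, generator_function):
        self._gen = generator_function()
        next(self._gen)              # advance to the first yield

    def send(self, row):
        self._gen.send(row)

    def result(self):
        return self._gen.send(None)  # end the loop and fetch the final yield


def label_consumer():
    labels = []
    while True:
        income = (yield)
        if income is None:
            break
        labels.append("Rich" if income > 100000 else "Poor")
    yield labels


def total_consumer():
    total = 0
    while True:
        income = (yield)
        if income is None:
            break
        total += income
    yield total


consumers = {"rich": Consumer(label_consumer), "total": Consumer(total_consumer)}
for income in [50000, 150000, 250000]:  # single pass over the data
    for consumer in consumers.values():
        consumer.send(income)
results = {name: consumer.result() for name, consumer in consumers.items()}
```

The main loop stays a single pass, and adding another computation only means registering one more consumer. Since None is used as the end signal here, this sketch would not work for data in which None is a legitimate row value.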

ThomasH 2010-01-07 22:19:15

I'm not sure I fully follow what you're trying to do. How does this improve over my first option, which just loops over rows, where rows are a view onto the underlying data?

Tristan 2010-01-07 23:56:00

Well, for one it confirms the need for a top-level loop. For another, I thought one of your concerns was to have processing code out of the loop body and in separate functions (you spoke of the advantage of "compiled functions" although I'm not sure how much sense this makes in Python; but it makes a lot of sense if you are thinking of pushing those functions out to C code later). A third concern was a neat API for the processing functions. Given those constraints I thought this producer-consumer approach was a good compromise. But YMMV.

ThomasH 2010-01-08 08:18:45
