I'm trying to think about how a Python API might look for large datastores like Cassandra. R, Matlab, and NumPy tend to use the "everything is a matrix" formulation and execute each operation separately. This model has proven effective for data that fits in memory. However, one of the benefits of SAS for big data is that it executes row by row, doing all the row calculations before moving to the next row. For a datastore like Cassandra, this model seems like a huge win -- we only loop through the data once.

In Python, SAS's approach might look something like:

with load('datastore') as data:
    for row in rows(data):
        row.logincome = log(row.income)
        row.rich = "Rich" if row.income > 100000 else "Poor"

This is (too?) explicit, but it has the advantage of looping only once. For smaller datasets, performance will be very poor compared to NumPy because the functions aren't vectorized in compiled code. In R/NumPy we would have the much more concise, compiled version:

data.logincome = log(data.income)
data.rich = ifelse(data.income > 100000, "Rich", "Poor")

This will execute extremely quickly because log and ifelse are both compiled functions that operate on vectors. A downside, however, is that we loop twice. For small datasets this doesn't matter, but for a Cassandra-backed datastore, I don't see how this approach works.

Question: Is there a way to keep the second API (like R/NumPy/Matlab) but delay computation, perhaps by calling a sync(data) function at the end?
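For concreteness, here is a minimal sketch of what such a deferred API might look like. Everything in it is hypothetical (Expr, Table, col, sync, and the in-memory row list standing in for Cassandra); the point is just that assignments record operations, and sync executes them in a single pass:

import math

class Expr(object):
    """A recipe for computing one value from a row, built up lazily."""
    def __init__(self, fn):
        self.fn = fn
    def __gt__(self, other):
        return Expr(lambda row: self.fn(row) > other)

def col(name):
    return Expr(lambda row: row[name])

def log(expr):
    return Expr(lambda row: math.log(expr.fn(row)))

def ifelse(cond, if_true, if_false):
    return Expr(lambda row: if_true if cond.fn(row) else if_false)

class Table(object):
    def __init__(self, rows):
        self._rows = rows       # stand-in for rows streamed from Cassandra
        self._pending = {}      # column name -> Expr, not yet computed
    def __getattr__(self, name):
        return col(name)        # data.income builds an expression lazily
    def __setattr__(self, name, value):
        if isinstance(value, Expr):
            self._pending[name] = value   # record, don't compute
        else:
            object.__setattr__(self, name, value)

def sync(data):
    # The single pass over the rows: every pending column is computed here.
    for row in data._rows:
        for name, expr in data._pending.items():
            row[name] = expr.fn(row)
    data._pending.clear()

data = Table([{"income": 50000}, {"income": 200000}])
data.logincome = log(data.income)
data.rich = ifelse(data.income > 100000, "Rich", "Poor")
sync(data)  # loops through the data exactly once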

Alternative ideas? It would be nice to maintain the NumPy-style syntax, since users will be using NumPy for smaller operations and will have an intuitive grasp of how it works.

A: 

I don't know anything about Cassandra/NumPy, but if you adapt your second approach (using NumPy) to process data in chunks of a reasonable size, you might benefit from the CPU and/or filesystem cache and thereby avoid the slowdown of looping over the data twice, without giving up the benefit of optimized processing functions.
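As a rough sketch of the idea (fetch_chunks and write_back are hypothetical, and I'm assuming each chunk is a dict mapping column names to NumPy arrays for a block of rows):

import numpy as np

def process(store, chunk_size=100000):
    # fetch_chunks() and write_back() are hypothetical stand-ins for
    # whatever streams blocks of rows to and from the datastore.
    for chunk in fetch_chunks(store, chunk_size):
        income = chunk["income"]
        # Vectorized, compiled functions run once per chunk, while the
        # block is still hot in the CPU/filesystem cache.
        chunk["logincome"] = np.log(income)
        chunk["rich"] = np.where(income > 100000, "Rich", "Poor")
        write_back(store, chunk)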

Antoine P.
A: 

I don't have a perfect answer, just a rough idea, but maybe it is worthwhile. It centers on Python generators, in a sort of producer-consumer combination.

For one, since you don't want to loop twice, I think there is no way around an explicit loop over the rows, like this:

for row in rows(data):
    # do stuff with row

Now, feed each row to (an arbitrary number of) consumers that are - don't choke - generators again, driven through the generator's send method. As an example of such a consumer, here is a sketch of riches:

def riches():
    rich_data = []
    while True:
        row = (yield)          # receive the next row via send()
        if row is None:
            break
        rich_data.append("Rich" if row.income > 100000 else "Poor")
    yield rich_data            # hand the result back to the caller

The first yield (an expression) is just there to feed the individual rows into riches. The consumer does its thing, here building up a result list. After the while loop, the second yield hands the result data back to the caller; with the send protocol, it arrives as the return value of the send(None) call that ends the loop.

Going back to the caller loop, it could look something like this:

richConsumer = riches()
next(richConsumer)                    # advance to the first yield
for row in rows(data):
    richConsumer.send(row)
    # other consumers' send(row) calls go here
data.rich = richConsumer.send(None)   # exit the inner loop, collect the result

I haven't tested that code, but that's how I think about it. It doesn't have the nice compact syntax of the vector-based functions. But it makes the main loop very simple and encapsulates all processing in separate consumers. Additional consumers can be nicely stacked after each other. And the API could be further polished by pushing the generator-managing code behind, e.g., object boundaries, as sketched below.
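For instance, a purely illustrative wrapper (this Consumer class is not an existing API; riches, rows, and data are from the sketch above) could hide the send/next bookkeeping:

class Consumer(object):
    def __init__(self, gen_fn):
        self._gen = gen_fn()
        next(self._gen)               # advance to the first yield
    def send(self, row):
        self._gen.send(row)
    def result(self):
        return self._gen.send(None)   # end the loop, collect the final yield

consumers = {"rich": Consumer(riches)}
for row in rows(data):
    for consumer in consumers.values():
        consumer.send(row)
for name, consumer in consumers.items():
    setattr(data, name, consumer.result())

HTH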

ThomasH
I'm not sure I fully follow what you're trying to do. How does this improve over my first option, which just loops over rows, where the rows are a view onto the underlying data?
Tristan
Well, for one it confirms the need for a top-level loop. For another, I thought one of your concerns was to keep processing code out of the loop body and in separate functions (you spoke of the advantage of "compiled functions", though I'm not sure how much sense that makes in Python; it makes a lot of sense if you are thinking of pushing those functions out to C code later). A third concern was a neat API for the processing functions. Given those constraints, I thought this producer-consumer approach was a good compromise. But YMMV.
ThomasH