I'm trying to think about how a Python API might look for large datastores like Cassandra. R, Matlab, and NumPy tend to use the "everything is a matrix" formulation and execute each operation separately. This model has proven itself effective for data that can fit in memory. However, one of the benefits of SAS for big data is that it executes row by row, doing all the row calculations before moving to the next. For a datastore like Cassandra, this model seems like a huge win -- we only loop through data once.
In Python, SAS's approach might look something like:
with load('datastore') as data:
for row in rows(data):
row.logincome = row.log(income)
row.rich = "Rich" if row.income > 100000 else "Poor"
This is (too?) explicit but has the advantage of only looping once. For smaller datasets, performance will be very poor compared to NumPy because the functions aren't vectorized using compiled code. In R/Numpy we would have the much more concise and compiled:
data.logincome = log(data.income)
data.rich = ifelse(data.income > 100000, "Rich", Poor")
This will execute extremely quickly because log
and ifelse
are both compiled functions that operator on vectors. A downside, however, is that we will loop twice. For small datasets this doesn't matter, but for a Cassandra backed datastore, I don't see how this approach works.
Question: Is there a way to keep the second API (like R/Numpy/Matlab) but delay computation. Perhaps by calling a sync(data) function at the end?
Alternative ideas? It would be nice to maintain the NumPy type syntax since users will be using NumPy for smaller operations and will have an intuitive grasp of how that works.