I'm writing a C extension to calculate the standard deviation. Performance is important because it will run over large data sets. I'm having a hard time figuring out how to get the value of a PyObject once I get the item from a list. This is my first time writing a C extension for Python, and any help is appreciated. Apparently I don't know how to use the code sample button correctly :(

This is what I have so far:

#include <Python.h>

static PyObject*
func(PyObject *self, PyObject *args)
{
  PyObject *list, *item;
  Py_ssize_t i, len;
  if (!PyArg_UnpackTuple(args, "func", 1, 1, &list)){
    return NULL;
  }
  printf("hello world\n");
  Py_INCREF(list);
  len = PyList_GET_SIZE(list);
  for (i=0;i<len;i++){
    item = PyList_GET_ITEM(list, i);
    PyObject_Print(item,stdout,0);
  }
  return list;
}

static char func_doc[] = "This function calculates standard deviation.";

static PyMethodDef std_methods[] = {
  {"func", func, METH_VARARGS, func_doc},
  {NULL, NULL}
};

PyMODINIT_FUNC
initstd(void)
{
  Py_InitModule3("std", std_methods, "This is a sample docstring.");
}
+4  A: 

You may be reinventing the wheel. Several scientific computing libraries for Python, such as SciPy and NumPy, already implement functions such as standard deviation; they are mostly wrappers around C libraries.

Chris S
I'm currently using NumPy for the calculation, but the list has to be converted to a NumPy array first. I'd like to avoid that, since the lists are large and the entire data set is hundreds of megabytes. I'm not an expert, though, and maybe the conversion doesn't cost much, but I'd like to measure the difference in speed using one of the Python profilers.
Xavier
Converting to a NumPy array may not be as big a deal as you imagine. If you're loading your data from a file or database, you're already incurring considerable overhead to load it as a normal Python list. If you load the data directly into a NumPy array to begin with, you'll eliminate that overhead. Even if you have to "convert" from a normal list to an array, you can save your NumPy array for fast loading later on. http://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html
Chris S
A: 

This method is limited by the number of items the list can hold.

Another design would keep running totals and let you add points one at a time, until you overflowed the double.

duffymo
+1  A: 

Once you have item, you can get its float value with PyNumber_Float:

PyObject* floatitem = PyNumber_Float(item);

Now you need to check for an error and exit if there is one (if(!floatitem) return 0, or a goto to a location where you decref anything you may have incref'd earlier in your code, e.g. list in your case). If there is no error, PyFloat_AsDouble gives you the required double value for use in the rest of your C-coded loop:

double ditem = PyFloat_AsDouble(floatitem);

after which you can decref floatitem and go your merry way. Don't worry overmuch about conversion overhead in PyNumber_Float -- there won't be any if you're passed a list of floats in the first place;-). If you do still worry (would rather give an error if somebody does pass a non-float requiring conversion) you can use PyFloat_Check if you insist (but I'd suggest at least special-casing int and long items unless you want truly perplexed and unhappy users;-). In a similar vein, I'd also strongly recommend studying and using PySequence_Fast and friends, rather than astonishing users by specifically requiring lists rather than other types of sequences!-).

Alex Martelli
+1  A: 

Just to mention that there is almost certainly a better way than writing a C extension.

The first option is to use NumPy. In your comment on the other answer, you mention that it is expensive to convert the list to an array. That may be true if the standard deviation calculation is the only thing you are doing with the data, which is highly unlikely.

Barring that, I would go for Cython. Here is a comparison of Cython and NumPy. Cython underperforms NumPy in this case, but more importantly the code implemented for csum can trivially be changed to compute the standard deviation.

Muhammad Alkarouri
+1  A: 

Have you considered using Cython to write your extension? It is perfect for this type of thing.

gnibbler
A: 

If you want simple statistics over large datasets, you can randomly sample a subset of the data and take the average and standard deviation of that. That will have a "standard error" of approximation, and the more samples you take, the smaller that will be. If you don't need high precision of the statistics, you don't need to read all the data.

Mike Dunlavey