The real goal here is to find the quantile means (or sums, or medians, etc.) in Python. Since I'm not a power user of Python but have used R for a while, my chosen route is via Rpy. However, I ran into the problem that the returned list of means does not follow the order of the quantiles. In particular, I have the following in R:

> a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> b = c(2, 4, 20, 40, 200, 400, 2000, 4000, 20000, 40000)
> prob = seq(0,5)/5
> br = quantile(a,prob)
> rcut = cut(a, br, include.lowest = TRUE)
> quintile_means = tapply(b, rcut, mean)
> quintile_means
[1,2.8] (2.8,4.6] (4.6,6.4] (6.4,8.2]  (8.2,10] 
      3        30       300      3000     30000 

which is all very good. However, if I translate the code into Rpy, I get

>>> import rpy
>>> from rpy import r
>>> a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> b = [2, 4, 20, 40, 200, 400, 2000, 4000, 20000, 40000]
>>> prob = [ x / 5.0 for x in range(6)]
>>> br = r.quantile(a, prob)
>>> rcut = r.cut(a, br, include_lowest=r.TRUE)
>>> quintile_means = r.tapply(b, rcut, r.mean)
>>> print quintile_means
[30.0, 300.0, 3000.0, 30000.0, 3.0]

Note the final list is mis-ordered (we know it because a and b are both ordered in this case). In general, I just have no way to recover the correct order from the lowest to highest quantile in Rpy. Any suggestions?

In addition (not in substitution, as I'd like to know the answer to the above question), if you can suggest a way to directly perform the analysis in python, that will be great too. (I don't have numpy or scipy installed.) Thx!

EDIT: To clarify, a and b are paired but not necessarily ordered. For example, a is the size of eyes and b is the size of nose. I'm trying to find out, for the various quantiles of a, what the means of the corresponding values of b are. Thanks.

A: 

I just have no way to recover the correct order from the lowest to highest quantile in Rpy

If sorting the list from the lowest to the highest solves your problem, try sorted(quintile_means).

leoluk
No, that doesn't solve the problem. For example, if `b = [20, 40, 2, 4, 200, 400, ...]`, then the correctly ordered output should be `[30, 3, 300, ...]`. I would have done that if it were so simple.
Zhang18
+2  A: 

Try rpy2.

With rpy2 >= 2.1.0, this could be:

from rpy2.robjects.vectors import IntVector
from rpy2.robjects.packages import importr
base = importr('base')
stats = importr('stats')

a = IntVector((1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
b = IntVector((2, 4, 20, 40, 200, 400, 2000, 4000, 20000, 40000))
prob = base.seq(0,5).ro / 5
br = stats.quantile(a,prob)
rcut = base.cut(a, br, include_lowest = True)
quintile_means = base.tapply(b, rcut, base.mean)  # mean lives in R's base package, not stats
print(quintile_means)
lgautier
+2  A: 

If you don't need the labels (e.g. (8.2,10]), you could call cut with labels=FALSE, so that it returns integer bin codes instead of a factor with interval labels. This should keep the order (and speed up your code for free).

Marek
Works like a charm. Thx.
Zhang18
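
On the second part of the question (doing the analysis directly in Python, without numpy or scipy), here is a minimal sketch rather than a definitive implementation. It assumes R's default quantile definition (type 7, linear interpolation) and the right-closed bins that cut produces with include.lowest = TRUE. The names `quantile` and `quantile_means` are hypothetical helpers written for this illustration, not part of any library.

```python
from bisect import bisect_left


def quantile(values, p):
    """Quantile by linear interpolation between order statistics
    (assumed to match R's default, type 7)."""
    s = sorted(values)
    h = (len(s) - 1) * p
    lo = int(h)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (h - lo) * (s[hi] - s[lo])


def quantile_means(a, b, n_bins=5):
    """Mean of b within each quantile bin of a, ordered from the
    lowest bin to the highest. a and b are paired, not necessarily sorted."""
    breaks = [quantile(a, i / float(n_bins)) for i in range(n_bins + 1)]
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(a, b):
        # Bins are right-closed, (lo, hi]; searching only the interior
        # breaks puts the minimum of a into the first bin, which mirrors
        # include.lowest = TRUE.
        k = bisect_left(breaks, x, 1, n_bins) - 1
        sums[k] += y
        counts[k] += 1
    # An empty bin (possible when a has many ties) yields None.
    return [s / c if c else None for s, c in zip(sums, counts)]


a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [2, 4, 20, 40, 200, 400, 2000, 4000, 20000, 40000]
print(quantile_means(a, b))  # [3.0, 30.0, 300.0, 3000.0, 30000.0]
```

Because the result is a plain list indexed by bin number, the order always runs from the lowest quantile to the highest, whatever the order of the input pairs. Swapping `mean` for a sum or median would only require collecting the b values per bin instead of accumulating sums and counts.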