ansaurus

Question

making binned boxplot in matplotlib with numpy and scipy in Python

Answer 1

+1 A:

You're getting a 3rd bin because the maximum value in the array falls exactly on the last bin edge, and np.digitize puts anything at or beyond the last edge into an extra bin. (I'm assuming you have a typo there, and max_x should be "max(my_array[:,0])" instead of "max(my_array[:,1])".) You can avoid this by adding 1 (or any positive number) to the last bin edge.
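To see the edge case in isolation, here is a small sketch with made-up values (not the data from the question). It assumes the default right=False behaviour of np.digitize, where bin i covers edges[i-1] <= v < edges[i]:

```python
import numpy as np

values = np.array([0.0, 0.4, 0.5, 1.0])
edges = np.linspace(values.min(), values.max(), 3)  # 3 edges -> 2 intended bins

# The maximum equals the last edge, so it is assigned to an extra bin (index 3).
raw_idx = np.digitize(values, edges)

# Widening the last edge pulls the maximum back into the final intended bin.
padded = edges.copy()
padded[-1] += 1
fixed_idx = np.digitize(values, padded)

print(raw_idx.tolist())    # [1, 1, 2, 3]
print(fixed_idx.tolist())  # [1, 1, 2, 2]
```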

Also, if I'm understanding you correctly, you want to bin one variable by another, so my example below shows that. If you're using recarrays (which are much slower) there are also several functions in matplotlib.mlab (e.g. mlab.rec_groupby, etc) that do this sort of thing.

Anyway, in the end, you might have something like this (to bin x by the values in y, assuming x and y are the same length):

import numpy as np

def bin_by(x, y, nbins=30):
    """
    Bin x by y.
    Returns the binned "x" values and the left edges of the bins.
    """
    bins = np.linspace(y.min(), y.max(), nbins + 1)
    # Widen the last bin so the maximum value does not land in an extra bin
    bins[-1] += 1

    indices = np.digitize(y, bins)

    # Bin numbers returned by np.digitize start at 1
    output = []
    for i in range(1, len(bins)):
        output.append(x[indices == i])

    # Just return the left edges of the bins
    bins = bins[:-1]

    return output, bins

As a quick example:

In [3]: x = np.random.random((100, 2))

In [4]: binned_values, bins = bin_by(x[:,0], x[:,1], 2)

In [5]: binned_values
Out[5]: 
[array([ 0.59649575,  0.07082605,  0.7191498 ,  0.4026375 ,  0.06611863,
        0.01473529,  0.45487203,  0.39942696,  0.02342408,  0.04669615,
        0.58294003,  0.59510434,  0.76255006,  0.76685052,  0.26108928,
        0.7640156 ,  0.01771553,  0.38212975,  0.74417014,  0.38217517,
        0.73909022,  0.21068663,  0.9103707 ,  0.83556636,  0.34277006,
        0.38007865,  0.18697416,  0.64370535,  0.68292336,  0.26142583,
        0.50457354,  0.63071319,  0.87525221,  0.86509534,  0.96382375,
        0.57556343,  0.55860405,  0.36392931,  0.93638048,  0.66889756,
        0.46140831,  0.01675165,  0.15401495,  0.10813141,  0.03876953,
        0.65967335,  0.86803192,  0.94835281,  0.44950182]),
 array([ 0.9249993 ,  0.02682873,  0.89439141,  0.26415792,  0.42771144,
        0.12292614,  0.44790357,  0.64692616,  0.14871052,  0.55611472,
        0.72340179,  0.55335053,  0.07967047,  0.95725514,  0.49737279,
        0.99213794,  0.7604765 ,  0.56719713,  0.77828727,  0.77046566,
        0.15060196,  0.39199123,  0.78904624,  0.59974575,  0.6965413 ,
        0.52664095,  0.28629324,  0.21838664,  0.47305751,  0.3544522 ,
        0.57704906,  0.1023201 ,  0.76861237,  0.88862359,  0.29310836,
        0.22079126,  0.84966201,  0.9376939 ,  0.95449215,  0.10856864,
        0.86655289,  0.57835533,  0.32831162,  0.1673871 ,  0.55742108,
        0.02436965,  0.45261232,  0.31552715,  0.56666458,  0.24757898,
        0.8674747 ])]
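From there, the list of per-bin arrays can go straight to matplotlib's boxplot, which draws one box per array. The following is only a sketch under assumptions: the data are random, the choice of 4 bins is arbitrary, and the Agg backend is selected just so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
data = rng.random((100, 2))
x, y = data[:, 0], data[:, 1]

nbins = 4  # arbitrary choice for illustration
edges = np.linspace(y.min(), y.max(), nbins + 1)
edges[-1] += 1  # same trick as in bin_by: keep max(y) in the last bin
idx = np.digitize(y, edges)
groups = [x[idx == i] for i in range(1, nbins + 1)]

fig, ax = plt.subplots()
ax.boxplot(groups)  # one box per bin, drawn at positions 1..nbins
ax.set_xticks(range(1, nbins + 1))
ax.set_xticklabels([f"{e:.2f}" for e in edges[:-1]])
ax.set_xlabel("left edge of y bin")
ax.set_ylabel("x")
# In an interactive session, finish with plt.show() or fig.savefig(...).
```

Labelling the boxes with the left bin edges mirrors what bin_by returns; bin centres would work just as well.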

Hope that helps a bit!

Joe Kington 2010-04-26 23:56:48

Answer 2

+1 A:

NumPy has a dedicated function for creating histograms the way you need:

histogram(a, bins=10, range=None, density=None, weights=None)

which you can use like:

(hist_data, bin_edges) = histogram(my_array[:,0], weights=my_array[:,1])

The key point here is to use the weights argument: each value a[i] will contribute weights[i] to the histogram. Example:

a = [0, 1]
weights = [10, 2]

describes 10 points at x = 0 and 2 points at x = 1.
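As a quick check of that reading, here is a minimal sketch that runs the two-point example through np.histogram with and without weights. The choice of bins=2 is an assumption made only to keep the output small:

```python
import numpy as np

a = [0, 1]
weights = [10, 2]

# Two bins over [0, 1]: the first holds the value 0, the second holds the value 1.
hist_data, bin_edges = np.histogram(a, bins=2, weights=weights)
print(hist_data.tolist())   # [10, 2]
print(bin_edges.tolist())   # [0.0, 0.5, 1.0]

# Without weights, each value simply adds 1 to its bin.
unweighted, _ = np.histogram(a, bins=2)
print(unweighted.tolist())  # [1, 1]
```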

You can set the number of bins, or the bin limits, with the bins argument (see the official documentation for more details).

The histogram can then be plotted with something like:

bar(bin_edges[:-1], hist_data)
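One caveat: bar() defaults to a width of 0.8 and, in current matplotlib, centres each bar on its x value, so the bars may not line up with the bins. A sketch of one way to align them, using random stand-in data for my_array (an assumption, since the original array isn't shown) and the Agg backend so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
my_array = rng.random((50, 2))  # made-up stand-in for the question's array

hist_data, bin_edges = np.histogram(my_array[:, 0], bins=5, weights=my_array[:, 1])

fig, ax = plt.subplots()
# Start each bar at its bin's left edge and make it exactly one bin wide.
bars = ax.bar(bin_edges[:-1], hist_data, width=np.diff(bin_edges), align="edge")
```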

If you only need to do a histogram plot, the similar hist() function can directly plot the histogram:

hist(my_array[:,0], weights=my_array[:,1])

EOL 2010-04-27 12:06:07

I don't understand why "weights" is used here after reading the docs -- could you please explain? If the point is just to assign elements to bins, I don't see why weights should play a role.

2010-04-30 05:03:26

I edited the answer to explain the role of the weights argument in more detail. If you think that the answer is useful, please thumb it up! :)

EOL 2010-04-30 07:22:36

Actually, np.histogram won't do what he needs, unfortunately. He needs the actual values that fall into each bin in order to make a boxplot for each bin. (Or that was my understanding, anyway; correct me if I'm wrong there!) The weights parameter just changes how much each value contributes to its bin: instead of adding 1 to the count in the bin, it adds weights[i]. That's a different effect than binning one array by the values in another, and regardless, it doesn't return the subset of the array that falls into each bin. (Or maybe I'm completely misunderstanding things?)

Joe Kington 2010-04-30 20:08:34

@Joe: I see what you mean. Whatever the answer, one of our responses should be correct, so they are both useful. :)

EOL 2010-05-03 07:26:32

@EOL - True! :)

Joe Kington 2010-05-03 14:20:53
