In NumPy 1.4.1, what is the simplest or most efficient way to calculate the histogram of a masked array? By default, numpy.histogram and pyplot.hist count the masked elements too!

The only simple solution I can think of right now involves creating a new array with the non-masked values:

histogram(m_arr[~m_arr.mask])

This is not very efficient, though, as this unnecessarily creates a new array. I'd be happy to read about better ideas!

+2  A: 

Try hist(m_arr.compressed()).
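As a minimal sketch of this suggestion (the sample data, the bin count, and the range are made up for illustration), `compressed()` returns the unmasked values as a plain 1-D array, which can then be passed to `numpy.histogram`:

```python
import numpy as np
import numpy.ma as ma

# Illustrative data: 100.0 stands in for a bad value that should be ignored.
data = np.array([1.0, 2.0, 2.0, 3.0, 100.0])
m_arr = ma.masked_greater(data, 50)   # masks every element greater than 50

# compressed() returns a new 1-D ndarray holding only the unmasked values.
valid = m_arr.compressed()

# Histogram of the unmasked values only (bins and range chosen arbitrarily).
counts, edges = np.histogram(valid, bins=3, range=(1.0, 4.0))
print(counts, edges)
```

Note that `compressed()` still allocates a new array, just like `m_arr[~m_arr.mask]`; it is shorter to write, not copy-free.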

tillsten
This is a better idea than my `m_arr[~m_arr.mask]`. However, it does not solve the problem that a new array is unnecessarily created.
EOL
+1  A: 

(Undeleting this as per discussion above...)

I'm not sure whether or not the numpy developers would consider this a bug or expected behavior. I asked on the mailing list, so I guess we'll see what they say.

Either way, it's an easy fix. Patching numpy/lib/function_base.py to use numpy.asanyarray rather than numpy.asarray on the inputs to the function will allow it to properly use masked arrays (or any other subclass of an ndarray) without creating a copy.
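To illustrate the difference between the two conversion functions (this is only a sketch with made-up data, and it does not show how `histogram` itself would behave after such a patch): `numpy.asarray` returns a base `ndarray` and so loses the mask, while `numpy.asanyarray` passes an `ndarray` subclass through unchanged.

```python
import numpy as np
import numpy.ma as ma

# Illustrative masked array: the middle element is masked.
m_arr = ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

plain = np.asarray(m_arr)     # base-class ndarray; the mask is gone
kept = np.asanyarray(m_arr)   # subclass is passed through as-is

print(type(plain).__name__)   # ndarray
print(type(kept).__name__)    # MaskedArray
print(plain)                  # the masked value 2.0 is visible again
```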

Edit: It seems like it is expected behavior. As discussed here:

If you want to ignore masked data, it's just one extra function call:

histogram(m_arr.compressed())

I don't think the fact that this makes an extra copy will be relevant, because I guess full masked array handling inside histogram will be a lot more expensive.

Using asanyarray would also let in matrices and other subtypes that might not be handled correctly by the histogram calculations.

For anything else besides dropping masked observations, it would be necessary to figure out what the masked array definition of a histogram is, as Bruce pointed out.

Joe Kington
Thank you. One of the arguments against handling masked arrays in histograms is that if histograms handled masked values, one would have to decide how masked data with a masked array of weights should be treated. I don't think that there is any obviously better solution to this problem: it looks like `histogram()`'s features do not mix too well with masked input+weight arrays.
EOL
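For what it's worth, here is a sketch of one possible convention for masked values combined with masked weights. This is an assumption made for illustration, not something NumPy defines: an observation is dropped if either its value or its weight is masked. The data, bins, and range are made up.

```python
import numpy as np
import numpy.ma as ma

# Illustrative inputs: one masked value and one masked weight.
values = ma.masked_array([1.0, 2.0, 2.5, 3.5],
                         mask=[False, False, True, False])
weights = ma.masked_array([0.5, 1.0, 2.0, 4.0],
                          mask=[False, True, False, False])

# Assumed convention: keep an observation only if both its value
# and its weight are unmasked.
keep = ~(ma.getmaskarray(values) | ma.getmaskarray(weights))

counts, edges = np.histogram(ma.getdata(values)[keep],
                             bins=3, range=(1.0, 4.0),
                             weights=ma.getdata(weights)[keep])
print(counts)
```

Other conventions are just as defensible (for example, treating a masked weight as 1), which is exactly why there is no obvious default.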