How can I ignore zeros when I take the median on columns of an array?

tumultous_rooster · Feb 26, 2014 · Viewed 10.1k times

I have a simple numpy array.

array([[10,   0,  10,  0],
       [ 1,   1,   0,  0],
       [ 9,   9,   9,  0],
       [ 0,  10,   1,  0]])

I would like to take the median of each column, individually, of this array.

However, there are a few 0 values in various places which I would like to ignore in the calculation of the medians.

To complicate things further, I would like columns that contain only 0 entries to keep a median of 0. Those columns would then serve as placeholders, keeping the dimensions of the matrix the same.

The numpy documentation for numpy.median doesn't list any argument that would do what I want (maybe I am spoiled by the many switches we get with R!):

numpy.median(a, axis=None, out=None, overwrite_input=False)

Can someone please shed some light on an effective way to do this, which is in line with the spirit of numpy? I could hack it out but in that case I feel like I've defeated the purpose of using numpy in the first place.

Thanks in advance.

Answer

CT Zhu · Feb 26, 2014

A masked array is always handy, but slow (y is not defined in the snippet; presumably it is x with its zeros masked):

In [14]:

%timeit np.ma.median(y, axis=0).filled(0)
1000 loops, best of 3: 1.73 ms per loop

In [15]:

%%timeit
ans=np.apply_along_axis(lambda v: np.median(v[v!=0]), 0, x)
ans[np.isnan(ans)]=0.
1000 loops, best of 3: 402 µs per loop

In [16]:

ans=np.apply_along_axis(lambda v: np.median(v[v!=0]), 0, x)
ans[np.isnan(ans)]=0.; ans
Out[16]:
array([ 9.,  9.,  9.,  0.])
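
For reference, here is a self-contained sketch of the same apply_along_axis idea, using the array from the question as x. The helper name nonzero_median is hypothetical (it is not part of numpy), and it returns 0 directly for all-zero columns instead of patching NaN afterwards, which also avoids the warning numpy emits when taking the median of an empty slice.

```python
import numpy as np

# The array from the question.
x = np.array([[10,  0, 10, 0],
              [ 1,  1,  0, 0],
              [ 9,  9,  9, 0],
              [ 0, 10,  1, 0]])

def nonzero_median(col):
    """Median of the nonzero entries of a 1-D array, or 0.0 if there are none.

    Hypothetical helper for illustration; not a numpy function.
    """
    nz = col[col != 0]
    return np.median(nz) if nz.size else 0.0

# axis=0 passes each column of x to nonzero_median in turn.
medians = np.apply_along_axis(nonzero_median, 0, x)
print(medians)  # [9. 9. 9. 0.]
```

Like the lambda version, this still calls a Python function once per column, so it should scale similarly; the only difference is how the empty column is handled.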

Indexing with np.nonzero is slightly faster still:

In [25]:

%%timeit
ans=np.apply_along_axis(lambda v: np.median(v[np.nonzero(v)]), 0, x)
ans[np.isnan(ans)]=0.
1000 loops, best of 3: 384 µs per loop
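
The masked-array timing at the top uses a variable y that the snippet never defines. A minimal sketch of how it could be built, assuming y is simply x with every zero masked out (np.ma.masked_equal is one way to do that; the answer may have constructed it differently):

```python
import numpy as np

# The array from the question.
x = np.array([[10,  0, 10, 0],
              [ 1,  1,  0, 0],
              [ 9,  9,  9, 0],
              [ 0, 10,  1, 0]])

# Assumption: y is x with its zero entries masked.
y = np.ma.masked_equal(x, 0)

# Masked entries are ignored by np.ma.median; a column that is entirely
# masked yields a masked result, which filled(0) turns into 0.
col_medians = np.ma.median(y, axis=0).filled(0)
print(col_medians)
```

This should give the same per-column result as the apply_along_axis versions, with no explicit Python-level function per column, at the cost of the masked-array overhead shown in the timing.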