I'm looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X. np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.
I tried np.nan in X, but that seems not to work because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?
(To those who would ask "how gigantic": I can't tell. This is input validation for library code.)
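To make the failure of the membership test concrete, here is a minimal sketch. The three-element array is made up for illustration; the point is only that `in` on a NumPy array relies on `==`, and NaN never compares equal to itself.

```python
import numpy as np

# Small illustrative array (an assumption; any float array with a NaN behaves the same).
X = np.array([1.0, np.nan, 3.0])

# NaN is never equal to itself, so equality-based membership cannot find it.
nan_equals_itself = np.nan == np.nan
found_by_membership = np.nan in X

# The straightforward check works, but allocates a boolean array of shape X.shape.
found_by_isnan = bool(np.isnan(X).any())

print(nan_equals_itself, found_by_membership, found_by_isnan)
```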
Ray's solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min:
In [13]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop
In [14]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop
Unlike min, sum doesn't require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why sum is faster.
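For input validation, the sum-based check could be wrapped roughly as follows. This is a sketch, not a definitive implementation: `has_nan` is a hypothetical helper name, and it assumes a floating-point array. One caveat worth stating: the sum is also NaN when the array contains both +inf and -inf, so a positive result means "NaN or mixed infinities".

```python
import numpy as np

def has_nan(x):
    """Hypothetical helper: True if the sum of float array x is NaN."""
    # NaN propagates through addition, so one NaN anywhere makes the sum NaN.
    # Only a scalar is produced; no boolean array of shape x.shape is built.
    # Caveat: +inf and -inf in the same array also sum to NaN.
    with np.errstate(invalid="ignore", over="ignore"):
        return bool(np.isnan(np.sum(x)))

clean = np.arange(5.0)
dirty = clean.copy()
dirty[2] = np.nan

print(has_nan(clean), has_nan(dirty))
```

If infinities are legitimate input, a positive result from this check could be confirmed with a slower exact test before rejecting the array.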
Edit: The above test was performed with a single NaN right in the middle of the array.
It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum's throughput seems constant regardless of whether there are NaNs and where they're located:
In [40]: x = np.random.rand(100000)
In [41]: %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop
In [42]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
In [43]: x[50000] = np.nan
In [44]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop
In [45]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop
In [46]: x[0] = np.nan
In [47]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop
In [48]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
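To repeat the comparison outside IPython, something like the following sketch with the standard `timeit` module should do. The helper names `time_checks` and `report` are made up here, and the absolute numbers will depend on the machine and the NumPy version, so they may not match the figures above.

```python
import timeit
import numpy as np

def time_checks(x, number=200):
    """Return average seconds per call for the min-based and sum-based checks."""
    t_min = timeit.timeit(lambda: np.isnan(np.min(x)), number=number) / number
    t_sum = timeit.timeit(lambda: np.isnan(np.sum(x)), number=number) / number
    return t_min, t_sum

def report(label, x):
    """Print both timings in microseconds for one scenario."""
    t_min, t_sum = time_checks(x)
    print(f"{label}: min {t_min * 1e6:.1f} us, sum {t_sum * 1e6:.1f} us")

# Same sequence as the session above: no NaN, then one in the middle,
# then an additional one at the start.
x = np.random.rand(100000)
report("no NaN", x)
x[50000] = np.nan
report("NaN in the middle", x)
x[0] = np.nan
report("NaN also at the start", x)
```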