How do I find the values in my numpy array that are NaN/infinity/too large for dtype('float64')?

sometimesiwritecode · Mar 16, 2019 · Viewed 9.1k times

I am trying to fit a simple machine learning model using scikit-learn. Upon this line:

clf.fit(features, labels)

I get a familiar error:

 Input contains NaN, infinity or a value too large for dtype('float64').

Whenever I have encountered this before, it has been because there were NaN values in my data. This time I have confirmed there are no NaNs in the data. The two inputs to the .fit() method (features and labels) are np arrays, but they are produced from a pandas dataframe. Right before pulling the values out of the dataframes I printed:

print(features_df[features_df.isnull().any(axis=1)])
print(labels_df[labels_df.isnull().any(axis=1)])

This printed empty dataframes, so I know there is no row with a NaN value in it. I also checked the numpy arrays for NaN values after the conversion, and even summed them successfully with the np sum() method, so there are no NaN values in the features or labels np arrays passed into fit.

This means there must be infinity values or really large values, both of which I find hard to believe. Is there some way I can print any values in the dataframe or np array that:

are NaN, infinity or a value too large for dtype('float64')?

I need to have them specifically pointed out to me as I can't find them with my eyes and there are no NaN values.

Answer

fountainhead · Mar 16, 2019

Assuming this is the numpy array, with shape (3,3):

import numpy as np

ar = np.array([1, 2, 3, 4, np.nan, 5, np.nan, 6, np.inf]).reshape((3, 3))
print(ar)
[[ 1.  2.  3.]
 [ 4. nan  5.]
 [nan  6. inf]]

To check for NaN, positive infinity, negative infinity, or different combinations of those, we can use:

np.isnan(ar)     # True wherever nan
np.isposinf(ar)  # True wherever pos-inf
np.isneginf(ar)  # True wherever neg-inf
np.isinf(ar)     # True wherever pos-inf or neg-inf
~np.isfinite(ar) # True wherever pos-inf or neg-inf or nan

respectively. Each of these returns a boolean array with the same shape as ar. Passing that boolean array to np.where() gives a tuple of index arrays, one per dimension of ar (so two here):

ar_nan = np.where(np.isnan(ar))
print(ar_nan)

(array([1, 2], dtype=int64), array([1, 0], dtype=int64)) # i.e. NaNs at (1,1) and (2,0)

and

ar_inf = np.where(np.isinf(ar))
print(ar_inf)

(array([2], dtype=int64), array([2], dtype=int64)) # i.e. inf at (2,2)
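To have every offending value pointed out in one pass, the checks above can be combined. The following is a minimal sketch, not a definitive recipe: it reuses the example array ar, and the names bad_mask and bad_positions are made up for illustration. It uses ~np.isfinite() to catch NaN and both infinities together, and np.argwhere() to get one (row, column) pair per hit instead of the per-dimension arrays that np.where() returns.

```python
import numpy as np

ar = np.array([1, 2, 3, 4, np.nan, 5, np.nan, 6, np.inf]).reshape((3, 3))

# Boolean mask: True wherever the value is NaN, +inf, or -inf
bad_mask = ~np.isfinite(ar)

# One (row, col) pair per offending entry, in row-major order
bad_positions = np.argwhere(bad_mask)

# Print each offending position together with the value found there
for row, col in bad_positions:
    print(f"ar[{row}, {col}] = {ar[row, col]}")
```

For the example array this reports the positions (1,1), (2,0) and (2,2). If bad_positions comes back empty for the real features and labels arrays, they contain no NaN or infinity as float64 values.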

Also, to see the limits of float64:

np.finfo(np.float64)

finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
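If the arrays turn out to hold neither NaN nor infinity, one possibility worth ruling out is a value that is finite in float64 but overflows once it is cast to a narrower type. This is offered as an assumption, not a diagnosis of the question's data: some scikit-learn estimators are known to convert their input to float32 internally, and anything beyond the float32 range becomes inf after such a cast. The sketch below uses a hypothetical features array and a made-up name too_large_mask to show how such values could be located.

```python
import numpy as np

# Hypothetical data: every value is finite and fits in float64
features = np.array([[1.0, 2.0], [1e40, 3.0], [4.0, -5e39]])

# Largest finite float32 value (roughly 3.4e38), as a plain Python float
f32_max = float(np.finfo(np.float32).max)

# True where the magnitude would overflow to inf after a cast to float32
too_large_mask = np.abs(features) > f32_max

print(np.argwhere(too_large_mask))  # (row, col) of each oversized value
print(features[too_large_mask])     # the oversized values themselves
```

Here np.isfinite(features).all() is True, yet two entries exceed the float32 limit, which is consistent with a situation where no NaN or infinity is visible in the data but the error is still raised.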