I couldn't find a function that computes a matrix of correlation coefficients for arrays containing observations of more than two variables when there are NaNs in the data. There are functions that do this for pairs of variables (or you can just mask the arrays using ~np.isnan()). But looping over a large number of variables and computing the correlation for each pair with these functions can be very time-consuming.
So I tried it on my own and soon realized that the difficulty lies in the proper normalization of the covariance. I would be very interested in your opinions on how to do it.
Here is the code:
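import numpy as np

def nancorr(X, nanfact=False):
    # Each row of X is a variable; demean each row by its NaN-aware mean.
    X = X - np.nanmean(X, axis=1, keepdims=True)
    if nanfact:
        # Pairwise factor: total observations minus those where both variables are NaN, minus 1.
        mask = np.isnan(X).astype(int)
        fact = X.shape[1] - np.dot(mask, mask.T) - 1
    X[np.isnan(X)] = 0  # NaN entries contribute nothing to the sums below
    if nanfact:
        cov = np.dot(X, X.T) / fact
    else:
        cov = np.dot(X, X.T)
    d = np.diag(cov)
    return cov / np.sqrt(np.multiply.outer(d, d))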
The function assumes that each row is a variable. It is basically adapted from numpy's corrcoef(). I believe there are three ways of doing this:
(1) For each pair of variables, you take only those observations for which neither variable is NaN. This is arguably the most accurate option, but also the most difficult one to program if you want to do the computation for more than one pair simultaneously, and it is not covered by the code above. But why throw away information on the mean and variance of each variable just because the corresponding entry of another variable is NaN? Hence the two other options.
(2) We demean each variable by its nanmean, and the variance of each variable is its nanvariance. For the covariance, each observation where one or the other variable is NaN, but not both, counts as an observation of no covariation and is therefore set to zero. The factor of the covariance is then 1/n, where n = (number of observations where not both variables are NaN) - 1. The two variances in the denominator of the correlation coefficient are normalized by their corresponding number of non-NaN observations minus 1, denoted by n1 and n2 respectively. This is achieved by setting nanfact=True in the function above.
(3) One may wish for the covariance and the variances to have the same factor, as is the case for the correlation coefficient without NaNs. The only meaningful way to do this here (if option (1) is not feasible) is to simply ignore the factor (1/n)/sqrt(1/(n1*n2)). Since this number is at most one, the estimated correlation coefficients will be at least as large (in absolute value) as in (2), but will remain between -1 and 1. This is achieved by setting nanfact=False.
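The relation between options (2) and (3) can be written out explicitly. The symbols z, S and N below are notation introduced here for illustration only (N is the total number of observations per variable); n, n1 and n2 are as defined in (2), and the formulas are meant to mirror what the function above computes:

```latex
\begin{align*}
z_{it} &=
  \begin{cases}
    x_{it} - \bar{x}_i & \text{if } x_{it} \text{ is observed} \\
    0 & \text{if } x_{it} \text{ is NaN}
  \end{cases},
\qquad
S_{ij} = \sum_{t=1}^{N} z_{it}\, z_{jt} \\[4pt]
r^{(3)}_{ij} &= \frac{S_{ij}}{\sqrt{S_{ii}\, S_{jj}}} \\[4pt]
r^{(2)}_{ij} &= \frac{S_{ij}/n}{\sqrt{(S_{ii}/n_1)\,(S_{jj}/n_2)}}
             = \frac{\sqrt{n_1 n_2}}{n}\; r^{(3)}_{ij}
\end{align*}
```

Because n is at least as large as both n1 and n2, the factor sqrt(n1*n2)/n is at most one, so (2) shrinks the coefficients of (3) toward zero; and by the Cauchy-Schwarz inequality the coefficients of (3) cannot leave the interval from -1 to 1.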
I'd be very interested in your opinions on approaches (2) and (3), and I would especially like to see a solution to (1) that does not use loops.
I think the method you are looking for is corr() from pandas, which excludes NaN values pair by pair. You can also refer to this question: "How to efficiently get the correlation matrix (with p-values) of a data frame with NaN values?" For example, take the following dataframe:
import pandas as pd
df = pd.DataFrame({'A': [2, None, 1, -4, None, None, 3],
                   'B': [None, 1, None, None, 1, 3, None],
                   'C': [2, 1, None, 2, 2.1, 1, 0],
                   'D': [-2, 1.1, 3.2, 2, None, 1, None]})
df
     A    B    C    D
0    2  NaN    2   -2
1  NaN    1    1  1.1
2    1  NaN  NaN  3.2
3   -4  NaN    2    2
4  NaN    1  2.1  NaN
5  NaN    3    1    1
6    3  NaN    0  NaN
rho = df.corr()
rho
A B C D A 1.000000 NaN -0.609994 -0.441784 B NaN 1.0 -0.500000 -1.000000 C -0.609994 -0.5 1.000000 -0.347928 D 0.041204 -1.0 -0.347928 1.000000