What is the best way, given a pandas DataFrame df, to get the correlation between its columns df.1 and df.2?
I want rows with NaN to be excluded, which pandas' built-in correlation already does. But I also want a p-value or a standard error, which the built-in does not provide.
SciPy seems to get tripped up by the NaNs, though I believe it does report significance.
Data example:
     1    2
0    2  NaN
1  NaN    1
2    1    2
3   -4    3
4  1.3    1
5  NaN  NaN
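For just the two columns in the question, one option is to drop the rows where either value is missing and pass what remains to scipy.stats.pearsonr, which returns both the coefficient and a two-sided p-value. The following is a minimal sketch built on the example data above; the helper name corr_with_pvalue is made up for illustration and is not part of pandas or SciPy.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def corr_with_pvalue(df, x, y):
    """Return (Pearson r, p-value, rows used) for columns x and y.

    Hypothetical helper: rows where either column is NaN are dropped first.
    """
    pair = df[[x, y]].dropna()
    r, p = pearsonr(pair[x], pair[y])
    return r, p, len(pair)

# The data example from the question (column labels are the integers 1 and 2)
df = pd.DataFrame({
    1: [2, np.nan, 1, -4, 1.3, np.nan],
    2: [np.nan, 1, 2, 3, 1, np.nan],
})

r, p, n = corr_with_pvalue(df, 1, 2)
print(f"r={r:.4f}, p={p:.4f}, n={n}")
```

Only the rows where both columns are present are used here, so the coefficient should agree with df[1].corr(df[2]). With so few complete rows, the p-value is of limited practical use.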
To calculate all the p-values at once, you can use the calculate_pvalues function (code below):
df = pd.DataFrame({'A':[1,2,3], 'B':[2,5,3], 'C':[5,2,1], 'D':['text',2,3] })
calculate_pvalues(df)
The output is similar to that of corr(), but with p-values:
        A       B       C
A       0  0.7877  0.1789
B  0.7877       0  0.6088
C  0.1789  0.6088       0
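Details: non-numeric columns (such as D) are ignored, and p-values are rounded to 4 decimals. You can also pass a subset of columns:
calculate_pvalues(df[['A','B','C']])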
import pandas as pd
from scipy.stats import pearsonr

def calculate_pvalues(df):
    # Keep numeric columns only, then drop every row with a NaN in any of them
    df = df.select_dtypes(include='number').dropna()
    pvalues = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
    for r in df.columns:
        for c in df.columns:
            # pearsonr returns (correlation, p-value); keep the p-value
            pvalues.loc[r, c] = round(pearsonr(df[r], df[c])[1], 4)
    return pvalues
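Note that calculate_pvalues calls dropna() on the whole frame, so a NaN in any column removes that row from every pair. If you would rather mirror the pairwise NaN handling of corr(), a variant along these lines might work. This is a sketch under assumptions: the name calculate_pvalues_pairwise and the three-row minimum are choices made for this example, not anything defined by pandas or SciPy.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def calculate_pvalues_pairwise(df, min_rows=3):
    """P-value matrix where each pair uses only the rows present in both columns.

    Hypothetical variant; min_rows is an arbitrary cutoff for this sketch.
    """
    num = df.select_dtypes(include="number")
    pvalues = pd.DataFrame(np.nan, index=num.columns, columns=num.columns)
    for r in num.columns:
        for c in num.columns:
            # Rows where both columns have a value
            mask = num[r].notna() & num[c].notna()
            if mask.sum() >= min_rows:
                pvalues.loc[r, c] = pearsonr(num.loc[mask, r], num.loc[mask, c])[1]
            # Otherwise the cell stays NaN: too few shared rows
    return pvalues

df = pd.DataFrame({
    'A': [1, 2, 3, np.nan],
    'B': [2, 5, 3, 4],
    'C': [5, np.nan, 1, 2],
    'D': ['text', 2, 3, 4],   # non-numeric, ignored
})
pvals = calculate_pvalues_pairwise(df)
print(pvals.round(4))
```

Because each cell may be based on a different number of rows, the p-values in one matrix are not directly comparable; reporting the per-pair row count alongside them is worth considering.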