It's a useful and common practice to append predicted values and residuals from running a regression onto a dataframe as distinct columns. I'm new to pandas, and I'm having trouble performing this very simple operation. I know I'm missing something obvious. There was a very similar question asked about a year-and-a-half ago, but it wasn't really answered.
The dataframe currently looks something like this:
y x1 x2
880.37 3.17 23
716.20 4.76 26
974.79 4.17 73
322.80 8.70 72
1054.25 11.45 16
And all I'm wanting is to return a dataframe that has the predicted value and residual from y = x1 + x2 for each observation:
y x1 x2 y_hat res
880.37 3.17 23 840.27 40.10
716.20 4.76 26 752.60 -36.40
974.79 4.17 73 877.49 97.30
322.80 8.70 72 348.50 -25.70
1054.25 11.45 16 815.15 239.10
I've tried resolving this using statsmodels and pandas and haven't been able to solve it. Thanks in advance!
Here is a variation on Alexander's answer using the OLS model from statsmodels instead of the pandas ols model. We can use either the formula or the array/DataFrame interface to the models.
fittedvalues
and resid
are pandas Series with the correct index.
predict
does not return a pandas Series.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
'x2': [23, 26, 73, 72, 16],
'y': [880.37, 716.20, 974.79, 322.80, 1054.25]},
index=np.arange(10, 20, 2))
result = smf.ols('y ~ x1 + x2', df).fit()
df['yhat'] = result.fittedvalues
df['resid'] = result.resid
result2 = sm.OLS(df['y'], sm.add_constant(df[['x1', 'x2']])).fit()
df['yhat2'] = result2.fittedvalues
df['resid2'] = result2.resid
# predict doesn't return pandas series and no index is available
df['predicted'] = result.predict(df)
print(df)
x1 x2 y yhat resid yhat2 resid2 \
10 3.17 23 880.37 923.949309 -43.579309 923.949309 -43.579309
12 4.76 26 716.20 890.732201 -174.532201 890.732201 -174.532201
14 4.17 73 974.79 656.155079 318.634921 656.155079 318.634921
16 8.70 72 322.80 610.510952 -287.710952 610.510952 -287.710952
18 11.45 16 1054.25 867.062458 187.187542 867.062458 187.187542
predicted
10 923.949309
12 890.732201
14 656.155079
16 610.510952
18 867.062458
As preview, there is an extended prediction method in the model results in statsmodels master (0.7), but the API is not yet settled:
>>> print(result.get_prediction().summary_frame())
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower \
10 923.949309 268.931939 -233.171432 2081.070051 -991.466820
12 890.732201 211.945165 -21.194241 1802.658643 -887.328646
14 656.155079 269.136102 -501.844105 1814.154263 -1259.791854
16 610.510952 282.182030 -603.620329 1824.642233 -1339.874985
18 867.062458 329.017262 -548.584564 2282.709481 -1214.750941
obs_ci_upper
10 2839.365439
12 2668.793048
14 2572.102012
16 2560.896890
18 2948.875858