Python ARIMA exogenous variable out of sample

Jim Crozier picture Jim Crozier · Jul 30, 2014 · Viewed 14.7k times · Source

I am trying to predict a time series in python statsmodels ARIMA package with the inclusion of an exogenous variable, but cannot figure out the correct way to insert the exogenous variable in the predict step. See here for docs.

import numpy as np
from scipy import stats
import pandas as pd

import statsmodels.api as sm

vals = np.random.rand(13)
ts = pd.TimeSeries(vals)
df = pd.DataFrame(ts, columns=["test"])
df.index = pd.Index(pd.date_range("2011/01/01", periods = len(vals), freq = 'Q'))

fit1 = sm.tsa.ARIMA(df, (1,0,0)).fit()
#this works fine:
pred1 = fit1.predict(start=12, end = 16)
print(pred1)

Out[32]: 
2014-03-31    0.589121
2014-06-30    0.747575
2014-09-30    0.631322
2014-12-31    0.654858
2015-03-31    0.650093
Freq: Q-DEC, dtype: float64

now add in a trend exogenous variable

exogx = np.array(range(1,14))
#to make this easy, let's look at the ols of the trend (arima(0,0,0))
fit2 = sm.tsa.ARIMA(df, (0,0,0),exog = exogx).fit()
print(fit2.params)

const    0.555226
x1       0.013132
dtype: float64

print(fit2.fittedvalues)

2011-03-31    0.568358
2011-06-30    0.581490
2011-09-30    0.594622
2011-12-31    0.607754
2012-03-31    0.620886
2012-06-30    0.634018
2012-09-30    0.647150
2012-12-31    0.660282
2013-03-31    0.673414
2013-06-30    0.686546
2013-09-30    0.699678
2013-12-31    0.712810
2014-03-31    0.725942
Freq: Q-DEC, dtype: float64

Notice, as we would expect, this is a trend line, increasing 0.013132 with each increase tick in time (of course this is random data, so if you run it the values will be different, but the positive or negative trend story will be the same). So, the next value (for time = 14) should be 0.555226 + 0.013132*14 = 0.739074.

#out of sample exog should be (14,15,16)
pred2 = fit2.predict(start = 12, end = 16, exog = np.array(range(13,17)))
print(pred2)
2014-03-31    0.725942
2014-06-30    0.568358
2014-09-30    0.581490
2014-12-31    0.594622
2015-03-31    0.765338
Freq: Q-DEC, dtype: float64

So, 2014-03-31 predicts (the last insample) correctly, but 2014-06-30 starts back at the beginning (t = 1), but notice 2015-03-31 (actually, always the last observation of the forecast, regardless of horizon) picks up t = 16 (that is, (value - intercept)/beta = (0.765338 - 0.555226)/0.013132).

To make this more clear, notice what happens when I inflate the values of of the x mat

fit2.predict(start = 12, end = 16, exog = np.array(range(13,17))*10000)
Out[41]: 
2014-03-31       0.725942
2014-06-30       0.568358
2014-09-30       0.581490
2014-12-31       0.594622
2015-03-31    2101.680532
Freq: Q-DEC, dtype: float64

See that 2015-03-31 exploded, but none of the other xmat values were considered? What am I doing wrong here???

I have tried playing around with every way that I know how for passing in the exog variable (changing dimension, making the exog a matrix, making the exog as long as input plus the horizon, etc, etc, etc). Any suggestions would be really appreciated.

I am using 2.7 from Anaconda2.1 numpy 1.8.1 scipy 0.14.0 pandas 0.14.0 statsmodels 0.5.0

and have verified the issue on windows 7 64 bit, and centos 64 bit.

Also, a few things. I am using ARIMA for the ARIMA functionality and the above is just for illustration (that is, I cannot "just use OLS...", as I imagine will be suggested). I also cannot "just use R" due to the restrictions of the project (and more generally, the lack of support of R in base Spark).

Here are the interesting parts of the code all together in case you want to try it yourself

import numpy as np
from scipy import stats
import pandas as pd
import statsmodels.api as sm

vals = np.random.rand(13)
ts = pd.TimeSeries(vals)
df = pd.DataFrame(ts, columns=["test"])
df.index = pd.Index(pd.date_range("2011/01/01", periods = len(vals), freq = 'Q'))

exogx = np.array(range(1,14))
fit2 = sm.tsa.ARIMA(df, (0,0,0),exog = exogx).fit()
print(fit2.fittedvalues)
pred2 = fit2.predict(start = 12, end = 16, exog = np.array(range(13,17))*10000)
print(pred2)

Answer

jseabold picture jseabold · Jul 31, 2014

This is probably better posted on the github issue tracker. I filed a ticket though.

It's best to file a ticket there, if not I might forget it. Quite busy these days.

There was a bug in the logic for the special case of k_ar == 0. Should be fixed. Let me know if you can/cannot give that patch a spin. If not, I can do some more rigorous testing and merge it.

Statsmodels on top of spark? I'm intrigued.