Adding lagged variables to an lm model?

Tags: r, lm
Hugh Perkins · Oct 27, 2012

I'm using lm on a time series, which works quite well actually, and it's super super fast.

Let's say my model is:

> formula <- y ~ x

I train this on a training set:

> train <- data.frame( x = seq(1,3), y = c(2,1,4) )
> model <- lm( formula, train )

... and I can make predictions for new data:

> test <- data.frame( x = seq(4,6) )
> test$y <- predict( model, newdata = test )
> test
  x        y
1 4 4.333333
2 5 5.333333
3 6 6.333333

This works super nicely, and it's really speedy.

I want to add lagged variables to the model. Now, I could do this by augmenting my original training set:

> # shift y down one row; the first row has no predecessor, so it is padded with 0
> train$y_1 <- c(0, head(train$y, -1))
> train
  x y y_1
1 1 2   0
2 2 1   2
3 3 4   1

update the formula:

> formula <- y ~ x * y_1

... and training will work just fine:

> model <- lm( formula, train )
> # no errors here

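However, the problem is that 'predict' can't be used directly on a test set: y_1 for each test row is the previous row's y, which isn't known until that row has itself been predicted. So there is no way of populating y_1 for the whole test set in a batch manner.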
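To make the difficulty concrete, here is a minimal sketch of the row-by-row workaround: predict one step, feed that prediction back in as the next row's lag, and repeat. The data are hypothetical (longer than the three-row example above so the fit is not degenerate), the formula is simplified to `y ~ x + y_1`, and the first lag is padded with `NA` rather than 0; all of these are assumptions made for illustration.

```r
# Hypothetical training data, not the three-row example from the question
train <- data.frame(x = 1:6, y = c(2, 1, 4, 3, 6, 5))
# First lag of y; the first row has no predecessor, so it gets NA
train$y_1 <- c(NA, head(train$y, -1))

# lm drops the row with the NA lag by default (na.action = na.omit)
model <- lm(y ~ x + y_1, data = train)

# Recursive one-step-ahead forecasts: each prediction becomes the next lag
test <- data.frame(x = 7:9, y = NA_real_, y_1 = NA_real_)
prev_y <- tail(train$y, 1)   # the last observed y seeds the first lag
for (i in seq_len(nrow(test))) {
  test$y_1[i] <- prev_y
  test$y[i]   <- predict(model, newdata = test[i, , drop = FALSE])
  prev_y      <- test$y[i]
}
test
```

The loop works, but it is exactly the per-row bookkeeping that a lag term inside the formula would ideally avoid.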

Now, many other transformations can be expressed very conveniently in the formula itself, such as poly(x,2) and so on, and these work directly using the unmodified training and test data.
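For comparison, a small sketch of the in-formula case that already works. The data are hypothetical. `poly(x, 2)` only needs each row's own `x`, so `predict` can rebuild the term from the raw test data in one call, whereas a lag needs values from other rows.

```r
# Hypothetical data; the transformation lives entirely in the formula
train <- data.frame(x = 1:5, y = c(2, 1, 4, 3, 6))
model <- lm(y ~ poly(x, 2), data = train)

# The test set needs only the raw x column
test <- data.frame(x = 6:8)
test$y <- predict(model, newdata = test)
test
```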

So, I'm wondering if there is some way of expressing lagged variables in the formula, so that predict can be used? Ideally:

formula <- y ~ x * lag(y,-1)
model <- lm( formula, train )
test$y <- predict( model, newdata = test )

... without having to augment (not sure if that's the right word) the training and test datasets, and just being able to use predict directly?

Answer

Dirk Eddelbuettel · Oct 27, 2012

Have a look at e.g. the dynlm package, which gives you lag operators inside the formula. More generally, the CRAN Task Views on Econometrics and Time Series will have lots more for you to look at.

Here is the beginning of its examples -- a one-month and a twelve-month lag:

R>      library("dynlm")
R>      data("UKDriverDeaths", package = "datasets")
R>      uk <- log10(UKDriverDeaths)
R>      dfm <- dynlm(uk ~ L(uk, 1) + L(uk, 12))
R>      dfm

Time series regression with "ts" data:
Start = 1970(1), End = 1984(12)

Call:
dynlm(formula = uk ~ L(uk, 1) + L(uk, 12))

Coefficients:
(Intercept)     L(uk, 1)    L(uk, 12)  
      0.183        0.431        0.511  

R>
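For readers who want to see what the lag terms amount to, here is a rough base-R sketch that builds the lagged columns by hand with `embed()` and fits them with plain `lm`. The column names `uk_1` and `uk_12` are made up for illustration. Assuming `L(uk, k)` is an ordinary k-period lag, the coefficients should be comparable to the dynlm output above, but that is worth checking rather than taking on trust.

```r
data("UKDriverDeaths", package = "datasets")
uk <- log10(UKDriverDeaths)

# embed(x, 13): column 1 is the current value, column k is the value k - 1 steps back
lags <- embed(as.numeric(uk), 13)
d <- data.frame(uk    = lags[, 1],
                uk_1  = lags[, 2],    # one-month lag
                uk_12 = lags[, 13])   # twelve-month lag

# The first 12 months are lost because they have no twelve-month lag
fit <- lm(uk ~ uk_1 + uk_12, data = d)
coef(fit)
```

This only covers fitting. Forecasting past the end of the series still needs each new value fed back in as a lag, one step at a time.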