Time-series - data splitting and model evaluation

Jot eN · Jul 15, 2014

I'm trying to use machine learning to make predictions from time-series data. One Stack Overflow question ("createTimeSlices function in CARET package in R") gives an example of time-slice cross-validation, as in createTimeSlices, for model training and parameter tuning:

    library(caret)
    library(ggplot2)
    library(pls)
    data(economics)
    myTimeControl <- trainControl(method = "timeslice",
                                  initialWindow = 36,
                                  horizon = 12,
                                  fixedWindow = TRUE)

    plsFitTime <- train(unemploy ~ pce + pop + psavert,
                        data = economics,
                        method = "pls",
                        preProc = c("center", "scale"),
                        trControl = myTimeControl)

My understanding is:

  1. I need to split my data into a training set and a test set.
  2. Use the training set for parameter tuning.
  3. Evaluate the resulting model on the test set (using R2, RMSE, etc.)

Because my data is a time series, I suppose I cannot use bootstrapping to split it into training and test sets. So my questions are: Am I right? And if so, how do I use createTimeSlices for model evaluation?

Answer

Shambho · Aug 2, 2014

Note that the code in your question already takes care of the time slicing, so you don't have to create the time slices by hand.

However, here is how to use createTimeSlices to split the data and then use the slices to train and test a model.

Step 0: Loading the packages and the data (from your question):

    library(caret)
    library(ggplot2)  # provides the economics data set
    library(pls)

    data(economics)

Step 1: Creating the timeSlices for the index of the data:

    timeSlices <- createTimeSlices(1:nrow(economics),
                                   initialWindow = 36, horizon = 12,
                                   fixedWindow = TRUE)

This creates a list of training and testing timeSlices.

    > str(timeSlices, max.level = 1)
    ## List of 2
    ## $ train:List of 431
    ##   .. [list output truncated]
    ## $ test :List of 431
    ##   .. [list output truncated]
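To see what these slices actually contain, here is a small sketch on a toy index of 10 points. The sizes (initialWindow = 5, horizon = 2) are made up for illustration and are not from your data. It also contrasts fixedWindow = TRUE (the training window slides) with fixedWindow = FALSE (the training window grows from the start).

```r
library(caret)

# Toy index 1..10; the window sizes here are illustrative assumptions
toy <- createTimeSlices(1:10, initialWindow = 5, horizon = 2,
                        fixedWindow = TRUE)

length(toy$train)   # number of train/test pairs
toy$train[[1]]      # first training window: indices 1..5
toy$test[[1]]       # the 2 points right after it: indices 6, 7
toy$train[[4]]      # last training window has slid forward: indices 4..8

# With fixedWindow = FALSE every training window starts at index 1
growing <- createTimeSlices(1:10, initialWindow = 5, horizon = 2,
                            fixedWindow = FALSE)
growing$train[[4]]  # indices 1..8
```

In every pair the test indices come strictly after the training indices, which is the property that makes this usable for time series.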

For ease of understanding, I am saving them in separate variables:

    trainSlices <- timeSlices[[1]]
    testSlices <- timeSlices[[2]]

Step 2: Training on the first of the trainSlices:

    plsFitTime <- train(unemploy ~ pce + pop + psavert,
                        data = economics[trainSlices[[1]], ],
                        method = "pls",
                        preProc = c("center", "scale"))

Step 3: Testing on the first of the testSlices:

    pred <- predict(plsFitTime, economics[testSlices[[1]], ])

Step 4: Plotting:

    true <- economics$unemploy[testSlices[[1]]]

    plot(true, col = "red", ylab = "true (red) , pred (blue)",
         ylim = range(c(pred, true)))
    points(pred, col = "blue")

You can then do this for all the slices:

    for (i in seq_along(trainSlices)) {
      # Fit on the i-th training window
      plsFitTime <- train(unemploy ~ pce + pop + psavert,
                          data = economics[trainSlices[[i]], ],
                          method = "pls",
                          preProc = c("center", "scale"))
      # Predict the i-th test window
      pred <- predict(plsFitTime, economics[testSlices[[i]], ])

      true <- economics$unemploy[testSlices[[i]]]
      plot(true, col = "red", ylab = "true (red) , pred (blue)",
           main = i, ylim = range(c(pred, true)))
      points(pred, col = "blue")
    }
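If you would rather have a number per slice than a plot, a variant of the same loop can collect the RMSE of each test window. This is only a sketch under two assumptions of mine that are not in your question: skip = 11 keeps every 12th slice so the test windows do not overlap, and ncomp is fixed at 1 with inner resampling switched off (trainControl(method = "none")) to keep the loop quick.

```r
library(caret)
library(ggplot2)  # provides the economics data set
library(pls)

data(economics)
econ <- as.data.frame(economics)

# Assumption: skip = 11 gives non-overlapping 12-point test windows
slices <- createTimeSlices(seq_len(nrow(econ)),
                           initialWindow = 36, horizon = 12,
                           fixedWindow = TRUE, skip = 11)

# Assumption: no tuning inside the loop, ncomp fixed at 1
noResampling <- trainControl(method = "none")

sliceRMSE <- vapply(seq_along(slices$train), function(i) {
  fit <- train(unemploy ~ pce + pop + psavert,
               data = econ[slices$train[[i]], ],
               method = "pls",
               preProc = c("center", "scale"),
               tuneGrid = data.frame(ncomp = 1),
               trControl = noResampling)
  pred <- predict(fit, econ[slices$test[[i]], ])
  # RMSE of this slice's predictions against the observed values
  RMSE(pred, econ$unemploy[slices$test[[i]]])
}, numeric(1))

summary(sliceRMSE)
```

The spread of sliceRMSE across slices gives a rough idea of how stable the forecast error is over time.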

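As mentioned earlier, the code in your question does this sort of time slicing in one step: train() builds the slices internally from the trainControl settings, uses them to choose ncomp, and then refits the final model on all the rows it was given.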

    > myTimeControl <- trainControl(method = "timeslice",
    +                               initialWindow = 36,
    +                               horizon = 12,
    +                               fixedWindow = TRUE)
    >
    > plsFitTime <- train(unemploy ~ pce + pop + psavert,
    +                     data = economics,
    +                     method = "pls",
    +                     preProc = c("center", "scale"),
    +                     trControl = myTimeControl)
    > plsFitTime
    Partial Least Squares

    478 samples
      5 predictors

    Pre-processing: centered, scaled
    Resampling: Rolling Forecasting Origin Resampling (12 held-out with a fixed window)

    Summary of sample sizes: 36, 36, 36, 36, 36, 36, ...

    Resampling results across tuning parameters:

      ncomp  RMSE  Rsquared  RMSE SD  Rsquared SD
      1      1080  0.443     796      0.297
      2      1090  0.43      845      0.295

    RMSE was used to select the optimal model using  the smallest value.
    The final value used for the model was ncomp = 1.
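To connect this to the three steps in your question, one possible sketch is to hold out the most recent part of the series as a test set, tune on the earlier part with the time-slice control, and score the tuned model once on the held-out rows. The 80/20 split point is my assumption, not something from your question; pick whatever horizon matters for your problem.

```r
library(caret)
library(ggplot2)  # provides the economics data set
library(pls)

data(economics)
econ <- as.data.frame(economics)  # rows are already in date order

# Assumption: first 80% of rows for tuning, most recent 20% held out
nTrain   <- floor(0.8 * nrow(econ))
trainSet <- econ[seq_len(nTrain), ]
testSet  <- econ[(nTrain + 1):nrow(econ), ]

myTimeControl <- trainControl(method = "timeslice",
                              initialWindow = 36,
                              horizon = 12,
                              fixedWindow = TRUE)

# Tuning sees only the training rows
plsFitTime <- train(unemploy ~ pce + pop + psavert,
                    data = trainSet,
                    method = "pls",
                    preProc = c("center", "scale"),
                    trControl = myTimeControl)

# Evaluate once on the held-out, later rows
testPred <- predict(plsFitTime, newdata = testSet)
testPerf <- postResample(pred = testPred, obs = testSet$unemploy)
testPerf
```

postResample returns RMSE and R-squared for the held-out period, which is the kind of test-set evaluation you describe in step 3. Because the split is by position in time rather than random, no future observation leaks into tuning.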

Hope this helps!!