How to implement walk-forward testing in sklearn?

PhilChang · Aug 11, 2015 · Viewed 8.3k times

In sklearn, GridSearchCV can take a pipeline as a parameter and find the best estimator through cross-validation. However, the usual cross-validation splits the data like this: [image]

To cross-validate time series data, the training and testing data are often split like this: [image]

That is to say, the testing data should always come after the training data in time.

My thoughts are:

  1. Write my own version of the k-fold class and pass it to GridSearchCV so I can keep the convenience of the pipeline. The problem is that it seems difficult to make GridSearchCV use specified training and testing indices.

  2. Write a new class, GridSearchWalkForwardTest, similar to GridSearchCV. I am studying the source code in grid_search.py and finding it a little complicated.
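For what it's worth, option 1 may be easier than it looks: the `cv` argument of GridSearchCV also accepts an iterable of (train indices, test indices) pairs. Below is a minimal sketch of that idea, assuming a scikit-learn version that provides `sklearn.model_selection`. The helper `walk_forward_splits`, the toy data, and the Ridge pipeline are made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def walk_forward_splits(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    Hypothetical helper: the samples are cut into n_folds + 1 contiguous
    blocks; fold k (1-based) trains on the first k blocks and tests on
    the block that follows them.
    """
    blocks = np.array_split(np.arange(n_samples), n_folds + 1)
    for k in range(1, n_folds + 1):
        yield np.concatenate(blocks[:k]), blocks[k]


# Toy data, assumed to be already sorted by time.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

# Materialise the splits so they can be inspected and reused.
splits = list(walk_forward_splits(len(X), n_folds=4))

search = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0, 10.0]}, cv=splits)
search.fit(X, y)
print(search.best_params_)
```

Every test block starts after the last training index, so no fold is scored on data older than what it was trained on.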

Any suggestions are welcome.

Answer

Matthijs Brouns · Apr 10, 2017

I think you could use TimeSeriesSplit, either instead of your own implementation or as a basis for implementing a CV method that does exactly what you describe.
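As a rough sketch of how that could look, TimeSeriesSplit can be passed straight to GridSearchCV as the `cv` argument, so the pipeline workflow stays the same. The toy data, the Ridge model, and the parameter grid below are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, assumed to be already sorted by time.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

tscv = TimeSeriesSplit(n_splits=5)

# Inspect the folds: the training window grows, the test block follows it.
fold_sizes = []
for train_idx, test_idx in tscv.split(X):
    fold_sizes.append((len(train_idx), len(test_idx)))
    print("train up to", train_idx[-1], "| test", test_idx[0], "to", test_idx[-1])

pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
search = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0, 10.0]}, cv=tscv)
search.fit(X, y)
print(search.best_params_)
```

Note that TimeSeriesSplit goes by row order, so the rows need to be in chronological order before fitting.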

After digging around a bit, it seems someone added a max_train_size parameter to TimeSeriesSplit in this PR, which seems to do what you want.
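A small sketch of the effect, assuming a scikit-learn release that already includes `max_train_size`: capping the training size turns the expanding window into a sliding one. The array of 12 samples is just toy data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered samples (toy data).
X = np.arange(12).reshape(-1, 1)

# Cap each training window at four samples so old data drops out.
tscv = TimeSeriesSplit(n_splits=4, max_train_size=4)

windows = []
for train_idx, test_idx in tscv.split(X):
    windows.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx, "test:", test_idx)
```

Without `max_train_size`, the same splitter would keep every earlier sample in each training set.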