Splitting data into train and test datasets based on time

dhruv bhardwaj · Jun 15, 2018 · Viewed 30.2k times

I know that train_test_split splits it randomly, but I need to know how to split it based on time.

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
  # this splits the data randomly into 67% train and 33% test

How can I split the same dataset based on time, as 67% train and 33% test? The dataset has a column TimeStamp.

I searched through similar questions but was not sure about the approach.

Can someone explain briefly?

Answer

zetadaro · Jun 28, 2019

One easy way to do it:

First, sort the data by time.

Second, split it at the 67% mark:

import numpy as np
# data must already be sorted by time; the cut index is 67% of the row count
train_set, test_set = np.split(data, [int(.67 * len(data))])

This puts the first 67% of the data in train_set and the remaining 33% in test_set.
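
Putting the sort and the split together, here is a minimal sketch that assumes the data is a pandas DataFrame with the TimeStamp column from the question. The helper name time_based_split, the toy rows, and the y column are hypothetical and only there for illustration. It uses iloc slicing rather than np.split, which should give the same chronological cut.

```python
import pandas as pd

def time_based_split(data, time_col="TimeStamp", train_frac=0.67):
    """Sort by time, then cut at the train_frac mark (no shuffling)."""
    ordered = data.sort_values(time_col).reset_index(drop=True)
    cut = int(train_frac * len(ordered))  # index of the first test row
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Hypothetical toy data, deliberately out of chronological order
data = pd.DataFrame({
    "TimeStamp": pd.to_datetime([
        "2018-01-05", "2018-01-01", "2018-01-03",
        "2018-01-02", "2018-01-06", "2018-01-04",
    ]),
    "y": [5, 1, 3, 2, 6, 4],
})

train_set, test_set = time_based_split(data)
print(len(train_set), len(test_set))
```

Every row in train_set is then earlier than every row in test_set, so the model is never evaluated on data older than what it was trained on. If the data is already sorted by time, train_test_split(X, y, test_size=0.33, shuffle=False) should produce a similar ordered cut without a custom helper.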