train_test_split() method of scikit-learn

NoobySavage · Sep 2, 2019 · Viewed 10.6k times

I am trying to create a machine learning model using DecisionTreeClassifier. To split my data into training and test sets, I imported the train_test_split function from scikit-learn, but I cannot understand one of its arguments, random_state.

What is the significance of assigning a numeric value to random_state in model_selection.train_test_split, and how do I know which value to assign for my decision tree?

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)

Answer

desertnaut · Sep 2, 2019

As the docs mention, random_state initializes the random number generator used in train_test_split (and likewise in other methods). Because there are many different ways to split a dataset, fixing random_state ensures that you can call the method several times with the same dataset (e.g. in a series of experiments) and always get the same result (here, the exact same train and test sets), i.e. it is there for reproducibility. Its exact value is not important and is not something you have to worry about.
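A minimal sketch of that reproducibility claim, assuming the same toy data as the docs example; the seed value 0 is an arbitrary choice for illustration, not a recommended setting:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), list(range(5))

# Two independent calls with the same seed (the value 0 is arbitrary).
split_a = train_test_split(X, y, test_size=0.33, random_state=0)
split_b = train_test_split(X, y, test_size=0.33, random_state=0)

# Each result is [X_train, X_test, y_train, y_test]; compare piece by piece.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True: same seed and same data give the same split
```

Any other integer would work equally well here; what matters is that it stays the same between runs.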

Using the example in the docs, setting random_state=42 ensures that you get the exact same result shown there (the code below was actually run on my machine, not copy-pasted from the docs):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

X_train
# array([[4, 5],
#        [0, 1],
#        [6, 7]])

y_train
# [2, 0, 3]

X_test
# array([[2, 3],
#        [8, 9]])

y_test
# [1, 4]

You should experiment with different values for random_state (or without specifying it at all) in the above snippet to get a feel for how it behaves.
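One possible way to run that experiment is sketched below. The helper name held_out_labels and the particular seeds are hypothetical choices for illustration; only train_test_split and its arguments come from the snippet above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), list(range(5))

def held_out_labels(seed):
    """Return the labels that land in the test set for a given seed."""
    _, _, _, y_test = train_test_split(X, y, test_size=0.33, random_state=seed)
    return y_test

# Fixed seeds: each value selects one particular split, repeatable on every run.
for seed in (0, 1, 42):
    print(seed, held_out_labels(seed))

# random_state=None (the default): the split may change from run to run.
print(None, held_out_labels(None))
```

Running the script a few times should show the lines for the fixed seeds staying the same, while the last line may vary between runs.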