How to find the best degree for polynomial regression?

Billy Chow · Nov 22, 2017 · Viewed 9.3k times

I'm new to Machine Learning and currently stuck on this. First I used linear regression to fit the training set, but got a very large RMSE. Then I tried polynomial regression to reduce the bias.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)

poly_predict = poly_reg.predict(X_poly)
poly_mse = mean_squared_error(y, poly_predict)  # compare predictions against y, not X
poly_rmse = np.sqrt(poly_mse)
poly_rmse

This gave a slightly better result than linear regression, so I kept increasing the degree to 3, 4, and 5, and the result kept improving. But I suspect the model starts overfitting as the degree increases.

The best polynomial degree should be the one that gives the lowest RMSE on the cross-validation set, but I have no idea how to achieve that. Should I use GridSearchCV, or some other method?
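To show what I mean by "lowest RMSE on the cross-validation set", here is a rough sketch using cross_val_score (just a sketch; X and y are my training data from above, and the number of folds is arbitrary):

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rmse_by_degree = {}
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False), LinearRegression())
    # scoring returns negative MSE, so flip the sign before taking the square root
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    rmse_by_degree[degree] = np.sqrt(-scores.mean())

best_degree = min(rmse_by_degree, key=rmse_by_degree.get)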

I would much appreciate it if you could help me with this.

Answer

PythonNoob · Feb 4, 2018

In my opinion, the best way to find the optimal curve-fitting degree, or more generally the best-fitting model, is to use GridSearchCV from the scikit-learn library.

Here is an example of how to use it.

First, let us define a function to generate noisy sample data:

def make_data(N, err=1.0, rseed=1):
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 1. / (X.ravel() + 0.3)
    if err > 0:
        y += err * rng.randn(N)
    return X, y

Build a pipeline:

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))

Create the data and a vector (X_test) for testing and visualisation purposes:

X, y = make_data(200)
X_test = np.linspace(-0.1, 1.1, 200)[:, None]

Define the GridSearchCV parameters:

param_grid = {'polynomialfeatures__degree': np.arange(20),
              'linearregression__fit_intercept': [True, False],
              'linearregression__normalize': [True, False]}
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X, y)

Get the best estimator found by the grid search:

model = grid.best_estimator_
model

Pipeline(memory=None,
     steps=[('polynomialfeatures', PolynomialFeatures(degree=4, include_bias=True, interaction_only=False)), ('linearregression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])
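You can also inspect the selected hyperparameters and the corresponding cross-validated score directly. best_score_ is the mean score across the folds (R² by default for a regressor); the values in the comment below are simply the ones matching the pipeline printed above:

grid.best_params_
# {'linearregression__fit_intercept': True,
#  'linearregression__normalize': False,
#  'polynomialfeatures__degree': 4}
grid.best_score_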

Fit the model on the X and y data and use the test vector to predict values:

y_test = model.fit(X, y).predict(X_test)

Visualize the result:

plt.scatter(X, y)
plt.plot(X_test.ravel(), y_test, 'r')

(Figure: the best-fit curve plotted over the scattered data)

The full code snippet:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV

def make_data(N, err=1.0, rseed=1):
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 1. / (X.ravel() + 0.3)
    if err > 0:
        y += err * rng.randn(N)
    return X, y

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))


X, y = make_data(200)
X_test = np.linspace(-0.1, 1.1, 200)[:, None]

param_grid = {'polynomialfeatures__degree': np.arange(20),
              'linearregression__fit_intercept': [True, False],
              'linearregression__normalize': [True, False]}
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X, y)

model = grid.best_estimator_

y_test = model.fit(X, y).predict(X_test)

plt.scatter(X, y)
plt.plot(X_test.ravel(), y_test, 'r')
plt.show()
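One caveat, since the question is about RMSE: GridSearchCV scores regressors with R² by default, so the grid above ranks degrees by R² rather than RMSE. If you want the search to pick the degree with the lowest cross-validated RMSE, pass an explicit scoring argument (available in recent scikit-learn versions), for example:

grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7,
                    scoring='neg_root_mean_squared_error')
grid.fit(X, y)
print(-grid.best_score_)  # cross-validated RMSE of the best configuration

Also note that LinearRegression's normalize option has been removed in newer scikit-learn releases, so on a current install you may need to drop the 'linearregression__normalize' entry from param_grid.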