difference between LinearRegression and svm.SVR(kernel="linear")

Dev_Man picture Dev_Man · Oct 27, 2017 · Viewed 9.8k times · Source

First there are questions on this forum very similar to this one but trust me none matches so no duplicating please.

I have encountered two methods of linear regression using scikit's sklearn and I am failing to understand the difference between the two, especially where in first code there's a method train_test_split() called while in the other one directly fit method is called.

I am studying with multiple resources and this single issue is very confusing to me.

First which uses SVR

X = np.array(df.drop(['label'], 1))

X = preprocessing.scale(X)

y = np.array(df['label'])

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)

clf = svm.SVR(kernel='linear')

clf.fit(X_train, y_train)

confidence = clf.score(X_test, y_test)

And second is this one

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

So my main focus is the difference between using svr(kernel="linear") and using LinearRegression()

Answer

Tushar Gupta picture Tushar Gupta · Oct 27, 2017

cross_validation.train_test_split : Splits arrays or matrices into random train and test subsets.

In second code, splitting is not random.

svm.SVR: The Support Vector Regression (SVR) uses the same principles as the SVM for classification, with only a few minor differences. First of all, because output is a real number it becomes very difficult to predict the information at hand, which has infinite possibilities. In the case of regression, a margin of tolerance (epsilon) is set in approximation to the SVM which would have already requested from the problem. But besides this fact, there is also a more complicated reason, the algorithm is more complicated therefore to be taken in consideration. However, the main idea is always the same: to minimize error, individualizing the hyperplane which maximizes the margin, keeping in mind that part of the error is tolerated.

Linear Regression: In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression.

Reference: https://cs.adelaide.edu.au/~chhshen/teaching/ML_SVR.pdf