sklearn LinearRegression.predict() issue

Mellvinbaker · Mar 31, 2015 · Viewed 11.2k times

I am trying to predict call volume for a call center based on various other factors. My dataset is fairly clean and fairly small, but large enough. I can train and test on historical data and get a score, summary, etc. What I cannot figure out, for the life of me, is how to then get the model to predict future calls using forecasted factor data. My data is below:

Date       DayNum  factor1  factor2   factor3  factor4  factor5  factor6   factor7  factor8    factor9  VariableToPredict
9/17/2014  1       592      83686.46  0        0        250      15911.8   832      99598.26   177514   72
9/18/2014  2       1044     79030.09  0        0        203      23880.55  1238     102910.64  205064   274
9/19/2014  3       707      84207.27  0        0        180      8143.32   877      92350.59   156360   254
9/20/2014  4       707      97577.78  0        0        194      16688.95  891      114266.73  196526   208
9/21/2014  5       565      83084.57  0        0        153      13097.04  713      96181.61   143678   270

The code I have so far is below:

from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
import pandas as pd

d = pd.read_csv("H://My Documents//Python Scripts//RawData//Q2917.csv", "r", delimiter=",")
e = pd.read_csv("H://My Documents//Python Scripts//RawData//FY16q2917Test.csv", "r", delimiter=",")
#print(d)
#b = pd.DataFrame.as_matrix(d)
#print(b)
x = d.as_matrix(['factor2', 'factor4', 'factor5', 'factor6'])    
y = d.as_matrix(['VariableToPredict'])
x1 = e.as_matrix(['factor2', 'factor4', 'factor5', 'factor6'])
y1 = e.as_matrix(['VariableToPredict'])
#print(len(train))
#print(target)
#use scaler
scalerX = StandardScaler()
train = scalerX.fit_transform(x1)
scalerY = StandardScaler()
target = scalerY.fit_transform(y1)

clf = LinearRegression(fit_intercept=True)
cv = KFold(len(train), 10, shuffle=True, random_state=33)


#decf = LinearRegression.decision_function(train, target)
test = LinearRegression.predict(train, target)
score = cross_val_score(clf,train, target,cv=cv )

print("Score: {}".format(score.mean()))

This of course gives me an error that there are nulls in the y values, which there are: the column is blank because it is what I am trying to predict. The problem is that I am new enough to Python that I am fundamentally misunderstanding how this should be built. Even if it worked this way, it wouldn't be correct, because it doesn't take the past data into account when building the model to predict the future. Do I need to have both datasets in the same file? If so, how do I tell it to consider these 3 columns from row a to row b, predict the dependent column for the same rows, then apply that model to those three columns in the future data to predict the future calls? I don't expect the whole answer here, since this is my job to do, but any small clues would be greatly appreciated.

Answer

StackG · Apr 25, 2015

In order to build a regression model, you need training data and training scores (the known target values for that data). These allow you to fit a set of regression parameters to the problem.

Then to predict, you need prediction data, but NOT prediction scores, because you don't have these - you're trying to predict them!

The code below, for example, will run:

from sklearn.linear_model import LinearRegression
import numpy as np

# Training features (one row per sample) and the known target for each row
trainingData = np.array([ [2.3,4.3,2.5], [1.3,5.2,5.2], [3.3,2.9,0.8], [3.1,4.3,4.0]  ])
trainingScores = np.array([3.4,7.5,4.5,1.6])

# Fit the model on the training data
clf = LinearRegression(fit_intercept=True)
clf.fit(trainingData,trainingScores)

# Predict from features only; no targets are passed to predict()
predictionData = np.array([ [2.5,2.4,2.7], [2.7,3.2,1.2] ])
print(clf.predict(predictionData))

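It looks as though you're putting the wrong number of arguments into your predict() call: predict() takes only the feature data, and it has to be called on a model instance that has already been fitted with fit(), not on the LinearRegression class itself. Have a look at my snippet here and you should be able to work out how to change it.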
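Applied to the setup in the question, the same fit-then-predict pattern might look roughly like the sketch below. It is only an illustration under some assumptions: the historical rows are the five sample rows from the question typed in by hand instead of read from the CSV, the `future` rows are made-up forecast values, and names such as `feature_cols`, `history` and `future` are hypothetical. The scaler is fitted on the historical features only and then reused on the future features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

feature_cols = ["factor2", "factor4", "factor5", "factor6"]

# Historical rows copied from the sample in the question (target is known).
history = pd.DataFrame({
    "factor2": [83686.46, 79030.09, 84207.27, 97577.78, 83084.57],
    "factor4": [0, 0, 0, 0, 0],
    "factor5": [250, 203, 180, 194, 153],
    "factor6": [15911.8, 23880.55, 8143.32, 16688.95, 13097.04],
    "VariableToPredict": [72, 274, 254, 208, 270],
})

# Hypothetical forecast rows: made-up factor values and no target column.
future = pd.DataFrame({
    "factor2": [85000.0, 90000.0],
    "factor4": [0, 0],
    "factor5": [200, 210],
    "factor6": [15000.0, 18000.0],
})

X_train = history[feature_cols].to_numpy()
y_train = history["VariableToPredict"].to_numpy()

# Learn the scaling from the historical features only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Fit the regression on the historical data.
model = LinearRegression(fit_intercept=True)
model.fit(X_train_scaled, y_train)

# Reuse the already-fitted scaler (transform, not fit_transform) on future rows,
# then predict from features alone.
X_future_scaled = scaler.transform(future[feature_cols].to_numpy())
predicted_calls = model.predict(X_future_scaled)
print(predicted_calls)
```

The two datasets do not need to live in the same file; what matters is that the future rows carry the same feature columns, in the same order, as the rows used for fitting. Leaving the target unscaled keeps the predictions in the original units (calls).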

Just for interest, you can run the following line afterwards to get access to the parameters that the regression fits to the data: print(repr(clf.coef_))
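As a small illustration of what those parameters mean, the sketch below refits the toy data from the snippet above and checks that a prediction is just the features multiplied by `coef_` plus `intercept_`. The variable names (`training_data`, `manual` and so on) are only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data as in the snippet above.
training_data = np.array([[2.3, 4.3, 2.5], [1.3, 5.2, 5.2],
                          [3.3, 2.9, 0.8], [3.1, 4.3, 4.0]])
training_scores = np.array([3.4, 7.5, 4.5, 1.6])

clf = LinearRegression(fit_intercept=True).fit(training_data, training_scores)

print(repr(clf.coef_))       # one coefficient per feature column
print(repr(clf.intercept_))  # the fitted constant term

# A linear model's prediction is X @ coef_ + intercept_.
prediction_data = np.array([[2.5, 2.4, 2.7], [2.7, 3.2, 1.2]])
manual = prediction_data @ clf.coef_ + clf.intercept_
print(np.allclose(manual, clf.predict(prediction_data)))
```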