I tried this but couldn't get it to work for my data: Use Scikit Learn to do linear regression on a time series pandas data frame
My data consists of 2 DataFrames: DataFrame_1.shape = (40,5000) and DataFrame_2.shape = (40,74). I'm trying to do some type of linear regression, but DataFrame_2 contains NaN missing data values. When I run DataFrame_2.dropna(how="any"), the shape drops to (2,74).
Is there any linear regression algorithm in sklearn that can handle NaN values?
I'm modeling it after load_boston from sklearn.datasets, where X,y = boston.data, boston.target = (506,13),(506,)
Here's my simplified code:
    from sklearn.linear_model import LinearRegression

    X = DataFrame_1
    for col in DataFrame_2.columns:
        y = DataFrame_2[col]
        model = LinearRegression()
        model.fit(X,y)
        #ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I used the format above so that the shapes of the matrices match up.
If posting DataFrame_2 would help, please comment below and I'll add it.
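You can fill in the null values in y with imputation. In scikit-learn this is done with SimpleImputer (the older sklearn.preprocessing.Imputer was removed in version 0.22):
    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer()  # default strategy replaces NaN with the column mean
    # SimpleImputer expects 2-D input, so turn the Series y into a one-column frame
    y_imputed = imputer.fit_transform(y.to_frame()).ravel()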
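As a rough end-to-end sketch of the imputation approach applied to the loop from the question: the data below is synthetic and much smaller than the real frames, and the names targets and models are made up for illustration. Mean imputation is only one possible strategy and may not suit every dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for DataFrame_1 and DataFrame_2 (assumed shapes, not the real data).
X = rng.normal(size=(40, 10))
targets = pd.DataFrame(rng.normal(size=(40, 3)), columns=["a", "b", "c"])
targets.iloc[::7, 0] = np.nan    # introduce some missing values in column "a"
targets.iloc[3::9, 2] = np.nan   # and in column "c"

imputer = SimpleImputer(strategy="mean")
models = {}
for col in targets.columns:
    # SimpleImputer expects 2-D input, so pass a one-column frame and flatten the result.
    y_imputed = imputer.fit_transform(targets[[col]]).ravel()
    models[col] = LinearRegression().fit(X, y_imputed)
```

Each entry in models is a fitted LinearRegression for one target column, so models["a"].predict(X) gives predictions for that column.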
Otherwise, you might want to build your model using a subset of the 74 columns as predictors. Perhaps some of your columns contain fewer null values?
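One way to explore that alternative is to count the missing values per column and keep only the columns below some threshold. The sketch below assumes the 74 columns are used one at a time as the target, as in the question's loop. The 0.25 threshold, the synthetic data, and the names targets, usable, and models are illustrative assumptions. Rows where the current target is missing are dropped for that fit only, rather than across all columns at once.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic stand-ins for DataFrame_1 and DataFrame_2 (assumed shapes, not the real data).
X = pd.DataFrame(rng.normal(size=(40, 10)))
targets = pd.DataFrame(rng.normal(size=(40, 4)), columns=list("abcd"))
targets.iloc[:30, 1] = np.nan    # column "b" is mostly missing
targets.iloc[::10, 2] = np.nan   # column "c" has a few gaps

# Fraction of missing values per column; keep columns at or below the threshold.
null_fraction = targets.isna().mean()
usable = null_fraction[null_fraction <= 0.25].index

models = {}
for col in usable:
    y = targets[col]
    observed = y.notna()  # rows where this particular target is present
    models[col] = LinearRegression().fit(
        X.loc[observed].to_numpy(), y.loc[observed].to_numpy()
    )
```

Unlike dropna(how="any"), which discards a row if any of the columns is missing, the per-column mask keeps every row that is usable for the column being fitted.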