Description
linear_model.LinearRegression seems to fit sparse matrices not as well as regular numpy arrays. I noticed a significant difference on a private dataset of mine, and there is still a small difference in the mean squared error of the linear regression example. Especially on such a small dataset (422 samples x 1 feature), I believe the coefficients and intercept should be exactly the same.
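To rule out a lossy conversion, here is a quick sanity check (just a sketch, not part of the reproduction below) showing that the CSR conversion reproduces the dense array exactly, so the fitted parameters should not depend on the input format:
import numpy as np
import scipy.sparse
from sklearn import datasets
# The round trip dense -> CSR -> dense should be exact for this data
diabetes = datasets.load_diabetes()
X = diabetes.data[:, np.newaxis, 2]
print(np.array_equal(scipy.sparse.csr_matrix(X).toarray(), X))  # I would expect True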
Steps/Code to Reproduce
Original example:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
Example modified with a sparse matrix:
import matplotlib.pyplot as plt
import numpy as np
import scipy
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression(fit_intercept=True)
# Train the model using the training sets
regr.fit(scipy.sparse.csr_matrix(diabetes_X_train), diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(scipy.sparse.csr_matrix(diabetes_X_test))
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
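To make the discrepancy explicit, the two fits can also be compared directly in one script (a rough sketch reusing the variables defined above, not part of the original example):
# Fit the same training data densely and sparsely and compare the learned parameters
regr_dense = linear_model.LinearRegression().fit(diabetes_X_train, diabetes_y_train)
regr_sparse = linear_model.LinearRegression().fit(
    scipy.sparse.csr_matrix(diabetes_X_train), diabetes_y_train)
print("coef_ close:", np.allclose(regr_dense.coef_, regr_sparse.coef_))
print("intercept_ close:", np.isclose(regr_dense.intercept_, regr_sparse.intercept_))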
Expected Results
The original example prints a mean squared error of 2548.07, so I would expect the example with a sparse matrix to have the same error.
Actual Results
The modified example instead has an MSE of 2563.78. Note that the difference between the two is higher on a higher-dimensional dataset of mine.
I tried the same code using linear_model.Ridge instead of a linear regression. In that case, the MSE of the Ridge model is lower on the sparse matrix than on the regular numpy array. It's really weird.
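For completeness, the Ridge comparison I mean is roughly this (a sketch with the default alpha, reusing the train/test split from above):
from sklearn.linear_model import Ridge
# Fit Ridge on the dense array and on the CSR version of the same training data
ridge_dense = Ridge().fit(diabetes_X_train, diabetes_y_train)
ridge_sparse = Ridge().fit(scipy.sparse.csr_matrix(diabetes_X_train), diabetes_y_train)
# Compare the test MSE of the two fits; in my runs the sparse fit comes out lower
mse_dense = mean_squared_error(diabetes_y_test, ridge_dense.predict(diabetes_X_test))
mse_sparse = mean_squared_error(
    diabetes_y_test, ridge_sparse.predict(scipy.sparse.csr_matrix(diabetes_X_test)))
print("Ridge MSE (dense): %.2f" % mse_dense)
print("Ridge MSE (sparse): %.2f" % mse_sparse)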
Versions
System:
python: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0]
executable: /home/bminixhofer/miniconda3/bin/python
machine: Linux-4.15.0-46-generic-x86_64-with-debian-buster-sid
BLAS:
macros:
lib_dirs:
cblas_libs: cblas
Python deps:
pip: 18.0
setuptools: 40.6.3
sklearn: 0.20.1
numpy: 1.16.1
scipy: 1.1.0
Cython: 0.29.4
pandas: 0.23.4