Linear Regression performs worse on sparse matrix #13460

@bminixhofer

Description

linear_model.LinearRegression does not seem to fit sparse matrices as well as regular numpy arrays. I noticed a significant difference on a private dataset of mine, but there is also a small difference in the mean squared error of the linear regression example. Especially on such a small dataset (422 samples x 1 feature), I believe the coefficients and intercept should be exactly the same.

Steps/Code to Reproduce

Original example:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))

Example modified with a sparse matrix:

import matplotlib.pyplot as plt
import numpy as np
import scipy.sparse
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression(fit_intercept=True)

# Train the model using the training sets
regr.fit(scipy.sparse.csr_matrix(diabetes_X_train), diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(scipy.sparse.csr_matrix(diabetes_X_test))

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))

Expected Results

The original example prints a mean squared error of 2548.07, so I would expect the example with a sparse matrix to report the same error.
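
For completeness, this expectation can be checked directly by fitting the same data twice and comparing the fitted parameters. This is only a minimal sketch; the regr_dense / regr_sparse names are introduced just for the comparison:

import numpy as np
import scipy.sparse
from sklearn import datasets, linear_model

# Same single-feature training split as above
diabetes = datasets.load_diabetes()
diabetes_X_train = diabetes.data[:-20, np.newaxis, 2]
diabetes_y_train = diabetes.target[:-20]

# Fit identical models on dense and sparse versions of the same data
regr_dense = linear_model.LinearRegression().fit(diabetes_X_train, diabetes_y_train)
regr_sparse = linear_model.LinearRegression().fit(
    scipy.sparse.csr_matrix(diabetes_X_train), diabetes_y_train)

# I would expect both of these to print True (up to floating point noise)
print(np.allclose(regr_dense.coef_, regr_sparse.coef_))
print(np.allclose(regr_dense.intercept_, regr_sparse.intercept_))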

Actual Results

The modified example instead reports an MSE of 2563.78. Note that the difference between the two is larger on a higher-dimensional dataset of mine.

I also tried the same code with linear_model.Ridge instead of LinearRegression. In that case, the MSE of the Ridge model is lower on the sparse matrix than on the regular numpy array, which is surprising.
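
For reference, the Ridge comparison looks roughly like this (a minimal sketch that reuses the same split as above; alpha is left at its default):

import numpy as np
import scipy.sparse
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error

diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]
X_train, X_test = diabetes_X[:-20], diabetes_X[-20:]
y_train, y_test = diabetes.target[:-20], diabetes.target[-20:]

for name, convert in [("dense", lambda a: a),
                      ("sparse", scipy.sparse.csr_matrix)]:
    ridge = linear_model.Ridge()
    ridge.fit(convert(X_train), y_train)
    pred = ridge.predict(convert(X_test))
    # On my machine the sparse MSE comes out lower than the dense one,
    # the opposite of what happens with LinearRegression
    print(name, mean_squared_error(y_test, pred))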

Versions

System:
python: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0]
executable: /home/bminixhofer/miniconda3/bin/python
machine: Linux-4.15.0-46-generic-x86_64-with-debian-buster-sid

BLAS:
macros:
lib_dirs:
cblas_libs: cblas

Python deps:
pip: 18.0
setuptools: 40.6.3
sklearn: 0.20.1
numpy: 1.16.1
scipy: 1.1.0
Cython: 0.29.4
pandas: 0.23.4
