LinearRegression with zero sample_weights is not the same as excluding those rows #26164
Interestingly, I cannot reproduce:
I can reproduce it on my Linux machine through:

Note that my Mac has an ARM processor.
Could you try different random seeds on your Mac? I detected this bug while working on #15554 and then running
Indeed, this is an issue with the random seed. In both configs, it fails 20% of the time.
@lorentzenchr Does it mean that the
I don't know. If so, we should report it to scipy.
I confirm it also works on my M1 Mac (similar versions for the dependencies and openblas). @lorentzenchr can you reproduce the problem with

Can you also please try to change the LAPACK driver used by
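For reference, the driver can be chosen directly via the `lapack_driver` argument of `scipy.linalg.lstsq`. A minimal sketch comparing the three available drivers on an arbitrary full-rank problem (the data here is made up for illustration, not taken from the report):

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
A = rng.rand(10, 5)  # full-rank, well-conditioned toy problem
b = rng.rand(10)

# Compare the three LAPACK drivers supported by scipy.linalg.lstsq.
# On a well-conditioned problem they should agree closely; the bug
# discussed here only shows up in (near-)rank-deficient cases.
for driver in ("gelsd", "gelsy", "gelss"):
    coef, *_ = lstsq(A, b, lapack_driver=driver)
    print(driver, coef)
```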
If you set the weights of 4 samples to 0 instead of 5, or alternatively increase the
I edited the script in the summary to show that it fails 20% of the time across 100 seeds.
I confirm that I also reproduce 20% failures with the new script. I also confirm that using
I will try to change the LAPACK driver and maybe play with the
With the
By setting

But I am not sure how to set a good value for
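For context on that knob: the `cond` argument of `scipy.linalg.lstsq` sets the relative cutoff below which singular values are treated as zero, which controls the reported effective rank. A hedged sketch on a synthetic near-rank-deficient matrix (values chosen only for illustration):

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
# Make the last column a near-copy of the first, so the smallest
# singular value is tiny but nonzero.
A = rng.rand(20, 4)
A = np.c_[A, A[:, 0] + 1e-10 * rng.rand(20)]
b = rng.rand(20)

# With the default cond (machine-precision based), all 5 singular
# values are kept, so the effective rank is 5.
coef_default, _, rank_default, _ = lstsq(A, b)

# cond=1e-6 discards singular values below 1e-6 * sigma_max; the
# effective rank drops to 4 and lstsq returns the minimum-norm
# solution of the truncated problem instead.
coef_trunc, _, rank_trunc, _ = lstsq(A, b, cond=1e-6)
print(rank_default, rank_trunc)
```

The difficulty mentioned above is real: any fixed `cond` draws an arbitrary line between "numerically zero" and "small but meaningful" singular values.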
My take at this issue: it's the minimum norm issue with

```python
import numpy as np
from scipy.linalg import lstsq
import scipy.sparse as sparse
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(2)  # seed dependent, np.allclose should be False
n_samples, n_features = 10, 5
X = rng.rand(n_samples, n_features)
y = rng.rand(n_samples)
reg = LinearRegression()
sample_weight = rng.uniform(low=0.01, high=2, size=X.shape[0])
sample_weight_0 = sample_weight.copy()
sample_weight_0[-5:] = 0
y[-5:] *= 1000  # to make excluding those samples important

reg.fit(X, y, sample_weight=sample_weight_0)
coef_0 = np.r_[reg.coef_.copy(), reg.intercept_]

reg.fit(X[:-5], y[:-5], sample_weight=sample_weight[:-5])
coef_1 = np.r_[reg.coef_.copy(), reg.intercept_]

# For fun, we also use the lsqr solver by converting to sparse X.
reg.fit(sparse.csr_array(X), y, sample_weight=sample_weight_0)
coef_2 = np.r_[reg.coef_.copy(), reg.intercept_]  # same as coef_1

np.allclose(coef_0, coef_1, rtol=1e-6), np.allclose(coef_1, coef_2, rtol=1e-6)
```

Should be

```python
# We solve with plain lstsq and intercept term.
X_with_intercept = np.c_[X, np.ones(10)]
coef_lstsq = lstsq(
    np.sqrt(sample_weight_0)[:, None] * X_with_intercept,
    np.sqrt(sample_weight_0) * y,
)[0]

# Norm of residues
[
    np.linalg.norm(sample_weight_0 * (X_with_intercept @ coef_0 - y)),
    np.linalg.norm(sample_weight_0 * (X_with_intercept @ coef_1 - y)),
    np.linalg.norm(sample_weight_0 * (X_with_intercept @ coef_lstsq - y)),
]
```

All around 1e-16, so all are valid solutions.

```python
# Norm of coefficients
[
    np.linalg.norm(coef_0),
    np.linalg.norm(coef_1),
    np.linalg.norm(coef_lstsq),
]
```
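The minimum-norm behavior can also be seen in isolation: when the (weighted) system is rank-deficient, infinitely many coefficient vectors achieve the same residual, and `scipy.linalg.lstsq` returns the one with the smallest norm. A small self-contained illustration on an arbitrary underdetermined system (the data is made up):

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
# Underdetermined system: 3 equations, 5 unknowns, so there are
# infinitely many exact least-squares solutions.
A = rng.rand(3, 5)
b = rng.rand(3)

coef_min_norm = lstsq(A, b)[0]

# Adding any null-space vector of A leaves the residual unchanged.
null_vec = np.linalg.svd(A)[2][-1]  # right singular vector for a zero singular value
coef_other = coef_min_norm + null_vec

print(np.linalg.norm(A @ coef_min_norm - b))  # essentially zero
print(np.linalg.norm(A @ coef_other - b))     # also essentially zero
# lstsq picked the solution with the smallest coefficient norm:
print(np.linalg.norm(coef_min_norm) < np.linalg.norm(coef_other))
```

Zeroing out 5 of the 10 sample weights effectively leaves 5 rows for 6 unknowns (5 coefficients plus the intercept), which is exactly this situation.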
As discussed in #22947 the

So I would rather pass

Later we can also discuss other solvers for
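On the solver point: as the script above notes, sparse `X` is routed through the `lsqr` solver, i.e. `scipy.sparse.linalg.lsqr`, an iterative least-squares method that can also be called directly. A minimal sketch (the matrix here is arbitrary):

```python
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.linalg import lsqr
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
A_dense = rng.rand(10, 5)
A = sparse.csr_matrix(A_dense)
b = rng.rand(10)

# lsqr iteratively minimizes ||A x - b||_2; the first return value is x.
coef_lsqr = lsqr(A, b)[0]

# For a well-conditioned full-rank problem it agrees with the direct solver.
coef_direct = lstsq(A_dense, b)[0]
print(np.allclose(coef_lsqr, coef_direct, rtol=1e-4, atol=1e-6))
```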
I close this as a duplicate in favor of #22947.
Describe the bug

Excluding rows having sample_weight == 0 in LinearRegression does not give the same results.

Steps/Code to Reproduce

Expected Results

Always True.

Actual Results

The print statement gives:

Versions