Prevent scalers from scaling near-constant features to very large values #19527
Conversation
I am guessing we cannot use the approach from scikit-learn/sklearn/feature_selection/_variance_threshold.py, lines 72 to 86 (at 1000d0a)?
That's interesting, but indeed I am not sure what to do with sample weights, and the problem is most often triggered when using sample weights in practice.
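To make the failure mode concrete, here is a minimal sketch (illustrative, not code from this PR) showing that the weighted variance of a constant, non-zero column is typically a tiny positive value rather than exactly zero:

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.full(1000, 100.0)    # constant, non-zero feature
w = rng.uniform(size=1000)  # sample weights

# weighted mean and weighted variance, computed in floating point
mean = np.average(x, weights=w)
var = np.average((x - mean) ** 2, weights=w)
print(var == 0)  # often False: var is tiny rounding noise, not exactly 0.0
```

so a hard `var == 0` check misses the constant feature.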
LGTM, thanks for taking care of this issue @ogrisel!
@thomasjpfan @maikia @agramfort I addressed the comments that I think are directly related to this PR and opened dedicated issues for the follow-up work: #19546 and #19547. I think this PR could be merged as is without waiting for either of the follow-up issues, as it's already a bug fix as it stands.
LGTM too. Thanks @ogrisel
Thanks for fixing this!
Fixes #19450 (an edge case of a test initially written for #19426).

The original problem happens when fitting `StandardScaler(with_mean=False)` with `sample_weight` on constant (but non-zero) columns. Because of the sample weights, the variance is not exactly zero but a very small value. This PR tries to solve the problem by using a dtype-dependent eps threshold to detect constant features instead of a hard `var == 0` test. I added more tests and also adapted `sklearn.linear_model._base._preprocess_data` so that linear models with `normalize=True` behave consistently.
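To make the fix concrete, here is a minimal sketch of such a dtype-dependent detection (the function name and the exact bound are illustrative assumptions, not necessarily the helper added in this PR):

```python
import numpy as np

def is_near_constant(var, mean, n_samples, dtype=np.float64):
    """Heuristically detect features that are constant up to rounding noise."""
    # The computed variance of a truly constant feature is not exactly zero:
    # the rounding error scales with the machine epsilon of the accumulator
    # dtype, the number of samples, and the scale (mean) of the feature.
    eps = np.finfo(dtype).eps
    upper_bound = n_samples * eps * var + (n_samples * mean * eps) ** 2
    return var <= upper_bound
```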
Remaining issues:

- `sklearn.linear_model._base._preprocess_data` is equivalent to `StandardScaler` up to a factor of `np.sqrt(X.shape[0])` for all features except the ones that are constant and non-zero. Those features are typically useless and get absorbed into the intercept, but because of the interaction with regularization, the matching coef and intercept values will be different, even if the predictions are the same. Maybe this is not a problem (see the first sketch after this list). /cc @agramfort @maikia
- There is a numerical stability issue when computing the variance of large but constant features on sparse data: see the tests marked `XFAIL`. For instance, with a column with a constant value of `100.`, we get a variance that is far from zero (it depends on `n_samples`; see the second sketch after this list). This should probably be tackled in a separate PR and tracked by a dedicated bug report: "Weighted variance computation for sparse data is not numerically stable" (#19546).
- Lastly, constant features would stay large constant features with `StandardScaler`. This might feel surprising to some users. Shall we warn when we detect them? This would risk spurious warnings when cross-validating on small-ish datasets, which could be annoying in practice. Update: we discussed this point at the last dev meeting and the shared opinion was to let those features pass through. Maybe we should add a warning, but this can be tackled in a dedicated PR. I opened an issue to track the discussion on this point: "[RFC] Should scalers or other estimators warn when fit on constant features?" (#19547).
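A minimal sketch (plain NumPy, not the private scikit-learn helpers) of the `np.sqrt(X.shape[0])` relationship mentioned in the first item above: dividing centered columns by their L2 norm, as `normalize=True` does, matches standard scaling up to a factor of `sqrt(n_samples)`:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
Xc = X - X.mean(axis=0)

# normalize=True divides by the column L2 norm; StandardScaler divides by
# the column standard deviation. For centered data, norm = sqrt(n) * std.
X_normalized = Xc / np.linalg.norm(Xc, axis=0)
X_standardized = Xc / Xc.std(axis=0)

np.testing.assert_allclose(X_normalized, X_standardized / np.sqrt(X.shape[0]))
```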
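And a minimal sketch (again illustrative, not the actual sparse code path) of the stability issue from the second item: on sparse data the variance has to be computed without centering the column, i.e. as `E[X^2] - E[X]^2`, and this difference of two large, nearly equal numbers can cancel catastrophically for large constant values:

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.full(1_000_000, 100.0)   # large constant feature
w = rng.uniform(size=x.shape[0])

# single-pass weighted moments, as needed when the column cannot be
# centered without densifying the sparse matrix
sum_w = w.sum()
mean = (w * x).sum() / sum_w
mean_of_squares = (w * x ** 2).sum() / sum_w

var = mean_of_squares - mean ** 2  # difference of two numbers close to 1e4
print(var)  # typically small but non-zero (it can even be negative), and
            # the error tends to grow with n_samples and the feature scale
```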