
[MRG] Fix near constant feature detection in StandardScaler and linear models #19788


Merged · 26 commits · Apr 14, 2021

Conversation

jeremiedbb (Member)

Fixes #19726
This PR is built upon #19766, which should be merged first.

The criterion introduced in #19527 to decide whether a feature is constant is not appropriate for some valid use cases, as reported in #19726.

This PR proposes to change the criterion to use the theoretical error bound of the algorithm used to compute the variance. If the computed variance is less than this upper bound, we can't exclude the possibility that the true variance is 0, so we treat the feature as constant.
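As a sketch, the check described above could look as follows; the helper name and the exact form of the bound are assumptions based on this description (following the Chan, Golub & LeVeque analysis of the two-pass algorithm), not a verbatim copy of the diff:

```python
import numpy as np

def is_constant_feature(var, mean, n_samples):
    # Hypothetical helper: a feature is considered constant when its
    # computed variance does not exceed the theoretical error bound of
    # the two-pass variance algorithm (Chan, Golub & LeVeque).
    # Variances are accumulated in float64, hence float64 epsilon.
    eps = np.finfo(np.float64).eps
    upper_bound = n_samples * eps * var + (n_samples * mean * eps) ** 2
    return var <= upper_bound
```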

I also slightly improved the precision of the variance computation for sparse input. It was quite bad for very small variances (you can experiment with this gist: https://gist.github.com/jeremiedbb/224f8db4fb9990cdaf66df94345ea66a).
The idea behind this improvement is that for sparse input we compute the mean with a larger error than for dense input (dense benefits from the better precision of NumPy's pairwise summation). This error becomes a systematic error in sum((x_i - mean)**2), so we introduce a correction term to compensate for it.
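For illustration, here is a minimal dense sketch of the corrected two-pass algorithm that this correction term comes from (the actual sparse implementation is in Cython and iterates over non-zero entries only; this toy version is only meant to show the idea):

```python
import numpy as np

def corrected_two_pass_variance(x):
    # First pass: the mean. For sparse input this step carries a larger
    # error than NumPy's pairwise summation on dense input.
    n = x.shape[0]
    mean = x.sum() / n
    # Second pass: sum of squared deviations from the computed mean.
    sum_sq = np.sum((x - mean) ** 2)
    # If `mean` were exact, sum(x - mean) would be exactly 0. Its actual
    # value estimates the systematic error, and the correction term
    # below compensates for it.
    correction = np.sum(x - mean)
    return (sum_sq - correction ** 2 / n) / n
```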

ogrisel (Member) commented Mar 30, 2021

Can you please add a new non-regression test derived from #19726?

ogrisel (Member) commented Mar 30, 2021

There are probably other occurrences of similar issues introduced in the change of _handle_zeros_in_scale in #19527 (for instance for RobustScaler) but this can be addressed in a later PR to ease the reviewing process.

larsoner (Contributor)

> Can you please add a new non-regression test derived from #19726?

In particular, I would scale not just all entries as I did in that example, but also scale just a single column by the scale factor, to ensure that there are no global effects / interactions between columns.
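A sketch of such a test (the array sizes, scale factor, and tolerance are illustrative; the actual test added by the PR may differ):

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.preprocessing import StandardScaler

def test_near_constant_single_scaled_column():
    rng = np.random.RandomState(0)
    X = rng.uniform(size=(100, 5)).astype(np.float32)
    # Scale a single column by a tiny factor: its variance becomes very
    # small in absolute terms, but the column is not constant and must
    # not be zeroed out by the constant-feature detection.
    X[:, 0] *= 1e-7
    X_scaled = StandardScaler().fit_transform(X)
    # Every column, including the rescaled one, should have unit std.
    assert_allclose(X_scaled.std(axis=0), 1.0, rtol=1e-3)
```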

jeremiedbb (Member, Author)

I made additional improvements to keep test_standard_scaler_constant_features passing for all params:

  • use float64 accumulators for sparse input (this was already the case for dense);
  • use float64 accumulators more carefully for dense input. The issue is that matmul had no dtype argument before NumPy 1.16, but we decided to bump the minimum dependencies for the next release anyway (see the sketch after this list).
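In NumPy terms, the two points amount to something like the following (an illustrative sketch, not the PR's actual code):

```python
import numpy as np

X = np.random.RandomState(0).uniform(size=(1000, 3)).astype(np.float32)

# Accumulate in float64 even though the data is float32: reductions
# accept an explicit accumulator dtype.
mean = np.mean(X, axis=0, dtype=np.float64)

# matmul also accepts a dtype argument, but only since NumPy 1.16
# (hence the need to bump the minimum supported version).
sum_sq = np.matmul(X.T, X, dtype=np.float64)
```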

jeremiedbb (Member, Author)

> There are probably other occurrences of similar issues introduced in the change of _handle_zeros_in_scale in #19527 (for instance for RobustScaler) but this can be addressed in a later PR to ease the reviewing process.

The other scalers don't use a variance-based scale, so the near-constant feature detection added here does not apply to them.

ogrisel (Member) commented Apr 1, 2021

> The other scalers don't use a variance-based scale, so the near-constant feature detection added here does not apply to them.

But the arbitrary threshold defined in this line might be problematic:

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/preprocessing/_data.py#L84

when used in its various call sites, and maybe others. But I agree we do not have the squaring effect of the variance there. Still, we might want to remove the 10x factor; I am not sure.
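For reference, a reconstruction of the kind of check at that line (a sketch assuming the shape of _handle_zeros_in_scale at the time; the 10x factor is the one being questioned):

```python
import numpy as np

def handle_zeros_in_scale(scale, copy=True):
    # Any scale below an absolute threshold of 10 machine epsilons is
    # treated as zero and replaced by 1, so that dividing by it becomes
    # a no-op. The arbitrary 10x factor is what might be problematic.
    if copy:
        scale = scale.copy()
    constant_mask = scale < 10 * np.finfo(scale.dtype).eps
    scale[constant_mask] = 1.0
    return scale
```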

ogrisel (Member) left a review comment
LGTM, just a few suggestions below:

ogrisel (Member) commented Apr 6, 2021

@larsoner I believe this PR addresses all your concerns. Can you confirm?

@jnothman you might be interested in reviewing this as a follow-up fix for #19546.

larsoner (Contributor) commented Apr 6, 2021

Yes, all good in the variants of my MWE that I tried, as well as in the original, more complicated MNE test case -- thanks for checking!

glemaitre (Member) left a review comment

It might be worth updating the docstring of mean_variance_axis to mention that the computation of the variance happens in float64.
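For context, a short usage sketch of the function in question (the float64 accumulation is the behavior such a docstring note would document):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import mean_variance_axis

X = sp.random(100, 5, density=0.3, format="csr",
              dtype=np.float32, random_state=0)
# Even for float32 input, the variance computation accumulates in
# float64 internally, which is what the docstring should mention.
mean, var = mean_variance_axis(X, axis=0)
```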

glemaitre merged commit 684b7d1 into scikit-learn:main on Apr 14, 2021
glemaitre modified the milestones: 0.24.2, 1.0 on Apr 14, 2021
glemaitre (Member)

Thanks @jeremiedbb

thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request on Apr 19, 2021
glemaitre mentioned this pull request on Apr 22, 2021
Successfully merging this pull request may close these issues:
BUG: Regression with StandardScaler due to #19527