
[MRG] Fix near constant feature detection in StandardScaler and linear models #19788


Merged · 26 commits · Apr 14, 2021

Conversation

jeremiedbb (Member)

Fixes #19726
This PR is built upon #19766, which should be merged first.

The criterion introduced in #19527 to decide whether a feature is constant is not appropriate for some valid use cases, as reported in #19726.

This PR proposes to change the criterion to use the theoretical error bound of the algorithm used to compute the variance. If the computed variance is less than this upper bound, we can't exclude the possibility that the true variance is 0, so we treat the feature as constant.
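As a sketch, the check described above could look as follows; the helper name and the exact form of the bound are assumptions based on this description (following the Chan, Golub & LeVeque analysis of the two-pass algorithm), not a verbatim copy of the diff:

```python
import numpy as np

def is_constant_feature(var, mean, n_samples):
    # Hypothetical helper: a feature is considered constant when its
    # computed variance does not exceed the theoretical error bound of
    # the two-pass variance algorithm (Chan, Golub & LeVeque).
    # Variances are accumulated in float64, hence float64 epsilon.
    eps = np.finfo(np.float64).eps
    upper_bound = n_samples * eps * var + (n_samples * mean * eps) ** 2
    return var <= upper_bound
```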

I also slightly improved the precision of the variance computation for sparse input. It was quite bad for very small variances (you can experiment with this gist: https://gist.github.com/jeremiedbb/224f8db4fb9990cdaf66df94345ea66a).
The idea behind this improvement is that for sparse input we compute the mean with a larger error than for dense input (dense benefits from the better precision of NumPy's pairwise summation). This error becomes a systematic error in sum((x_i - mean)**2), so we introduce a correction term to compensate for it.
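For illustration, here is a minimal dense sketch of the corrected two-pass algorithm that this correction term comes from (the actual sparse implementation is in Cython and iterates over non-zero entries only; this toy version is only meant to show the idea):

```python
import numpy as np

def corrected_two_pass_variance(x):
    # First pass: the mean. For sparse input this step carries a larger
    # error than NumPy's pairwise summation on dense input.
    n = x.shape[0]
    mean = x.sum() / n
    # Second pass: sum of squared deviations from the computed mean.
    sum_sq = np.sum((x - mean) ** 2)
    # If `mean` were exact, sum(x - mean) would be exactly 0. Its actual
    # value estimates the systematic error, and the correction term
    # below compensates for it.
    correction = np.sum(x - mean)
    return (sum_sq - correction ** 2 / n) / n
```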

ogrisel (Member) commented Mar 30, 2021

Can you please add a new non-regression test derived from #19726?

ogrisel (Member) commented Mar 30, 2021

There are probably other occurrences of similar issues introduced in the change of _handle_zeros_in_scale in #19527 (for instance for RobustScaler) but this can be addressed in a later PR to ease the reviewing process.

larsoner (Contributor)

> Can you please add a new non-regression test derived from #19726?

In particular, I would scale not just all entries as I did in that example, but also scale just a single column by the scale factor, to ensure that there are no global effects / interactions between columns.
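A sketch of such a test (the array sizes, scale factor, and tolerance are illustrative; the actual test added by the PR may differ):

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.preprocessing import StandardScaler

def test_near_constant_single_scaled_column():
    rng = np.random.RandomState(0)
    X = rng.uniform(size=(100, 5)).astype(np.float32)
    # Scale a single column by a tiny factor: its variance becomes very
    # small in absolute terms, but the column is not constant and must
    # not be zeroed out by the constant-feature detection.
    X[:, 0] *= 1e-7
    X_scaled = StandardScaler().fit_transform(X)
    # Every column, including the rescaled one, should have unit std.
    assert_allclose(X_scaled.std(axis=0), 1.0, rtol=1e-3)
```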

jeremiedbb (Member, Author)

I made additional improvements to keep test_standard_scaler_constant_features passing for all params:

  • use float64 accumulators for sparse input (this was already the case for dense);
  • use float64 accumulators more carefully for dense input. The issue is that matmul had no dtype argument before NumPy 1.16, but we decided to bump the minimum dependencies for the next release anyway (see the sketch after this list).
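In NumPy terms, the two points amount to something like the following (an illustrative sketch, not the PR's actual code):

```python
import numpy as np

X = np.random.RandomState(0).uniform(size=(1000, 3)).astype(np.float32)

# Accumulate in float64 even though the data is float32: reductions
# accept an explicit accumulator dtype.
mean = np.mean(X, axis=0, dtype=np.float64)

# matmul also accepts a dtype argument, but only since NumPy 1.16
# (hence the need to bump the minimum supported version).
sum_sq = np.matmul(X.T, X, dtype=np.float64)
```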

jeremiedbb (Member, Author)

> There are probably other occurrences of similar issues introduced in the change of _handle_zeros_in_scale in #19527 (for instance for RobustScaler) but this can be addressed in a later PR to ease the reviewing process.

The other scalers don't use a variance-based scale, so the near-constant feature detection added here does not apply to them.

ogrisel (Member) commented Apr 1, 2021

> The other scalers don't use a variance-based scale, so the near-constant feature detection added here does not apply to them.

But the arbitrary threshold defined in this line might be problematic:

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/preprocessing/_data.py#L84

when used in its various call sites, and maybe others. But I agree we do not have the squaring effect of the variance there. Still, we might want to remove the 10x factor; I am not sure.
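For reference, a reconstruction of the kind of check at that line (a sketch assuming the shape of _handle_zeros_in_scale at the time; the 10x factor is the one being questioned):

```python
import numpy as np

def handle_zeros_in_scale(scale, copy=True):
    # Any scale below an absolute threshold of 10 machine epsilons is
    # treated as zero and replaced by 1, so that dividing by it becomes
    # a no-op. The arbitrary 10x factor is what might be problematic.
    if copy:
        scale = scale.copy()
    constant_mask = scale < 10 * np.finfo(scale.dtype).eps
    scale[constant_mask] = 1.0
    return scale
```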

ogrisel (Member) left a review comment
LGTM, just a few suggestions below:

ogrisel (Member) commented Apr 6, 2021

@larsoner I believe this PR addresses all your concerns. Can you confirm?

@jnothman you might be interested in reviewing this as a follow-up fix for #19546.

larsoner (Contributor) commented Apr 6, 2021

Yes, all good in the variants of my MWE that I tried, as well as in the original, more complicated MNE test case -- thanks for checking!

glemaitre (Member) left a review comment

It might be worth updating the docstring of mean_variance_axis to mention that the computation of the variance happens in float64.
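For context, a short usage sketch of the function in question (the float64 accumulation is the behavior such a docstring note would document):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import mean_variance_axis

X = sp.random(100, 5, density=0.3, format="csr",
              dtype=np.float32, random_state=0)
# Even for float32 input, the variance computation accumulates in
# float64 internally, which is what the docstring should mention.
mean, var = mean_variance_axis(X, axis=0)
```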

glemaitre merged commit 684b7d1 into scikit-learn:main on Apr 14, 2021
glemaitre modified the milestones: 0.24.2, 1.0 on Apr 14, 2021
glemaitre (Member)

Thanks @jeremiedbb

thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request on Apr 19, 2021
glemaitre mentioned this pull request on Apr 22, 2021
Successfully merging this pull request may close these issues:
BUG: Regression with StandardScaler due to #19527