[MRG] Fix near constant feature detection in StandardScaler and linear models #19788
Conversation
Can you please add a new non-regression test derived from #19726?
There are probably other occurrences of similar issues introduced in the change of …
In particular I would scale not just all entries, as I did in that example, but also scale just a single column by the scale factor, to ensure that there are no global effects / interactions between columns (see the sketch below).
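A minimal sketch of such a non-regression check, assuming the MWE from #19726 reduces to comparing a rescaled fit against an unrescaled reference (data shapes and the scale factors here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rescaling the data (globally or for a single column) by a tiny or huge
# factor must not flip any feature into being detected as constant, so the
# standardized output should be unchanged up to floating-point error.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
Xt_ref = StandardScaler().fit_transform(X)

for scale in (1e-10, 1e10):
    # Scale all entries at once.
    assert np.allclose(StandardScaler().fit_transform(X * scale), Xt_ref)

    # Scale just a single column.
    X_one = X.copy()
    X_one[:, 0] *= scale
    assert np.allclose(StandardScaler().fit_transform(X_one), Xt_ref)
```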
I made additional improvements to make …
The other scalers don't use a variance-based scale, so the near-constant feature detection added here does not apply.
But the arbitrary threshold defined in this line might be problematic: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/preprocessing/_data.py#L84 when used in the following locations:
and maybe others. But I agree we do not have the squaring effect of the variance. Still, we might want to remove the 10x factor; I am not sure. (See the sketch of the heuristic below.)
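For context, the heuristic under discussion is roughly of this form (a paraphrase of the linked line, not an exact quote of the library code):

```python
import numpy as np

def handle_zeros_in_scale_sketch(scale):
    # Treat a scale as "constant" when it falls below a small multiple of
    # machine epsilon. The 10x factor is the arbitrary part questioned above.
    constant_mask = scale < 10 * np.finfo(scale.dtype).eps
    scale = scale.copy()
    scale[constant_mask] = 1.0  # avoid division by ~zero when scaling
    return scale
```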
LGTM, just a few suggestions below:
Yes, all good in variants of my MWE that I tried, as well as in the original, more complicated MNE test case -- thanks for checking!
It might be worth updating the docstring of `mean_variance_axis` to mention that the computation of the variance will happen in float64.
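For example, a usage sketch of the behavior the docstring note would document (assuming float64 accumulation internally even for float32 input, as described above):

```python
import numpy as np
from scipy import sparse
from sklearn.utils.sparsefuncs import mean_variance_axis

# float32 sparse input; per the suggested docstring note, the variance is
# accumulated in float64 internally, which matters for very small variances.
X = sparse.random(1000, 3, density=0.5, format="csr",
                  dtype=np.float32, random_state=0)
means, variances = mean_variance_axis(X, axis=0)
```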
Thanks @jeremiedbb
Fixes #19726
This PR is built upon #19766, which should be merged first.
The criterion to decide if a feature is constant, introduced in #19527, is not appropriate for some valid use cases, as reported in #19726.
This PR proposes to change the criterion to use the theoretical error bound of the algorithm used to compute the variance. If the computed variance is less than this upper bound, we cannot exclude the possibility that the true variance is 0, so we treat the feature as constant.
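A minimal sketch of such a criterion. The exact form of the bound here is an assumption based on the description above; `var`, `mean`, and `n_samples` are per-feature statistics, and accumulation is assumed to happen in float64:

```python
import numpy as np

def is_near_constant(var, mean, n_samples):
    # Theoretical error bound for the computed variance: if the computed
    # value does not exceed the bound, a true variance of 0 cannot be
    # ruled out, so the feature is treated as constant.
    eps = np.finfo(np.float64).eps
    upper_bound = n_samples * eps * var + (n_samples * mean * eps) ** 2
    return var <= upper_bound
```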
I also slightly improved the precision of the variance computation in the sparse case. It was quite bad for very small variances (you can experiment with this gist: https://gist.github.com/jeremiedbb/224f8db4fb9990cdaf66df94345ea66a).
The idea behind this improvement is that in the sparse case we compute the mean with a larger error than in the dense case (where numpy's pairwise summation gives better precision). This error becomes a systematic error in `sum((x_i - mean)**2)`, so we introduce a correction term to compensate for it.
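A dense NumPy sketch of that idea; the actual change lives in the sparse Cython routines, so this only illustrates the corrected two-pass algorithm on a 1-D float64 array:

```python
import numpy as np

def corrected_two_pass_variance(x):
    n = x.shape[0]
    mean = x.sum(dtype=np.float64) / n   # first pass: (possibly inexact) mean
    diff = x - mean
    sum_sq = np.dot(diff, diff)          # second pass: sum((x_i - mean)**2)
    # If `mean` were exact, diff.sum() would be 0; its residual measures the
    # systematic error, and the correction term compensates for it.
    correction = diff.sum() ** 2 / n
    return (sum_sq - correction) / n
```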