Prevent scalers from scaling near-constant features to very large values #19527
Conversation
I am guessing we cannot use the approach from scikit-learn/sklearn/feature_selection/_variance_threshold.py, lines 72 to 86 (at 1000d0a)?
That's interesting, but indeed I am not sure what to do with sample weights, and the problem is most often triggered when using sample weights in practice.
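To make the failure mode concrete, here is a minimal sketch (illustrative, not code from this PR) showing that the weighted variance of a constant, non-zero column is typically a tiny positive value rather than exactly zero:

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.full(1000, 100.0)    # constant, non-zero feature
w = rng.uniform(size=1000)  # sample weights

# weighted mean and weighted variance, computed in floating point
mean = np.average(x, weights=w)
var = np.average((x - mean) ** 2, weights=w)
print(var == 0)  # often False: var is tiny rounding noise, not exactly 0.0
```

so a hard `var == 0` check misses the constant feature.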
LGTM, thanks for taking care of this issue @ogrisel!
@thomasjpfan @maikia @agramfort I addressed the comments that I think are directly related to this PR and opened dedicated issues for the follow-up work: #19546 and #19547. I think this PR could be merged as is without waiting for either of the follow-up issues, as it's already a bug fix as it stands.
LGTM too. Thanks @ogrisel
Thanks for fixing this!
Fixes #19450 (an edge case of a test initially written for #19426).

The original problem happens when fitting `StandardScaler(with_mean=False)` with `sample_weight` on constant (but non-zero) columns. Because of the sample weights, the variance is not exactly zero but a very small value. This PR tries to solve the problem by using a dtype-dependent eps threshold to detect constant features instead of a hard `var == 0` test. I added more tests and also adapted `sklearn.linear_model._base._preprocess_data` so that linear models with `normalize=True` behave consistently.
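To make the fix concrete, here is a minimal sketch of such a dtype-dependent detection (the function name and the exact bound are illustrative assumptions, not necessarily the helper added in this PR):

```python
import numpy as np

def is_near_constant(var, mean, n_samples, dtype=np.float64):
    """Heuristically detect features that are constant up to rounding noise."""
    # The computed variance of a truly constant feature is not exactly zero:
    # the rounding error scales with the machine epsilon of the accumulator
    # dtype, the number of samples, and the scale (mean) of the feature.
    eps = np.finfo(dtype).eps
    upper_bound = n_samples * eps * var + (n_samples * mean * eps) ** 2
    return var <= upper_bound
```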
Remaining issues:

- `sklearn.linear_model._base._preprocess_data` is equivalent to `StandardScaler` up to a factor of `np.sqrt(X.shape[0])` for all features except the ones that are constant and non-zero. Those features are typically useless and get absorbed into the intercept, but because of the interaction with regularization, the matching coef and intercept values will be different, even if the predictions are the same. Maybe this is not a problem (see the first sketch after this list). /cc @agramfort @maikia
- There is a numerical stability issue when computing the variance of large but constant features on sparse data: see the tests marked `XFAIL`. For instance, with a column with a constant value of `100.`, we get a variance that is far from zero (it depends on `n_samples`; see the second sketch after this list). This should probably be tackled in a separate PR and tracked by a dedicated bug report: "Weighted variance computation for sparse data is not numerically stable" (#19546).
- Lastly, constant features would stay large constant features with `StandardScaler`. This might feel surprising to some users. Shall we warn when we detect them? This would risk spurious warnings when cross-validating on small-ish datasets, which could be annoying in practice. Update: we discussed this point at the last dev meeting and the shared opinion was to let those features pass through. Maybe we should add a warning, but this can be tackled in a dedicated PR. I opened an issue to track the discussion on this point: "[RFC] Should scalers or other estimators warn when fit on constant features?" (#19547).
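A minimal sketch (plain NumPy, not the private scikit-learn helpers) of the `np.sqrt(X.shape[0])` relationship mentioned in the first item above: dividing centered columns by their L2 norm, as `normalize=True` does, matches standard scaling up to a factor of `sqrt(n_samples)`:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
Xc = X - X.mean(axis=0)

# normalize=True divides by the column L2 norm; StandardScaler divides by
# the column standard deviation. For centered data, norm = sqrt(n) * std.
X_normalized = Xc / np.linalg.norm(Xc, axis=0)
X_standardized = Xc / Xc.std(axis=0)

np.testing.assert_allclose(X_normalized, X_standardized / np.sqrt(X.shape[0]))
```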
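And a minimal sketch (again illustrative, not the actual sparse code path) of the stability issue from the second item: on sparse data the variance has to be computed without centering the column, i.e. as `E[X^2] - E[X]^2`, and this difference of two large, nearly equal numbers can cancel catastrophically for large constant values:

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.full(1_000_000, 100.0)   # large constant feature
w = rng.uniform(size=x.shape[0])

# single-pass weighted moments, as needed when the column cannot be
# centered without densifying the sparse matrix
sum_w = w.sum()
mean = (w * x).sum() / sum_w
mean_of_squares = (w * x ** 2).sum() / sum_w

var = mean_of_squares - mean ** 2  # difference of two numbers close to 1e4
print(var)  # typically small but non-zero (it can even be negative), and
            # the error tends to grow with n_samples and the feature scale
```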