[MRG] Changed VarianceThreshold behaviour when threshold is zero. See #13691 #13704
Conversation
Tests passed on my machine. Root cause of the error is …
This isn't quite right because it changes the variances. Maybe we should use np.minimum(np.var(X, 1), np.ptp(X, 1)), which more explicitly just handles rounding issues.
This needs a test, perhaps based on the original issue code.
@jnothman …
I mean setting variances_ that way. Unless I'm much mistaken, ptp is exactly zero for a constant feature even when the computed variance is not, so taking the minimum will help resolve numerical precision errors.
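A minimal sketch of that suggestion, assuming per-feature statistics along axis 0 as VarianceThreshold computes them (the helper name is hypothetical):

```python
import numpy as np

def clipped_variances(X):
    # Per-feature variance, clipped by the peak-to-peak range.
    # np.ptp is exactly 0.0 for a constant feature, so the minimum
    # cancels any tiny positive variance introduced by floating
    # point rounding in np.var.
    return np.minimum(np.var(X, axis=0), np.ptp(X, axis=0))
```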
Cool. And the test should be a new function in estimator_checks.py that's yielded by _yield_all_checks if the estimator is a VarianceThreshold?
No, it should be in sklearn/feature_selection/tests.
So a new file for testing this functionality? Most of the existing tests for estimators seem to be in test_common.py as calls to functions in estimator_checks.
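For reference, a regression test along those lines might look like the following (placed per the comment above; the function name and data are illustrative, not the PR's actual test):

```python
# sketch for sklearn/feature_selection/tests/test_variance_threshold.py
import numpy as np
from sklearn.feature_selection import VarianceThreshold

def test_zero_variance_floating_point_error():
    # 0.1 has no exact binary floating point representation, so np.var
    # can report a tiny nonzero variance for a constant column;
    # VarianceThreshold(threshold=0) must still remove that column.
    constant = np.full((10, 1), 0.1)
    varying = np.arange(10, dtype=float).reshape(-1, 1)
    X = np.hstack([constant, varying])
    X_sel = VarianceThreshold(threshold=0).fit_transform(X)
    assert X_sel.shape[1] == 1
```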
Thanks!
Co-Authored-By: rlms <macsweenroddy@gmail.com>
Nitpick but LGTM if all goes green
@jnothman Checks pass; can this now be merged?
Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
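An example of what such an entry might look like, assuming the |Fix| badge convention used elsewhere in that file (exact wording and section placement would follow the existing entries):

```rst
- |Fix| :class:`feature_selection.VarianceThreshold` with ``threshold=0``
  now reliably removes constant features affected by floating point
  rounding, by checking the peak-to-peak range instead of the computed
  variance. :pr:`13704` by :user:`rlms`.
```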
Okay, thanks
Thanks @rlms!
Reference Issues/PRs
Fixes #13691
What does this implement/fix? Explain your changes.
Currently, VarianceThreshold can fail to remove features that contain only a single repeated value when the threshold is 0, due to floating point errors introduced by the division in `np.var`. As proposed in #13691, this changes the implementation of `VarianceThreshold.fit` to use `np.ptp` (the difference between max and min) rather than `np.var` when the threshold is 0 (and the analogous `utils.sparsefuncs.min_max_axis` for sparse inputs). This avoids division and makes VarianceThreshold behave as expected (verified by testing on the code in the issue).
Any other comments?
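To illustrate the failure mode this fixes, a small self-contained demonstration (the array values are illustrative, not the exact data from the issue):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# The first column repeats 0.1, which has no exact binary floating
# point representation, so np.var may report a tiny nonzero variance.
X = np.array([[0.1, 1.0],
              [0.1, 2.0],
              [0.1, 3.0]])

print(np.var(X, axis=0))  # first entry can be ~1e-34 rather than 0.0
print(np.ptp(X, axis=0))  # [0. 2.] -- exactly zero for the constant column

# With the np.ptp-based check, the constant column is removed:
selector = VarianceThreshold(threshold=0)
print(selector.fit_transform(X).shape)  # (3, 1)
```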