[MRG] Changed VarianceThreshold behaviour when threshold is zero. See #13691 (#13704)

Merged: 16 commits into scikit-learn:master on May 28, 2019

Conversation

@rlms (Contributor) commented Apr 23, 2019

Reference Issues/PRs

Fixes #13691

What does this implement/fix? Explain your changes.

Currently, VarianceThreshold can fail to remove features that contain only a single repeated value when the threshold is 0, due to floating point errors introduced by the division in np.var. As proposed in #13691, this changes the implementation of VarianceThreshold.fit to use np.ptp (the difference between the maximum and the minimum, which involves no division) rather than np.var when the threshold is 0, and the analogous utils.sparsefuncs.min_max_axis for sparse inputs. This makes VarianceThreshold behave as expected (verified by testing against the code in the issue).
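
To illustrate the problem, a minimal sketch (the constant value is illustrative, and the exact rounding behaviour can vary by platform):

```python
import numpy as np

# A feature holding one repeated value should have variance exactly 0,
# but np.var divides by n_samples when computing the mean, and that
# division can introduce a tiny positive rounding error.
X = np.full((10, 1), -0.13725701)
print(np.var(X, axis=0))  # may print a tiny positive number, not [0.]

# np.ptp (max - min) involves no division, so a constant feature
# yields exactly 0.
print(np.ptp(X, axis=0))  # [0.]
```

Since the selection mask in VarianceThreshold is variances_ > threshold, any tiny positive variance is enough for a constant feature to survive a threshold of 0.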

Any other comments?

@rlms (Contributor, Author) commented Apr 23, 2019

Tests passed on my machine. The root cause of the CI failure is TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe' at sklearn\utils\sparsefuncs.py:344, raised while executing test_estimators[VarianceThreshold-check_estimator_sparse_data]. I'm not sure why that's happening.
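
For context, a minimal sketch of the sparse analogue using utils.sparsefuncs.min_max_axis (this only illustrates the intended behaviour; it does not reproduce the platform-specific dtype cast error above):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.utils.sparsefuncs import min_max_axis

# Constant sparse feature: per-column minima and maxima are computed
# without any division, so the peak-to-peak range is exactly 0.
X = csr_matrix(np.full((10, 1), -0.13725701))
mins, maxes = min_max_axis(X, axis=0)
print(maxes - mins)  # [0.]
```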

@jnothman (Member) left a comment

This isn't quite right because it changes the variances. Maybe we should use np.minimum(np.var(X, 1), np.ptp(X, 1)) which more explicitly just handles rounding issues.

This needs a test, perhaps based on the code from the original issue.

@rlms (Contributor, Author) commented Apr 24, 2019

@jnothman Sure. Do you mean setting the variances to np.minimum(np.var(X, 1), np.ptp(X, 1)), or keeping self.variances_ as the variances but using the minimum only for the comparison against threshold? Also, I think the axis should be 0 rather than 1. I'll add a test. There's also the issue of the failing check, which I'm struggling to debug because I can't reproduce it locally.

@jnothman (Member) commented

I mean setting variances_ that way. Unless I'm much mistaken, ptp is a lower bound on variance and so will help resolve numerical precision errors.
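
Concretely, with the axis fixed to 0 as noted above, the suggestion amounts to something like the following (a sketch of the idea, not necessarily the exact merged code):

```python
import numpy as np

def fit_variances(X):
    # np.ptp is exactly 0 for a constant column, so the element-wise
    # minimum cancels the rounding error in np.var for such columns.
    # When threshold == 0 only exact zeros matter, so reporting ptp
    # instead of a larger variance does not change which features
    # are selected.
    return np.minimum(np.var(X, axis=0), np.ptp(X, axis=0))

X = np.hstack([
    np.full((10, 1), -0.13725701),   # constant feature
    np.arange(10.0).reshape(-1, 1),  # varying feature
])
print(fit_variances(X))  # [0.   8.25]
```

For sparse inputs the same floor could come from min_max_axis, as sketched earlier.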

@rlms (Contributor, Author) commented Apr 24, 2019

Cool. So should the test be a new function in estimator_checks.py, yielded by _yield_all_checks when the estimator is a VarianceThreshold?

@jnothman (Member) commented Apr 24, 2019 via email

@rlms (Contributor, Author) commented Apr 24, 2019

So a new file for testing this functionality? Most of the existing tests for estimators seem to live in test_common.py, as calls to functions in estimator_checks.

@jnothman (Member) commented Apr 24, 2019 via email
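
For reference, a regression test based on the issue code might look roughly like this (a sketch; the data value and the exact error message are assumptions, relying on VarianceThreshold raising when no features survive selection):

```python
import pytest
from sklearn.feature_selection import VarianceThreshold

def test_zero_variance_floating_point_error():
    # Ten identical samples: the single feature is constant, so a
    # threshold of 0 should remove it. Removing every feature makes
    # VarianceThreshold.fit raise a ValueError, which is what we check.
    data = [[-0.13725701]] * 10
    with pytest.raises(ValueError, match="No feature in X meets"):
        VarianceThreshold().fit(data)
```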

@jnothman (Member) left a comment

Thanks!

@NicolasHug (Member) left a comment

Nitpick but LGTM if all goes green

@rlms (Contributor, Author) commented May 24, 2019

@jnothman The checks pass; can this be merged now?

@jnothman (Member) left a comment

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and any other contributors, if applicable) with :user:.
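
A hypothetical entry following that convention (the wording and placement here are assumptions; the authoritative text is whatever was merged into v0.22.rst):

```rst
- |Fix| Fixed a bug in :class:`feature_selection.VarianceThreshold` where,
  with ``threshold=0``, features containing a single repeated value were
  not always removed, due to floating point rounding in the variance
  computation. :pr:`13704` by :user:`rlms`.
```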

@jnothman (Member) left a comment

Okay, thanks

@jnothman merged commit be03467 into scikit-learn:master on May 28, 2019
@jnothman (Member) commented

Thanks @rlms!

@rlms deleted the variance-threshold-zero-variance branch on May 28, 2019 at 16:08
Successfully merging this pull request may close these issues.

VarianceThreshold doesn't remove feature with zero variance