[MRG] Changed VarianceThreshold behaviour when threshold is zero. See #13691 (#13704)

Merged: 16 commits into scikit-learn:master on May 28, 2019

Conversation

@rlms (Contributor) commented Apr 23, 2019

Reference Issues/PRs

Fixes #13691

What does this implement/fix? Explain your changes.

Currently, VarianceThreshold can fail to remove features that contain only a single repeated value when the threshold is 0, due to floating point errors introduced by the division in np.var. As proposed in #13691, this changes the implementation of VarianceThreshold.fit to use np.ptp (the difference between the maximum and the minimum, which involves no division) rather than np.var when the threshold is 0, and the analogous utils.sparsefuncs.min_max_axis for sparse inputs. This makes VarianceThreshold behave as expected (verified by testing against the code in the issue).
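
To illustrate the problem, a minimal sketch (the constant value is illustrative, and the exact rounding behaviour can vary by platform):

```python
import numpy as np

# A feature holding one repeated value should have variance exactly 0,
# but np.var divides by n_samples when computing the mean, and that
# division can introduce a tiny positive rounding error.
X = np.full((10, 1), -0.13725701)
print(np.var(X, axis=0))  # may print a tiny positive number, not [0.]

# np.ptp (max - min) involves no division, so a constant feature
# yields exactly 0.
print(np.ptp(X, axis=0))  # [0.]
```

Since the selection mask in VarianceThreshold is variances_ > threshold, any tiny positive variance is enough for a constant feature to survive a threshold of 0.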

Any other comments?

@rlms (Contributor, Author) commented Apr 23, 2019

Tests passed on my machine. The root cause of the CI failure is TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe' at sklearn\utils\sparsefuncs.py:344, raised while executing test_estimators[VarianceThreshold-check_estimator_sparse_data]. I'm not sure why that's happening.
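
For context, a minimal sketch of the sparse analogue using utils.sparsefuncs.min_max_axis (this only illustrates the intended behaviour; it does not reproduce the platform-specific dtype cast error above):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.utils.sparsefuncs import min_max_axis

# Constant sparse feature: per-column minima and maxima are computed
# without any division, so the peak-to-peak range is exactly 0.
X = csr_matrix(np.full((10, 1), -0.13725701))
mins, maxes = min_max_axis(X, axis=0)
print(maxes - mins)  # [0.]
```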

@jnothman (Member) left a comment

This isn't quite right because it changes the variances. Maybe we should use np.minimum(np.var(X, 1), np.ptp(X, 1)) which more explicitly just handles rounding issues.

This needs a test, perhaps based on the code from the original issue.

@rlms (Contributor, Author) commented Apr 24, 2019

@jnothman Sure. Do you mean setting the variances to np.minimum(np.var(X, 1), np.ptp(X, 1)), or keeping self.variances_ as the variances but using the minimum only for the comparison against threshold? Also, I think the axis should be 0 rather than 1. I'll add a test. There's also the issue of the failing check, which I'm struggling to debug because I can't reproduce it locally.

@jnothman (Member) commented

I mean setting variances_ that way. Unless I'm much mistaken, ptp is a lower bound on variance and so will help resolve numerical precision errors.
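
Concretely, with the axis fixed to 0 as noted above, the suggestion amounts to something like the following (a sketch of the idea, not necessarily the exact merged code):

```python
import numpy as np

def fit_variances(X):
    # np.ptp is exactly 0 for a constant column, so the element-wise
    # minimum cancels the rounding error in np.var for such columns.
    # When threshold == 0 only exact zeros matter, so reporting ptp
    # instead of a larger variance does not change which features
    # are selected.
    return np.minimum(np.var(X, axis=0), np.ptp(X, axis=0))

X = np.hstack([
    np.full((10, 1), -0.13725701),   # constant feature
    np.arange(10.0).reshape(-1, 1),  # varying feature
])
print(fit_variances(X))  # [0.   8.25]
```

For sparse inputs the same floor could come from min_max_axis, as sketched earlier.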

@rlms (Contributor, Author) commented Apr 24, 2019

Cool. So should the test be a new function in estimator_checks.py, yielded by _yield_all_checks when the estimator is a VarianceThreshold?

@jnothman (Member) commented Apr 24, 2019 via email

@rlms (Contributor, Author) commented Apr 24, 2019

So a new file for testing this functionality? Most of the existing tests for estimators seem to live in test_common.py, as calls to functions in estimator_checks.

@jnothman (Member) commented Apr 24, 2019 via email
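
For reference, a regression test based on the issue code might look roughly like this (a sketch; the data value and the exact error message are assumptions, relying on VarianceThreshold raising when no features survive selection):

```python
import pytest
from sklearn.feature_selection import VarianceThreshold

def test_zero_variance_floating_point_error():
    # Ten identical samples: the single feature is constant, so a
    # threshold of 0 should remove it. Removing every feature makes
    # VarianceThreshold.fit raise a ValueError, which is what we check.
    data = [[-0.13725701]] * 10
    with pytest.raises(ValueError, match="No feature in X meets"):
        VarianceThreshold().fit(data)
```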

@jnothman (Member) left a comment

Thanks!

@NicolasHug (Member) left a comment

Nitpick but LGTM if all goes green

@rlms (Contributor, Author) commented May 24, 2019

@jnothman The checks pass; can this be merged now?

@jnothman (Member) left a comment

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and any other contributors, if applicable) with :user:.
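
A hypothetical entry following that convention (the wording and placement here are assumptions; the authoritative text is whatever was merged into v0.22.rst):

```rst
- |Fix| Fixed a bug in :class:`feature_selection.VarianceThreshold` where,
  with ``threshold=0``, features containing a single repeated value were
  not always removed, due to floating point rounding in the variance
  computation. :pr:`13704` by :user:`rlms`.
```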

@jnothman (Member) left a comment

Okay, thanks

@jnothman merged commit be03467 into scikit-learn:master on May 28, 2019
@jnothman (Member) commented

Thanks @rlms!

@rlms deleted the variance-threshold-zero-variance branch on May 28, 2019 at 16:08
Successfully merging this pull request may close these issues.

VarianceThreshold doesn't remove feature with zero variance