[MRG + 2] fix selectFdr bug #7490


Merged
merged 17 commits into scikit-learn:master
Oct 20, 2016

Conversation

@mpjlu commented Sep 26, 2016

Reference Issue

Fixes #7474

What does this implement/fix? Explain your changes.

Description

SelectFdr in sklearn/feature_selection/univariate_selection.py:

def _get_support_mask(self):
    check_is_fitted(self, 'scores_')

    n_features = len(self.pvalues_)
    sv = np.sort(self.pvalues_)
    selected = sv[sv <= float(self.alpha) / n_features
                  * np.arange(n_features)]
    if selected.size == 0:
        return np.zeros_like(self.pvalues_, dtype=bool)
    return self.pvalues_ <= selected.max()

Should Be:

    selected = sv[sv <= float(self.alpha) / n_features
                  * (np.arange(n_features) + 1)]

Because np.arange starts from 0, while here it should start from 1.
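For illustration, the corrected Benjamini-Hochberg step can be sketched in isolation (a minimal sketch of the procedure, not the sklearn code; `bh_select` is a hypothetical helper name):

```python
import numpy as np

def bh_select(pvalues, alpha):
    # Benjamini-Hochberg: compare the i-th smallest p-value (1-indexed)
    # against alpha * i / n_features, then keep every feature whose
    # p-value is at most the largest passing sorted p-value.
    pvalues = np.asarray(pvalues, dtype=float)
    n_features = len(pvalues)
    sv = np.sort(pvalues)
    # ranks must start at 1, hence arange(1, n_features + 1)
    selected = sv[sv <= alpha / n_features * np.arange(1, n_features + 1)]
    if selected.size == 0:
        return np.zeros_like(pvalues, dtype=bool)
    return pvalues <= selected.max()

# With the buggy arange(n_features), the first threshold is 0, so a
# lone p-value would only ever be selected if it were exactly 0.
print(bh_select([0.04], alpha=0.05))             # [ True]
print(bh_select([0.01, 0.04, 0.9], alpha=0.05))  # [ True False False]
```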

Any other comments?



@jnothman (Member) commented Sep 26, 2016

I've inserted "Fixes #7474" in the issue description above.

@jnothman (Member):
Tests aren't passing.

@mpjlu (Author) commented Sep 26, 2016

Can anyone comment on why this build failed? I am not familiar with this.

@@ -596,7 +596,7 @@ def _get_support_mask(self):
     n_features = len(self.pvalues_)
     sv = np.sort(self.pvalues_)
     selected = sv[sv <= float(self.alpha) / n_features
-                  * np.arange(n_features)]
+                  * (np.arange(n_features) + 1)]
Contributor:
Looks like a pep8 problem. Line break should be after *.

selected = sv[sv <= float(self.alpha) / n_features *
              (np.arange(n_features) + 1)]

@nelson-liu (Contributor) commented Sep 26, 2016

FWIW, I think PEP 8 has changed its stance on this; see [1, 2]. It does seem to be what's causing the Travis failure, though (and we should keep consistency with the rest of the codebase).

[1] https://www.python.org/dev/peps/pep-0008/#should-a-line-break-before-or-after-a-binary-operator
[2] http://bugs.python.org/issue26763

Member:

I'd be happy to disable W503. I tend to agree with the more recent convention in terms of readability.

Strange that https://www.python.org/dev/peps/pep-0008 doesn't mention when the document was updated.

Contributor:

@jnothman, Guido van Rossum commented in the bug tracker on 4/15 that he updated the PEP documentation. Interestingly, I thought W503 was ignored by default in flake8.

Member:

conda flake8 might be outdated.

@@ -404,7 +404,7 @@ def single_fdr(alpha, n_informative, random_state):
     # FDR = E(FP / (TP + FP)) <= alpha
     false_discovery_rate = np.mean([single_fdr(alpha, n_informative,
                                                random_state) for
-                                    random_state in range(30)])
+                                    random_state in range(100)])
Member:

What is this change for?

Author:

FDR = E(FP / (TP + FP)) <= alpha

The FDR is estimated by averaging "FP / (TP + FP)" over many runs (N). The larger N is, the more accurate the FDR estimate; here, the larger the better.
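The averaging idea can be sketched with synthetic p-values (a toy simulation with made-up names and parameters, not the actual test code; the Benjamini-Hochberg step is inlined):

```python
import numpy as np

def bh_mask(pvalues, alpha):
    # Benjamini-Hochberg step-up selection on a vector of p-values.
    n = len(pvalues)
    sv = np.sort(pvalues)
    passing = sv[sv <= alpha / n * np.arange(1, n + 1)]
    if passing.size == 0:
        return np.zeros(n, dtype=bool)
    return pvalues <= passing.max()

def single_fdp(alpha, n_informative, n_features, seed):
    # One draw of the false discovery proportion FP / (TP + FP).
    rng = np.random.RandomState(seed)
    # informative features get near-zero p-values; the rest are uniform nulls
    pvalues = np.concatenate([
        rng.uniform(0.0, 1e-8, n_informative),
        rng.uniform(0.0, 1.0, n_features - n_informative)])
    support = bh_mask(pvalues, alpha)
    fp = support[n_informative:].sum()
    tp = support[:n_informative].sum()
    return fp / max(tp + fp, 1)

alpha = 0.1
# Averaging over more draws (here 200 instead of 30) tightens the estimate.
fdr_estimate = np.mean([single_fdp(alpha, 5, 20, seed) for seed in range(200)])
print(fdr_estimate)  # on average this is at most alpha * 15 / 20 = 0.075
```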

Member:

Did it not pass without that change?

Author:

Yes, with range(30) the test does not pass.

-    selected = sv[sv <= float(self.alpha) / n_features
-                  * np.arange(n_features)]
+    selected = sv[sv <= float(self.alpha) / n_features *
+                  (np.arange(n_features) + 1)]
Member:

should be arange(1, n_features+1)

Author:

done

@jnothman (Member):
In any case, I don't see how this test ensures that your change works. Could you please add a test that clearly would fail at master?

@NelleV (Member) commented Sep 29, 2016

Such a test could be that, provided with only one p-value, the function returns the appropriate answer: in this case, if the p-value is small enough but not 0, it should be selected. In master it would only be selected if the p-value were equal to 0, but now it is selected whenever the p-value is smaller than alpha.
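That single-p-value case can be sketched numerically (an illustrative snippet comparing the two thresholds, not the eventual test code):

```python
import numpy as np

alpha = 0.05
pvalues = np.array([0.01])   # one small but non-zero p-value
n = len(pvalues)

old_thresholds = alpha / n * np.arange(n)         # [0.0]  -> nothing passes
new_thresholds = alpha / n * np.arange(1, n + 1)  # [0.05] -> 0.01 passes

print(pvalues <= old_thresholds)  # [False]  (master: only p == 0 selected)
print(pvalues <= new_thresholds)  # [ True]
```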

@mpjlu (Author) commented Sep 29, 2016

Thanks @NelleV, I agree with you. This test is very simple and direct; is it necessary?

@jnothman (Member) commented Oct 5, 2016

Yes, it is necessary.

@mpjlu (Author) commented Oct 10, 2016

Hi @jnothman, sorry for the late reply; it was a national holiday.
A clear case showing the change works would be like @NelleV's previous comment, but such a case is not easy to construct with the current approach of using make_regression to generate random sample data.
It is easy to construct sample data manually that shows the change works, but that code style would differ from the other tests, which all generate their sample data.

I propose the following method to show the change works.
The test asserts num_false_positives and num_true_positives, like this:

def test_select_fdr_regression2():
    n_informative = 5
    X, y = make_regression(n_samples=150, n_features=20,
                           n_informative=n_informative, shuffle=False,
                           random_state=0, noise=10)
    with warnings.catch_warnings(record=True):
        # Warnings can be raised when no features are selected
        # (low alpha or very noisy data)
        univariate_filter = SelectFdr(f_regression, alpha=0.01)
        univariate_filter.fit(X, y)
    support = univariate_filter.get_support()
    num_false_positives = np.sum(support[n_informative:] == 1)
    num_true_positives = np.sum(support[:n_informative] == 1)
    assert_equal(num_false_positives, 5)
    assert_equal(num_true_positives, 7)

What do you think?
Actually, there are many cases where this change works and master fails; we just need to write one here.

@amueller (Member):
@mpjlu please format your code using backticks (I did it for you this time).

@amueller (Member):
I don't understand the test. The FDR in this case is 5/12 which is about 0.5, while the test claims to control it at .01? Or am I misunderstanding something here?

@mpjlu (Author) commented Oct 11, 2016

Hi @amueller, the code in the last comment is just a code template, not the real data. Sorry for the misunderstanding.

@amueller (Member):
ah...

@mpjlu (Author) commented Oct 14, 2016

Hi @jnothman, do you think I should add a test case (with manually constructed sample data) that would clearly fail on master? Is there any other work I need to do for this PR?

@jnothman (Member):
You should construct a test case that ensures the implementation does what it should. Ideally it would exercise the off-by-one distinction that is presently violated.

@mpjlu (Author) commented Oct 17, 2016

Hi @jnothman, I added a test case for this PR. With the test's sample data, the master code can never select exactly one feature: it selects either two features or none. With this PR, 0, 1, or 2 features are selected depending on alpha.

# by PR #7490, the result is array([True, False])
X = np.array([[10, 20], [20, 20], [20, 30]])
y = np.array([[1], [0], [0]])
univariate_filter = SelectFdr(chi2, alpha=0.1)
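As a sanity check, the p-values behind that comment can be verified directly (a sketch assuming sklearn's chi2 scorer and a SelectFdr release that includes this fix; y is written as a 1-D array here):

```python
import numpy as np
from sklearn.feature_selection import SelectFdr, chi2

X = np.array([[10, 20], [20, 20], [20, 30]])
y = np.array([1, 0, 0])

scores, pvalues = chi2(X, y)
# feature 0: chi2 stat 4.0, p ~= 0.0455; feature 1: stat ~0.71, p ~= 0.40
univariate_filter = SelectFdr(chi2, alpha=0.1).fit(X, y)
support = univariate_filter.get_support()
print(support)  # with the fix: [ True False]; on master, neither was selected
```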
Member:

can you please assert that the p-values are the ones you specify above?

Author:

Yes, the p-values in comments are real p-values.

Author:

Do you mean I should assert the p-values in the test code?

Member:

yes please.

@mpjlu (Author) commented Oct 19, 2016

Sorry, what does "add an entry to what's new" mean?

@maniteja123 (Contributor):
Hi, since it is a bug fix, I suppose you can add an entry to what's new under the bug fixes section. Hope it helps.

@mpjlu (Author) commented Oct 19, 2016

thanks @maniteja123

@@ -44,6 +44,10 @@ Enhancements
Bug fixes
.........

- :class:`sklearn.feature_selection.SelectFdr` now correctly works
Member:

A more descriptive message to tell people what was broken and is now fixed would be helpful. Often we use a message like "Fixed a bug where :class:`feature_selection.SelectFdr` did not ..."

Author:

How about "Fixed a bug where :class:`feature_selection.SelectFdr` did not accurately implement the Benjamini-Hochberg procedure"?

@@ -44,6 +44,11 @@ Enhancements
Bug fixes
.........

- Fix a bug where :class:`sklearn.feature_selection.SelectFdr` did not
exactly implement Benjamini-Hochberg procedure
Member:

That's better, but indicating that it formerly may have selected fewer features than it should (and now does) would be more helpful for former users of this estimator.

Author:

updated, thanks.

@mpjlu (Author) commented Oct 19, 2016

It seems there is a build environment problem. Any comments on this? Thanks.

@maniteja123 (Contributor):
Hi, yeah, you are right. It looks like it was some AppVeyor issue. I suppose it is fine since all the other builds succeed.

@mpjlu (Author) commented Oct 19, 2016

thanks @maniteja123

@amueller amueller added this to the 0.18.1 milestone Oct 19, 2016
@amueller (Member) left a comment:

Looks good to me apart from adding the link.

exactly implement Benjamini-Hochberg procedure. It formerly may have
selected fewer features than it should.
(`#7490 <https://github.com/scikit-learn/scikit-learn/pull/7490>`_) by
`Peng Meng`_.
Member:

You need to add yourself at the bottom of the page if you want your name to be a link.

Author:

Done, thanks very much.

@amueller amueller changed the title [MGR + 1] fix selectFdr bug [MGR + 2] fix selectFdr bug Oct 19, 2016
@jnothman jnothman changed the title [MGR + 2] fix selectFdr bug [MRG + 2] fix selectFdr bug Oct 20, 2016
@jnothman jnothman merged commit 2caa144 into scikit-learn:master Oct 20, 2016
jnothman added a commit to jnothman/scikit-learn that referenced this pull request Oct 20, 2016
amueller added a commit to amueller/scikit-learn that referenced this pull request Oct 25, 2016
# Conflicts:
#	doc/whats_new.rst
amueller added a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016
# Conflicts:
#	doc/whats_new.rst
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017
* tag '0.18.1': (144 commits)
  skip tree-test on 32bit
  do the warning test as we do it in other places.
  Replase assert_equal by assert_almost_equal in cosine test
  version bump 0.18.1
  fix merge conflict mess in whatsnew
  add the python2.6 warning to 0.18.1
  fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR.
  sync whatsnew with master
  [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553)
  FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680)
  Add whats new entry for scikit-learn#6282 (scikit-learn#7629)
  [MGR + 2] fix selectFdr bug (scikit-learn#7490)
  fixed whatsnew cherry-pick mess (somewhat)
  [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874)
  [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764)
  [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824)
  Fix docstring typo (scikit-learn#7844) n_features --> n_components
  [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818)
  [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799)
  DOC : fix docstring of AIC/BIC in GMM
  ...
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017
* releases: (144 commits)

 Conflicts:  removed
	sklearn/externals/joblib/__init__.py
	sklearn/externals/joblib/_parallel_backends.py
	sklearn/externals/joblib/testing.py
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017
* dfsg: (144 commits)
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
Successfully merging this pull request may close these issues.

univariate_selection -> feature_selection -> selectFDR is wrong ?
7 participants