[MRG + 2] fix selectFdr bug #7490


Merged
merged 17 commits into scikit-learn:master
Oct 20, 2016

Conversation

@mpjlu commented Sep 26, 2016

Reference Issue

Fixes #7474

What does this implement/fix? Explain your changes.

Description

SelectFdr in sklearn/feature_selection/univariate_selection.py:

def _get_support_mask(self):
    check_is_fitted(self, 'scores_')

    n_features = len(self.pvalues_)
    sv = np.sort(self.pvalues_)
    selected = sv[sv <= float(self.alpha) / n_features
                  * np.arange(n_features)]
    if selected.size == 0:
        return np.zeros_like(self.pvalues_, dtype=bool)
    return self.pvalues_ <= selected.max()

Should Be:

    selected = sv[sv <= float(self.alpha) / n_features
                  * (np.arange(n_features) + 1)]

Because np.arange starts from 0, while here it should start from 1.
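For illustration, the corrected Benjamini-Hochberg step can be sketched in isolation (a minimal sketch of the procedure, not the sklearn code; `bh_select` is a hypothetical helper name):

```python
import numpy as np

def bh_select(pvalues, alpha):
    # Benjamini-Hochberg: compare the i-th smallest p-value (1-indexed)
    # against alpha * i / n_features, then keep every feature whose
    # p-value is at most the largest passing sorted p-value.
    pvalues = np.asarray(pvalues, dtype=float)
    n_features = len(pvalues)
    sv = np.sort(pvalues)
    # ranks must start at 1, hence arange(1, n_features + 1)
    selected = sv[sv <= alpha / n_features * np.arange(1, n_features + 1)]
    if selected.size == 0:
        return np.zeros_like(pvalues, dtype=bool)
    return pvalues <= selected.max()

# With the buggy arange(n_features), the first threshold is 0, so a
# lone p-value would only ever be selected if it were exactly 0.
print(bh_select([0.04], alpha=0.05))             # [ True]
print(bh_select([0.01, 0.04, 0.9], alpha=0.05))  # [ True False False]
```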

Any other comments?



@jnothman (Member) commented Sep 26, 2016

I've inserted "Fixes #7474" in the issue description above.

@jnothman (Member):
Tests aren't passing.

@mpjlu (Author) commented Sep 26, 2016

Can anyone comment on why this build failed? I am not familiar with this.

@@ -596,7 +596,7 @@ def _get_support_mask(self):
     n_features = len(self.pvalues_)
     sv = np.sort(self.pvalues_)
     selected = sv[sv <= float(self.alpha) / n_features
-                  * np.arange(n_features)]
+                  * (np.arange(n_features) + 1)]
Contributor:
Looks like a pep8 problem. Line break should be after *.

selected = sv[sv <= float(self.alpha) / n_features *
              (np.arange(n_features) + 1)]

@nelson-liu (Contributor) commented Sep 26, 2016

FWIW, I think PEP 8 has changed its stance on this; see [1, 2]. It does seem to be what's causing the Travis failure, though (and we should keep consistency with the rest of the codebase).

[1] https://www.python.org/dev/peps/pep-0008/#should-a-line-break-before-or-after-a-binary-operator
[2] http://bugs.python.org/issue26763

Member:

I'd be happy to disable W503. I tend to agree with the more recent convention in terms of readability.

Strange that https://www.python.org/dev/peps/pep-0008 doesn't mention when the document was updated.

Contributor:

@jnothman, Guido van Rossum commented in the bug tracker on 4/15 that he updated the PEP documentation. Interestingly, I thought W503 was ignored by default in flake8.

Member:

conda flake8 might be outdated.

@@ -404,7 +404,7 @@ def single_fdr(alpha, n_informative, random_state):
     # FDR = E(FP / (TP + FP)) <= alpha
     false_discovery_rate = np.mean([single_fdr(alpha, n_informative,
                                                random_state) for
-                                    random_state in range(30)])
+                                    random_state in range(100)])
Member:

What is this change for?

Author:

FDR = E(FP / (TP + FP)) <= alpha

The FDR is estimated by averaging "FP / (TP + FP)" over many runs (N). The larger N is, the more accurate the FDR estimate; here, the larger the better.
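The averaging idea can be sketched with synthetic p-values (a toy simulation with made-up names and parameters, not the actual test code; the Benjamini-Hochberg step is inlined):

```python
import numpy as np

def bh_mask(pvalues, alpha):
    # Benjamini-Hochberg step-up selection on a vector of p-values.
    n = len(pvalues)
    sv = np.sort(pvalues)
    passing = sv[sv <= alpha / n * np.arange(1, n + 1)]
    if passing.size == 0:
        return np.zeros(n, dtype=bool)
    return pvalues <= passing.max()

def single_fdp(alpha, n_informative, n_features, seed):
    # One draw of the false discovery proportion FP / (TP + FP).
    rng = np.random.RandomState(seed)
    # informative features get near-zero p-values; the rest are uniform nulls
    pvalues = np.concatenate([
        rng.uniform(0.0, 1e-8, n_informative),
        rng.uniform(0.0, 1.0, n_features - n_informative)])
    support = bh_mask(pvalues, alpha)
    fp = support[n_informative:].sum()
    tp = support[:n_informative].sum()
    return fp / max(tp + fp, 1)

alpha = 0.1
# Averaging over more draws (here 200 instead of 30) tightens the estimate.
fdr_estimate = np.mean([single_fdp(alpha, 5, 20, seed) for seed in range(200)])
print(fdr_estimate)  # on average this is at most alpha * 15 / 20 = 0.075
```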

Member:

Did it not pass without that change?

Author:

Yes, with range(30) the test does not pass.

-    selected = sv[sv <= float(self.alpha) / n_features
-                  * np.arange(n_features)]
+    selected = sv[sv <= float(self.alpha) / n_features *
+                  (np.arange(n_features) + 1)]
Member:

should be arange(1, n_features+1)

Author:

done

@jnothman (Member):
In any case, I don't see how this test ensures that your change works. Could you please add a test that clearly would fail at master?

@NelleV (Member) commented Sep 29, 2016

Such a test could be that, provided with only one p-value, the function returns the appropriate answer: in this case, if the p-value is small enough but not 0, it should be selected. In master it would only be selected if the p-value were equal to 0, but now it is selected whenever the p-value is smaller than alpha.
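That single-p-value case can be sketched numerically (an illustrative snippet comparing the two thresholds, not the eventual test code):

```python
import numpy as np

alpha = 0.05
pvalues = np.array([0.01])   # one small but non-zero p-value
n = len(pvalues)

old_thresholds = alpha / n * np.arange(n)         # [0.0]  -> nothing passes
new_thresholds = alpha / n * np.arange(1, n + 1)  # [0.05] -> 0.01 passes

print(pvalues <= old_thresholds)  # [False]  (master: only p == 0 selected)
print(pvalues <= new_thresholds)  # [ True]
```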

@mpjlu (Author) commented Sep 29, 2016

Thanks @NelleV, I agree with you. This test is very simple and direct; is it necessary?

@jnothman (Member) commented Oct 5, 2016

Yes, it is necessary.

@mpjlu (Author) commented Oct 10, 2016

Hi @jnothman, sorry for the late reply; it was a national holiday.
A clear case showing the change works would be like @NelleV's previous comment, but such a case is not easy to construct with the current approach of using make_regression to generate random sample data.
It is easy to construct sample data manually that shows the change works, but that code style would differ from the other tests, which all generate their sample data.

I propose the following method to show the change works.
The test asserts num_false_positives and num_true_positives, like this:

def test_select_fdr_regression2():
    n_informative = 5
    X, y = make_regression(n_samples=150, n_features=20,
                           n_informative=n_informative, shuffle=False,
                           random_state=0, noise=10)
    with warnings.catch_warnings(record=True):
        # Warnings can be raised when no features are selected
        # (low alpha or very noisy data)
        univariate_filter = SelectFdr(f_regression, alpha=0.01)
        univariate_filter.fit(X, y)
    support = univariate_filter.get_support()
    num_false_positives = np.sum(support[n_informative:] == 1)
    num_true_positives = np.sum(support[:n_informative] == 1)
    assert_equal(num_false_positives, 5)
    assert_equal(num_true_positives, 7)

What do you think?
Actually, there are many cases where this change works and master fails; we just need to write one here.

@amueller (Member):
@mpjlu please format your code using backticks (I did it for you this time).

@amueller (Member):
I don't understand the test. The FDR in this case is 5/12 which is about 0.5, while the test claims to control it at .01? Or am I misunderstanding something here?

@mpjlu (Author) commented Oct 11, 2016

Hi @amueller, the code in the last comment is just a code template, not the real data. Sorry for the misunderstanding.

@amueller (Member):
ah...

@mpjlu (Author) commented Oct 14, 2016

Hi @jnothman, do you think I should add a test case (with manually constructed sample data) that would clearly fail on master? Is there any other work I need to do for this PR?

@jnothman (Member):
You should construct a test case that ensures the implementation does what it should. Ideally it would exercise the off-by-one distinction that is presently violated.

@mpjlu (Author) commented Oct 17, 2016

Hi @jnothman, I added a test case for this PR. With the test's sample data, the master code can never select exactly one feature: it selects either two features or none. With this PR, 0, 1, or 2 features are selected depending on alpha.

# by PR #7490, the result is array([True, False])
X = np.array([[10, 20], [20, 20], [20, 30]])
y = np.array([[1], [0], [0]])
univariate_filter = SelectFdr(chi2, alpha=0.1)
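As a sanity check, the p-values behind that comment can be verified directly (a sketch assuming sklearn's chi2 scorer and a SelectFdr release that includes this fix; y is written as a 1-D array here):

```python
import numpy as np
from sklearn.feature_selection import SelectFdr, chi2

X = np.array([[10, 20], [20, 20], [20, 30]])
y = np.array([1, 0, 0])

scores, pvalues = chi2(X, y)
# feature 0: chi2 stat 4.0, p ~= 0.0455; feature 1: stat ~0.71, p ~= 0.40
univariate_filter = SelectFdr(chi2, alpha=0.1).fit(X, y)
support = univariate_filter.get_support()
print(support)  # with the fix: [ True False]; on master, neither was selected
```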
Member:

can you please assert that the p-values are the ones you specify above?

Author:

Yes, the p-values in comments are real p-values.

Author:

Do you mean I should assert the p-values in the test code?

Member:

yes please.

@mpjlu (Author) commented Oct 19, 2016

Sorry, what does "add an entry to what's new" mean?

@maniteja123 (Contributor):
Hi, since it is a bug fix, I suppose you can add an entry to what's new under the bug fixes section. Hope it helps.

@mpjlu (Author) commented Oct 19, 2016

thanks @maniteja123

@@ -44,6 +44,10 @@ Enhancements
Bug fixes
.........

- :class:`sklearn.feature_selection.SelectFdr` now correctly works
Member:

A more descriptive message to tell people what was broken and is now fixed would be helpful. Often we use a message like "Fixed a bug where :class:`feature_selection.SelectFdr` did not ..."

Author:

How about "Fixed a bug where :class:`feature_selection.SelectFdr` did not accurately implement the Benjamini-Hochberg procedure"?

@@ -44,6 +44,11 @@ Enhancements
Bug fixes
.........

- Fix a bug where :class:`sklearn.feature_selection.SelectFdr` did not
exactly implement Benjamini-Hochberg procedure
Member:

That's better, but indicating that it formerly may have selected fewer features than it should (and now does) would be more helpful for former users of this estimator.

Author:

updated, thanks.

@mpjlu (Author) commented Oct 19, 2016

It seems there is a build environment problem. Any comments on this? Thanks.

@maniteja123 (Contributor):
Hi, yeah, you are right. It looks like it was some AppVeyor issue. I suppose it is fine since all the other builds succeed.

@mpjlu (Author) commented Oct 19, 2016

thanks @maniteja123

@amueller amueller added this to the 0.18.1 milestone Oct 19, 2016
@amueller (Member) left a comment:

Looks good to me apart from adding the link.

exactly implement Benjamini-Hochberg procedure. It formerly may have
selected fewer features than it should.
(`#7490 <https://github.com/scikit-learn/scikit-learn/pull/7490>`_) by
`Peng Meng`_.
Member:

You need to add yourself at the bottom of the page if you want your name to be a link.

Author:

Done, thanks very much.

@amueller amueller changed the title [MGR + 1] fix selectFdr bug [MGR + 2] fix selectFdr bug Oct 19, 2016
@jnothman jnothman changed the title [MGR + 2] fix selectFdr bug [MRG + 2] fix selectFdr bug Oct 20, 2016
@jnothman jnothman merged commit 2caa144 into scikit-learn:master Oct 20, 2016
jnothman added a commit to jnothman/scikit-learn that referenced this pull request Oct 20, 2016
amueller added a commit to amueller/scikit-learn that referenced this pull request Oct 25, 2016
# Conflicts:
#	doc/whats_new.rst
amueller added a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016
# Conflicts:
#	doc/whats_new.rst
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017
* tag '0.18.1': (144 commits)
  skip tree-test on 32bit
  do the warning test as we do it in other places.
  Replase assert_equal by assert_almost_equal in cosine test
  version bump 0.18.1
  fix merge conflict mess in whatsnew
  add the python2.6 warning to 0.18.1
  fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR.
  sync whatsnew with master
  [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553)
  FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680)
  Add whats new entry for scikit-learn#6282 (scikit-learn#7629)
  [MGR + 2] fix selectFdr bug (scikit-learn#7490)
  fixed whatsnew cherry-pick mess (somewhat)
  [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874)
  [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764)
  [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824)
  Fix docstring typo (scikit-learn#7844) n_features --> n_components
  [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818)
  [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799)
  DOC : fix docstring of AIC/BIC in GMM
  ...
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017
* releases: (144 commits)

 Conflicts:  removed
	sklearn/externals/joblib/__init__.py
	sklearn/externals/joblib/_parallel_backends.py
	sklearn/externals/joblib/testing.py
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017
* dfsg: (144 commits)
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
Successfully merging this pull request may close these issues.

univariate_selection -> feature_selection -> selectFDR is wrong ?
7 participants