[MRG + 2] fix selectFdr bug #7490

Merged (17 commits) on Oct 20, 2016
8 changes: 8 additions & 0 deletions doc/whats_new.rst
@@ -50,6 +50,12 @@ Enhancements
Bug fixes
.........

- Fix a bug where :class:`sklearn.feature_selection.SelectFdr` did not
  exactly implement the Benjamini-Hochberg procedure and could formerly
  select fewer features than it should have.
  (`#7490 <https://github.com/scikit-learn/scikit-learn/pull/7490>`_) by
  `Peng Meng`_.
Member: You need to add yourself at the bottom of the page if you want your name to be a link.

Author: Done, thanks very much.


- :class:`sklearn.manifold.LocallyLinearEmbedding` now correctly handles
integer inputs
(`#6282 <https://github.com/scikit-learn/scikit-learn/pull/6282>`_) by
@@ -4873,3 +4879,5 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
.. _Eugene Chen: https://github.com/eyc88

.. _Narine Kokhlikyan: https://github.com/NarineK

.. _Peng Meng: https://github.com/mpjlu
36 changes: 35 additions & 1 deletion sklearn/feature_selection/tests/test_feature_select.py
@@ -371,6 +371,40 @@ def test_select_heuristics_regression():
assert_less(np.sum(support[5:] == 1), 3)


def test_boundary_case_ch2():
# Test boundary case, and always aim to select 1 feature.
X = np.array([[10, 20], [20, 20], [20, 30]])
y = np.array([[1], [0], [0]])
scores, pvalues = chi2(X, y)
assert_array_almost_equal(scores, np.array([4., 0.71428571]))
assert_array_almost_equal(pvalues, np.array([0.04550026, 0.39802472]))

filter_fdr = SelectFdr(chi2, alpha=0.1)
filter_fdr.fit(X, y)
support_fdr = filter_fdr.get_support()
assert_array_equal(support_fdr, np.array([True, False]))

filter_kbest = SelectKBest(chi2, k=1)
Member: @amueller did you mean this sort of thing, testing the select-1-feature boundary case for each strategy?

filter_kbest.fit(X, y)
support_kbest = filter_kbest.get_support()
assert_array_equal(support_kbest, np.array([True, False]))

filter_percentile = SelectPercentile(chi2, percentile=50)
filter_percentile.fit(X, y)
support_percentile = filter_percentile.get_support()
assert_array_equal(support_percentile, np.array([True, False]))

filter_fpr = SelectFpr(chi2, alpha=0.1)
filter_fpr.fit(X, y)
support_fpr = filter_fpr.get_support()
assert_array_equal(support_fpr, np.array([True, False]))

filter_fwe = SelectFwe(chi2, alpha=0.1)
filter_fwe.fit(X, y)
support_fwe = filter_fwe.get_support()
assert_array_equal(support_fwe, np.array([True, False]))


def test_select_fdr_regression():
# Test that fdr heuristic actually has low FDR.
def single_fdr(alpha, n_informative, random_state):
@@ -404,7 +438,7 @@ def single_fdr(alpha, n_informative, random_state):
# FDR = E(FP / (TP + FP)) <= alpha
false_discovery_rate = np.mean([single_fdr(alpha, n_informative,
random_state) for
-    random_state in range(30)])
+    random_state in range(100)])
Member: What is this change for?

Author: FDR = E(FP / (TP + FP)) <= alpha.

The empirical FDR is computed by averaging "FP / (TP + FP)" over N runs. The larger N is, the more accurate the estimate, so here the larger the better.
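The averaging argument can be illustrated with a toy Monte Carlo sketch (synthetic numbers standing in for the per-seed FDR values, purely illustrative): the standard error of the mean shrinks like 1/sqrt(N), so averaging 100 runs gives a tighter estimate of E(FDR) than averaging 30.

```python
import numpy as np

# Synthetic stand-ins for per-seed FDR values (uniform noise, not
# the test's actual single_fdr results).
rng = np.random.RandomState(42)
per_seed_fdr = rng.uniform(0.0, 0.2, size=1000)

def std_error(samples):
    # Standard error of the Monte Carlo mean: s / sqrt(N).
    return samples.std(ddof=1) / np.sqrt(len(samples))

se_30 = std_error(per_seed_fdr[:30])
se_100 = std_error(per_seed_fdr[:100])
# se_100 is smaller than se_30: the 100-run average fluctuates less
# around the true expectation, so the alpha >= FDR assertion is less
# likely to fail by chance.
```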

Member: Did it not pass without that change?

Author: Yes, with range(30) the test does not pass.

assert_greater_equal(alpha, false_discovery_rate)

# Make sure that the empirical false discovery rate increases
4 changes: 2 additions & 2 deletions sklearn/feature_selection/univariate_selection.py
@@ -596,8 +596,8 @@ def _get_support_mask(self):

n_features = len(self.pvalues_)
sv = np.sort(self.pvalues_)
-    selected = sv[sv <= float(self.alpha) / n_features
-                  * np.arange(n_features)]
+    selected = sv[sv <= float(self.alpha) / n_features *
+                  np.arange(1, n_features + 1)]
if selected.size == 0:
return np.zeros_like(self.pvalues_, dtype=bool)
return self.pvalues_ <= selected.max()
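A standalone sketch of the corrected Benjamini-Hochberg step (plain NumPy, mirroring but not identical to scikit-learn's internal `_get_support_mask`), using the p-values from the boundary-case test above. The old 0-based `np.arange(n_features)` made the first threshold zero, so at the boundary no feature could be selected:

```python
import numpy as np

def bh_support(pvalues, alpha):
    """Benjamini-Hochberg: keep all p-values up to the largest sorted
    p_(k) satisfying p_(k) <= alpha * k / n, with k ranked from 1."""
    n = len(pvalues)
    sv = np.sort(pvalues)
    selected = sv[sv <= float(alpha) / n * np.arange(1, n + 1)]
    if selected.size == 0:
        return np.zeros_like(pvalues, dtype=bool)
    return pvalues <= selected.max()

# p-values from the chi2 boundary-case test above.
pvalues = np.array([0.04550026, 0.39802472])
support = bh_support(pvalues, alpha=0.1)
# corrected 1-based thresholds [0.05, 0.1] -> support [True, False];
# the old 0-based thresholds [0.0, 0.05] selected nothing at all.
```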