[MRG + 2] fix selectFdr bug #7490
Conversation
I've inserted "Fixes #7474" in the issue description above.
Tests aren't passing.
Can anyone comment on this build failure? I am not familiar with it.
```diff
@@ -596,7 +596,7 @@ def _get_support_mask(self):
         n_features = len(self.pvalues_)
         sv = np.sort(self.pvalues_)
         selected = sv[sv <= float(self.alpha) / n_features
-                      * np.arange(n_features)]
+                      * (np.arange(n_features) + 1)]
```
Looks like a pep8 problem. The line break should be after `*`.
selected = sv[sv <= float(self.alpha) / n_features *
(np.arange(n_features) + 1)]
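For context, the Benjamini-Hochberg step compares the i-th smallest p-value against `alpha * i / n_features` for ranks i = 1..n. With `np.arange(n_features)` the smallest p-value gets rank 0 and thus a threshold of 0, so it can only be selected if it is exactly 0. A minimal standalone numpy sketch of the masking step (not scikit-learn's code; `bh_num_selected` is a hypothetical helper):

```python
import numpy as np

def bh_num_selected(pvalues, alpha, buggy=False):
    """Count p-values passing the rank-dependent BH thresholds."""
    n = len(pvalues)
    sv = np.sort(pvalues)
    # ranks should run 1..n; the buggy variant runs 0..n-1
    ranks = np.arange(n) if buggy else np.arange(n) + 1
    return int((sv <= alpha / n * ranks).sum())

pvals = np.array([0.001, 0.2, 0.3, 0.4, 0.5])
print(bh_num_selected(pvals, alpha=0.05, buggy=True))   # 0 -- threshold for the top rank is 0
print(bh_num_selected(pvals, alpha=0.05, buggy=False))  # 1 -- 0.001 <= 0.05 * 1/5
```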
fwiw, i think pep8 has changed their stance on this, see [1,2]. It does seem to be what's causing the travis failure though (and we should keep consistency with the rest of the codebase).
[1] https://www.python.org/dev/peps/pep-0008/#should-a-line-break-before-or-after-a-binary-operator
[2] http://bugs.python.org/issue26763
I'd be happy to disable W503. I tend to agree with the more recent convention in terms of readability.
Strange that https://www.python.org/dev/peps/pep-0008 doesn't mention when the document was updated.
@jnothman , guido van rossum comments in the bug tracker on 4/15 that he updated the pep documentation. Interestingly, I thought that W503 is ignored by default in flake8
conda flake8 might be outdated.
```diff
@@ -404,7 +404,7 @@ def single_fdr(alpha, n_informative, random_state):
     # FDR = E(FP / (TP + FP)) <= alpha
     false_discovery_rate = np.mean([single_fdr(alpha, n_informative,
                                                random_state) for
-                                    random_state in range(30)])
+                                    random_state in range(100)])
```
What is this change for?
FDR = E(FP / (TP + FP)) <= alpha
The FDR is estimated by averaging FP / (TP + FP) over N runs.
The larger N is, the more accurate the estimate, so a larger N is better here.
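The averaging idea can be sketched standalone (a toy Monte Carlo with a hypothetical `single_fdr` and a plain Benjamini-Hochberg implementation, not the scikit-learn test): each run draws uniform null p-values and near-zero informative ones, applies BH, and records FP / (TP + FP); the mean over runs estimates the expected FDR, and more runs reduce its variance.

```python
import numpy as np

def bh_select(pvalues, alpha):
    """Benjamini-Hochberg step-up rule: boolean mask of rejected nulls."""
    n = len(pvalues)
    order = np.argsort(pvalues)
    below = pvalues[order] <= alpha * (np.arange(n) + 1) / n
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    mask = np.zeros(n, dtype=bool)
    mask[order[:k]] = True
    return mask

def single_fdr(alpha, n_informative, n_features, rng):
    # toy data: informative p-values near 0, null p-values uniform on [0, 1]
    p = rng.uniform(size=n_features)
    p[:n_informative] = rng.uniform(high=1e-4, size=n_informative)
    sel = bh_select(p, alpha)
    fp = sel[n_informative:].sum()
    tp = sel[:n_informative].sum()
    return fp / max(tp + fp, 1)

alpha, rng = 0.1, np.random.default_rng(0)
fdr = np.mean([single_fdr(alpha, 25, 50, rng) for _ in range(100)])
print(f"empirical FDR: {fdr:.3f}")
```

With independent tests, BH keeps the expected FDR at or below `alpha * n_null / n_features`, so the empirical mean should stay under `alpha` here.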
Did it not pass without that change?
Yes, with range(30) the test does not pass.
```diff
-        selected = sv[sv <= float(self.alpha) / n_features
-                      * np.arange(n_features)]
+        selected = sv[sv <= float(self.alpha) / n_features *
+                      (np.arange(n_features) + 1)]
```
should be arange(1, n_features+1)
done
In any case, I don't see how this test ensures that your change works. Could you please add a test that clearly would fail at master?
Such a test could be that, provided with only one p-value, the function returns the appropriate answer (i.e., in this case, if the p-value is small enough but not 0, it should be selected; in master, it would only be selected if the p-value is equal to 0, but now it is selected if the p-value is smaller than alpha).
Thanks @NelleV, I agree with you. This test is very simple and direct; is it necessary?
Yes, it is necessary.
Hi @jnothman, sorry for the late reply (national holiday). I propose the following test to show that our change works:

```python
def test_select_fdr_regression2():
    n_informative = 5  # must be defined for the slicing below
    X, y = make_regression(n_samples=150, n_features=20,
                           n_informative=5, shuffle=False,
                           random_state=0, noise=10)
    with warnings.catch_warnings(record=True):
        # Warnings can be raised when no features are selected
        # (low alpha or very noisy data)
        univariate_filter = SelectFdr(f_regression, alpha=0.01)
        univariate_filter.fit(X, y)
        support = univariate_filter.get_support()
        num_false_positives = np.sum(support[n_informative:] == 1)
        num_true_positives = np.sum(support[:n_informative] == 1)
        assert_equal(num_false_positives, 5)
        assert_equal(num_true_positives, 7)
```

What do you think about it?
@mpjlu please format your code using backticks (I did it for you this time).
I don't understand the test. The FDR in this case is 5/12, which is about 0.5, while the test claims to control it at .01? Or am I misunderstanding something here?
Hi @amueller, the code in the last comment is just a code template, not the real data. Sorry for the misunderstanding.
ah...
Hi @jnothman, do you think I should add a test case (by manually constructing the sample data) that clearly would fail at master? Is there any other work I need to do for this PR?
You should construct a test case that ensures that the implementation does what it should. Ideally it would exercise the off-by-one distinction that is presently violated.
Hi @jnothman, I added a test case for this PR. For the sample data of the test, the master code cannot select exactly one feature: it selects either two features or none. With this PR it can select 0, 1 or 2 features depending on alpha.
```python
# by PR #7490, the result is array([True, False])
X = np.array([[10, 20], [20, 20], [20, 30]])
y = np.array([[1], [0], [0]])
univariate_filter = SelectFdr(chi2, alpha=0.1)
```
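To see why two features make the off-by-one visible, consider the masking logic alone (a standalone sketch with hypothetical p-values, not the actual chi2 output for the data above): with n = 2, the old ranks are [0, 1], so the thresholds are [0, alpha/2] and the filter can never select exactly one feature with a nonzero p-value; with ranks [1, 2] it can.

```python
import numpy as np

def fdr_support(pvalues, alpha, fixed=True):
    """SelectFdr-style support mask (sketch of _get_support_mask)."""
    n = len(pvalues)
    sv = np.sort(pvalues)
    ranks = np.arange(n) + 1 if fixed else np.arange(n)
    selected = sv[sv <= alpha / n * ranks]
    if selected.size == 0:
        return np.zeros(n, dtype=bool)
    return pvalues <= selected.max()

p = np.array([0.01, 0.8])  # hypothetical p-values for two features
print(fdr_support(p, alpha=0.1, fixed=False))  # [False False]
print(fdr_support(p, alpha=0.1, fixed=True))   # [ True False]
```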
can you please assert that the p-values are the ones you specify above?
Yes, the p-values in comments are real p-values.
Do you mean I should assert the p-values in the test code?
yes please.
Sorry, what does "add an entry to what's new" mean?
Hi, since it is a bug fix, I suppose you can add an entry in what's new under the bug fixes section. Hope it helps.
Thanks @maniteja123
@@ -44,6 +44,10 @@ Enhancements
Bug fixes
.........

- :class:`sklearn.feature_selection.SelectFdr` now correctly works
A more descriptive message to tell people what was broken and is now fixed would be helpful. Often we use a message like "Fixed a bug where :class:`feature_selection.SelectFdr` did not ..."
How about "Fixed a bug where :class:`feature_selection.SelectFdr` did not accurately implement the Benjamini-Hochberg procedure"?
@@ -44,6 +44,11 @@ Enhancements
Bug fixes
.........

- Fix a bug where :class:`sklearn.feature_selection.SelectFdr` did not
  exactly implement Benjamini-Hochberg procedure
That's better, but indicating that it formerly may have selected fewer features than it should (and now does) would be more helpful for former users of this estimator.
updated, thanks.
Seems there is a build environment problem? Any comments on this? Thanks.
Hi, yeah, you are right. It looks like that was some appveyor issue. I suppose it is fine since all the other builds succeed.
Thanks @maniteja123
Looks good to me apart from adding the link.
  exactly implement Benjamini-Hochberg procedure. It formerly may have
  selected fewer features than it should.
  (`#7490 <https://github.com/scikit-learn/scikit-learn/pull/7490>`_) by
  `Peng Meng`_.
You need to add yourself at the bottom of the page if you want your name to be a link.
Done, thanks very much.
# Conflicts: # doc/whats_new.rst
* tag '0.18.1': (144 commits)
  - skip tree-test on 32bit
  - do the warning test as we do it in other places.
  - Replase assert_equal by assert_almost_equal in cosine test
  - version bump 0.18.1
  - fix merge conflict mess in whatsnew
  - add the python2.6 warning to 0.18.1
  - fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR.
  - sync whatsnew with master
  - [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553)
  - FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680)
  - Add whats new entry for scikit-learn#6282 (scikit-learn#7629)
  - [MGR + 2] fix selectFdr bug (scikit-learn#7490)
  - fixed whatsnew cherry-pick mess (somewhat)
  - [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874)
  - [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764)
  - [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824)
  - Fix docstring typo (scikit-learn#7844) n_features --> n_components
  - [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818)
  - [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799)
  - DOC : fix docstring of AIC/BIC in GMM
  - ...
Reference Issue
Fixes #7474
What does this implement/fix? Explain your changes.
Description
SelectFdr in `sklearn/feature_selection/univariate_selection.py` computes the selection thresholds with `np.arange(n_features)`.
It should be `np.arange(n_features) + 1`:
because `np.arange` starts from 0, while the ranks in the Benjamini-Hochberg procedure should start from 1.
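A quick numpy check of the indexing (the `arange(1, n_features + 1)` form was suggested in review and is equivalent):

```python
import numpy as np

n_features = 5
print(np.arange(n_features))          # [0 1 2 3 4] -> rank 0 yields a zero threshold
print(np.arange(n_features) + 1)      # [1 2 3 4 5]
print(np.arange(1, n_features + 1))   # [1 2 3 4 5] -> same ranks, suggested spelling
```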
Any other comments?