[MRG+2] Enable grid search with classifiers that may fail on individual fits. #2795
Conversation
    self.parameter = parameter

    def fit(self, X, y=None):
        raise ValueError("Failing classifier failed as requiered")
Instead do:
if self.parameter:
raise ...
return self
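A minimal sketch of the suggested pattern (names are illustrative; the real test classifier would subclass sklearn's BaseEstimator): fail only when configured to, and otherwise return self so the mock still behaves like a valid estimator.

```python
class FailingClassifier:
    """Mock classifier that raises ValueError on fit() only when asked to."""

    def __init__(self, parameter=None):
        self.parameter = parameter

    def fit(self, X, y=None):
        if self.parameter:
            raise ValueError("Failing classifier failed as required")
        return self  # fit() must return self, per the estimator contract
```

This lets the test cover both the failing and the non-failing path with one class.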
"requiered" -> "required"
What does raise ... do?
I was writing shorthand. I meant that your test would be more robust if the classifier didn't always fail, so you can check it doesn't give zero score to all parameters on failure
I think you might as well warn always; use a custom warning class and the user can ignore it with a warnings filter.
    if return_train_score:
        train_score = fit_exception_score
    warnings.warn("Classifier fit failed. The score on this train-test"
                  " partition will be set to zero. Details: " +
'zero' should be variable: format with %0.1f, and %r for the exception.
I made some changes according to the feedback received here. Meanwhile, I'm also having problems with some tests failing but these tests seem to be unrelated to the code I changed:
    train_score = exception_score
    warnings.warn("Classifier fit failed. The score on this train-test"
                  " partition will be set to %f. Details: \n%r" %
                  (exception_score, e), RuntimeWarning)
Please sub-class RuntimeWarning so that the user can easily filter these warnings.
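A sketch of what subclassing could look like (the FitFailedWarning name and the helper function are illustrative, not the PR's actual code): a dedicated RuntimeWarning subclass lets users silence exactly these warnings without hiding other RuntimeWarnings.

```python
import warnings


class FitFailedWarning(RuntimeWarning):
    """Warning raised when an estimator fails to fit during cross-validation."""


def report_fit_failure(score, exc):
    # Emit the substituted-score warning with the filterable subclass.
    warnings.warn("Classifier fit failed. The score on this train-test"
                  " partition will be set to %f. Details: \n%r"
                  % (score, exc), FitFailedWarning)
```

A user could then call `warnings.simplefilter('ignore', FitFailedWarning)` to suppress only these messages.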
I'm still not convinced you need separate parameters 'skip_exceptions' and 'exception_score', but all in all this is looking good. Does someone else want to pitch in? @ogrisel, you've mentioned failing CV before...
@jnothman, I subclassed RuntimeWarning as you suggested. Any more ideas for improvement?
I agree. Fewer parameters are better.
Can you explain how I can do this with one parameter?
For example: GridSearchCV(on_error='raise') will do what the code currently does: raise. GridSearchCV(on_error=0) will return 0 in place of the breaking fold. This assumes no one uses a class label called 'raise', but that may be a reasonable assumption.
@jnothman @GaelVaroquaux I'm not too keen on having one parameter with a special value but I changed it as you requested. Anything else?
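The overloaded-parameter design being discussed can be sketched in a few lines (the function name is illustrative; it is not the PR's code): the sentinel string 'raise' re-raises the fit error, and any numeric value is substituted as the score.

```python
def score_or_substitute(fit_and_score, error_score='raise'):
    """Run fit_and_score(); on failure, either re-raise or substitute a score.

    error_score: the string 'raise' (re-raise the exception) or a numeric
    value to return in place of the score for the failing fold.
    """
    try:
        return fit_and_score()
    except Exception:
        if error_score == 'raise':
            raise
        return error_score
```

Since a score is always numeric, the string sentinel cannot collide with a legitimate score value, which is the point made above in defence of overloading.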
@@ -1142,6 +1147,10 @@ def _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters,

    verbose : integer
        The verbosity level.

    on_error : numeric or 'raise'
I think that I would call this parameter 'error_score'. When set to 'raise' it would indeed raise the exception.
@jnothman : what do you think of the proposed name?
I don't mind what we call it.
I'm ok with error_score.
I am OK with the general lines of this PR. I made just a few minor comments; the only one that might be a bit subject to discussion is the proposed name change for the argument. Also, could you rebase your branch on master: it is not mergeable in the current state. Thanks a lot for your work.
Hi @GaelVaroquaux and @jnothman, I'm back to working on this after a long-ish break. I made the requested changes but it looks like something went wrong when I was trying to figure out how to do the rebase. Can you tell me if it's ok, and if not, what I should do?
I've not looked at the damage (though it looks like you've done a merge rather than a rebase), but you can just do the rebase again if it's not arduous.
Rebase done.
    def _fit_and_score(estimator, X, y, scorer, train, test, verbose,
                       parameters, fit_params, return_train_score=False,
                       return_parameters=False, skip_exceptions=False,
You still have the skip_exceptions parameter though it does nothing.
We have parameter overloading all over the place in scikit-learn. In some places it's arguably more problematic than here, where 'raise' simply can't be a valid score, and score is irrelevant if an error is to be raised.
Thanks @jnothman for reviewing my code. I corrected the mistakes you pointed out and I hope it's all well now.
@@ -1141,6 +1145,11 @@ def _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters,

    verbose : integer
        The verbosity level.

    error_score : numeric or 'raise'
Could you make this 'raise' (default) or numeric.
@@ -1141,6 +1145,12 @@ def _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters,

    verbose : integer
        The verbosity level.

    error_score : 'raise' (default) or numeric
        Value to assign to the score if an error occurs in estimator fitting.
        If set to 'raise', the error is raised. If a numeric value is given,
What does the numerical value mean? Why a numerical value?
This is stated on the line before: "Value to assign to the score if an error occurs in estimator fitting."
But yes, the ordering makes it a little unclear. Perhaps "otherwise" would suffice. I wish there were also a better word than "raise" for warnings.
OK, maybe change the ordering of the lines in the docstring. But feel free to ignore this comment.
@romaniukm: sorry for the slow reply, it's just that I am swamped. The amount of activity that scikit-learn pull requests create is crazy, and it all adds to my day work.
@@ -1192,13 +1202,25 @@ def _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters,

    X_train, y_train = _safe_split(estimator, X, y, train)
    X_test, y_test = _safe_split(estimator, X, y, test, train)
    if y_train is None:
        estimator.fit(X_train, **fit_params)
You've lost the lines calling the estimator without y_train when y was None. Do we have a clear view of the consequences? If not, I'd rather keep them.
Yes, this is fair enough: we can't be sure that all unsupervised downstream estimators have a second argument.
You mean to replace this with the following?

    if y_train is None:
        estimator.fit(X_train, **fit_params)
    else:
        estimator.fit(X_train, y_train, **fit_params)
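Putting the two review threads together, the fit step could look roughly like this (a simplified sketch, not the actual scikit-learn implementation; the function name is illustrative): keep the y-aware branching for unsupervised estimators, and wrap the fit in the error_score handling discussed above.

```python
import warnings


def fit_with_error_score(estimator, X_train, y_train, error_score='raise',
                         **fit_params):
    """Fit, preserving the y=None branch; on failure, raise or substitute.

    Returns None on success (the caller scores the fitted estimator),
    or the substituted error_score when fitting fails.
    """
    try:
        if y_train is None:
            # Unsupervised estimators may not accept a second argument.
            estimator.fit(X_train, **fit_params)
        else:
            estimator.fit(X_train, y_train, **fit_params)
    except Exception as e:
        if error_score == 'raise':
            raise
        warnings.warn("Classifier fit failed. The score on this train-test"
                      " partition will be set to %f. Details: \n%r"
                      % (error_score, e), RuntimeWarning)
        return error_score
    return None
```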
OK, aside from the two comments above (y_train being None and the elif clause) this is good to go as far as I am concerned. Good job!
@GaelVaroquaux I rebased and made the changes as requested. I also decided to do some blatant self-promotion by adding my name to the contributors list in whats_new.rst ... I'm not sure though if I contributed enough so it's up to you to leave it or delete it. Thanks for the feedback!
The Travis CI build seems to be failing, but only in one configuration. On my machine, two tests fail, but I didn't change those files and they seem to be completely unrelated to my code:

======================================================================
FAIL: sklearn.feature_extraction.tests.test_image.test_connect_regions
----------------------------------------------------------------------
Traceback (most recent call last):
File "/vol/medic02/users/mpr06/anaconda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/vol/medic02/users/mpr06/sklearn-dev/anaconda/github/scikit-learn/sklearn/feature_extraction/tests/test_image.py", line 63, in test_connect_regions
assert_equal(ndimage.label(mask)[1], connected_components(graph)[0])
AssertionError: 777 != 767
'777 != 767' = '%s != %s' % (safe_repr(777), safe_repr(767))
'777 != 767' = self._formatMessage('777 != 767', '777 != 767')
>> raise self.failureException('777 != 767')
======================================================================
FAIL: sklearn.feature_extraction.tests.test_image.test_connect_regions_with_grid
----------------------------------------------------------------------
Traceback (most recent call last):
File "/vol/medic02/users/mpr06/anaconda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/vol/medic02/users/mpr06/sklearn-dev/anaconda/github/scikit-learn/sklearn/feature_extraction/tests/test_image.py", line 70, in test_connect_regions_with_grid
assert_equal(ndimage.label(mask)[1], connected_components(graph)[0])
AssertionError: 777 != 767
'777 != 767' = '%s != %s' % (safe_repr(777), safe_repr(767))
'777 != 767' = self._formatMessage('777 != 767', '777 != 767')
>> raise self.failureException('777 != 767')
Can you please try to rebase your whole branch on top of the current master. I think those failures have been fixed there in the meantime.
@ogrisel So I rebased and the Travis build passed this time, but those two tests still fail on my machine... Do you know where I can find some information about what causes this problem?
Does [MRG+1] mean it's planned to be merged in a future release?
MRG+1 means one reviewer has said this should be accepted. Generally, a PR requires a second +1 before merging.
    ])
    gs = GridSearchCV(p, {'classifier__foo_param': [1, 2, 3]}, cv=2).fit(X, y)

    class FailingClassifier(BaseEstimator):
        """ Classifier that raises a ValueError on fit() """
cosmetics: please remove the whitespaces at the beginning and the end of the docstring to be consistent with the PEP 257 style.
Please update the what's new file to move that change to the block of the 0.16 release (you might need to rebase again on current master first). Also please squash this PR as a single commit. Then +1 for merge on my side as well.
Pick the first commit and squash the others, and put a descriptive commit message.
@jnothman I'm not sure how your DOC commit ended up in my local master but not the github one... Was it rolled back? In that case, what should I do about it?
Somehow you have cherry-picked in that commit (or based on top of it); in master it includes a merge commit that appears to be absent here. You should be able to do a rebase on an updated master, but don't worry about it too much.
It looks like this branch went out of sync with master again...
Okay. This PR has 3 +1s, and the title should reflect this... I'll try to rebase and merge today.
Merged as 58be184. Thanks @romaniukm (and for your patience!)
Supersedes #2587. The old pull request was so outdated (and the functionality affected was moved to different files) that I decided to re-do the work starting with a fresh checkout from master. Since this is technically a new branch, I'm opening a new pull request.