Timing and training score in *SearchCV.results_ #7026
Conversation
Now *SearchCV.results_ includes both timing and training scores,
and a new doctest (sklearn/model_selection/_search.py).
        'test_std_score' : [0.02, 0.01, 0.03, 0.03],
        'test_rank_score' : [2, 4, 3, 1],
        'params' : [{'kernel': 'poly', 'degree': 2}, ...],
        }

    NOTE that the key ``'params'`` is used to store a list of parameter
-   settings dict for all the parameter candidates.
+   settings dict for all the parameter candidates. Besides,
+   'train_mean_score', 'train_split*_score', ... will be present when
Please follow the style of using double backticks to indicate code.
Just noting that it should be backticks surrounding the single-quoted key: ``'train_mean_score'`` (which will display as 'train_mean_score').
Thanks for the PR!!
                np.average((time - time_means[:, np.newaxis]) ** 2,
                           axis=1, weights=weights))
        if self.return_train_score:
            train_means = np.average(train_scores, axis=1, weights=weights)
It doesn't make sense to weight the train scores by the number of test samples. Perhaps we should weight by the number of training samples when ``iid=True``, or report an unweighted average. For times, weighting by training instances is more appropriate than by test, but leaving it unweighted might be best.
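For illustration, a minimal numpy sketch of the two averaging options being discussed (the array names here are made up for the example, not the PR's actual variables):

```python
import numpy as np

# Hypothetical per-candidate, per-split training scores (2 candidates, 3 splits)
train_scores = np.array([[0.91, 0.91, 0.95],
                         [0.84, 0.86, 0.88]])
# Number of test samples in each split -- the weights currently under discussion
n_test_samples = np.array([34, 33, 33])

# Current behaviour being questioned: weight by test-fold sizes (as iid=True does for test scores)
weighted = np.average(train_scores, axis=1, weights=n_test_samples)

# Alternative raised above: a plain unweighted mean across splits
unweighted = train_scores.mean(axis=1)

print(weighted, unweighted)
```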
I agree. Also, I think ``iid`` is a bad hack and we need to fix the scorers to accept both training and test data to solve this properly :-/
Sorry if I'm a bit slow here, but please elucidate at some point. What's ``iid`` trying to hack? Why does accepting training data in the scorer help?
It's trying to hack the R^2, IIRC. @agramfort can say more about this, I think.
The way we compute R^2 is weird because we are using the test set mean, not the training set mean. Though actually, after thinking about it more, maybe accepting training data is too big a change just for that. I thought it would also resolve some unsupervised issues, but maybe it doesn't.
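For context, a rough sketch of the distinction being raised: R^2 normalises the residual error by the variance around some baseline mean, and the question is which mean to use. This is illustrative only, not the scorer's actual code:

```python
import numpy as np

def r2_with_baseline_mean(y_true, y_pred, baseline_mean):
    """R^2 where the 'null model' predicts baseline_mean for every sample."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - baseline_mean) ** 2)
    return 1.0 - ss_res / ss_tot

y_train = np.array([1.0, 2.0, 3.0, 4.0])
y_test = np.array([2.5, 3.5])
y_pred = np.array([2.4, 3.7])

# Behaviour described above: baseline is the test-set mean
r2_test_mean = r2_with_baseline_mean(y_test, y_pred, y_test.mean())
# Alternative the comment alludes to: baseline is the training-set mean
r2_train_mean = r2_with_baseline_mean(y_test, y_pred, y_train.mean())
print(r2_test_mean, r2_train_mean)
```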
Perhaps we should avoid code duplication by defining
@@ -1104,4 +1155,4 @@ def fit(self, X, y=None, labels=None):
         sampled_params = ParameterSampler(self.param_distributions,
                                           self.n_iter,
                                           random_state=self.random_state)
-        return self._fit(X, y, labels, sampled_params)
+        return self._fit(X, y, labels, sampled_params)
You have strangely removed the newline character from the end of this file.
I've renamed the PR to something clearer.
@@ -371,7 +371,7 @@ class BaseSearchCV(six.with_metaclass(ABCMeta, BaseEstimator,
     def __init__(self, estimator, scoring=None,
                  fit_params=None, n_jobs=1, iid=True,
                  refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs',
-                 error_score='raise'):
+                 error_score='raise', return_train_score=False):
I know it was in the issue, but I'm not sure why we want to make that optional / not the default. Computing the training scores is really cheap compared to everything else that goes on. There are already a lot of parameters, not sure if this one is particularly helpful. @jnothman ?
I guess making it optional is helpful if you have a very large training set on which evaluation takes a long time because of a very expensive evaluation procedure. But I think we should default to ``True`` both for convenience and discoverability.
It would be nice to add the example of #1742 but I think that's not that important. LGTM.
Wait, sorry, I thought this was only the timing. For the training scores, we should really add an example and tests.
@eyc88 I think you said you might not have time to work on this now. Should someone else take over?
@amueller If you just want to test that the training score is less than or equal to 1, I can deliver it soon (one thing I would like to respond to @ogrisel about is that the training score can be exactly 1). I am travelling now, so there could be some delay. All that being said, I don't mind if someone wants to take it over.
1. Check that test_rank_score is always >= 1.
2. Check that all regular scores (test/train_mean/std_score) and timing are >= 0.
3. Check that all regular scores are <= 1.

Note that timing can be greater than 1 in general, and the std of regular scores is always <= 1 because the scores are bounded between 0 and 1.
@amueller Check this out. Let me know if this is what you had in mind. We can discuss.
A test for training score would usually require a more explicit test that we're reporting the right number. We want, for instance, the tests to be fairly robust to this all being reimplemented.
Yeah, so a good test would be to instantiate a cross-validation object, do the calculations "by hand" for some simple example, and check that the reported values match.
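Something along these lines, perhaps (a sketch only; the attribute and key names follow this PR's proposed ``results_`` layout and may differ in the final implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, random_state=0)
cv = KFold(n_splits=3)

search = GridSearchCV(SVC(), {'C': [0.1, 1.0]}, cv=cv,
                      return_train_score=True)
search.fit(X, y)

# Recompute the training score of the first candidate (C=0.1) "by hand".
expected = []
for train_idx, _ in cv.split(X, y):
    clf = SVC(C=0.1).fit(X[train_idx], y[train_idx])
    expected.append(clf.score(X[train_idx], y[train_idx]))

# Folds are equal-sized here, so weighted and unweighted means agree.
assert np.isclose(search.results_['train_mean_score'][0], np.mean(expected))
```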
Sounds good. I will take a closer look and write a new version of the test once I get some free time.
@jnothman When are you guys rolling out 0.18? I can squeeze out some bandwidth if the deadline is approaching...
Within the next week or two.
I think tests were my only issue, and I'd love to see this in 0.18.
@eyc88 Do you mind if I take this over?
@raghavrv No.
Closing in favour of #7325.
Reference Issue
Resolves #6894 (Timing in *SearchCV.results_) and
#6895 (Training Score in *SearchCV.results_).
What does this implement/fix? Explain your changes.
The _fit method in BaseSearchCV calls the _fit_and_score function from the _validation module.
_fit_and_score is able to report train_score, but does not do so by default. On the other hand,
the __init__ of BaseSearchCV did not accept return_train_score as an optional keyword. As a result, there was no way to have the results_ attribute receive the training scores.
The fix is to add the keyword return_train_score to the __init__ of BaseSearchCV (which is inherited by both GridSearchCV and RandomizedSearchCV). The default is still False. When return_train_score is set to True, a set of additional steps puts train_score into results_.
The timing information is always available but was not previously written into results_. This patch also writes the timing info into results_ by default.
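As a rough usage sketch of what this enables (attribute and key names as proposed in this PR; the exact naming may still change):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()

search = GridSearchCV(SVC(), {'kernel': ['linear', 'rbf'], 'C': [1, 10]},
                      return_train_score=True)
search.fit(iris.data, iris.target)

# Timing is reported unconditionally; train_* keys appear only with return_train_score=True.
print(sorted(search.results_.keys()))
```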
Any other comments?
The test is also rewritten. In the test I explicitly choose return_train_score=True for better coverage.