Skip to content

[MRG] Refactor CV and grid search #2736

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

AlexanderFabisch
Copy link
Member

While implementing #2701 I have seen some duplicate code in grid_search and cross_validation. Before I will implement the rest of issue #2584 I would like to clean that up a little bit.

Todo:

  • merge cross_validation.cross_val_score and grid_search.fit_grid_point

@jnothman
Copy link
Member

jnothman commented Jan 9, 2014

Thank you! I've been trying to get to another attempt at this. I can't review right away, but I'll get there soon.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling c4d6278 on AlexanderFabisch:refactor_cv into b3a9da9 on scikit-learn:master.

@AlexanderFabisch
Copy link
Member Author

Are there other related pull requests like #2079?

@jnothman
Copy link
Member

Are there other related pull requests like #2079?

Its predecessor, #1787, also did this refactoring. They also hoped to make a consistent extensible output from cross_val_score and grid_search; and in #2079 to refactor their parallelism. Keeping the scope of this PR small is wise :)

I also consider this related to some of the CV API consistency stuff @eshilts is propsing in #2733.



def _fit(fit_function, X_train, y_train, **fit_params):
"""Fit and estimator on a given training set."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"and" -> "an"

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 5e52031 on AlexanderFabisch:refactor_cv into b3a9da9 on scikit-learn:master.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 4b5f468 on AlexanderFabisch:refactor_cv into b3a9da9 on scikit-learn:master.

@AlexanderFabisch
Copy link
Member Author

Now, fit_grid_point calls _cross_val_score.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 30c86ea on AlexanderFabisch:refactor_cv into b3a9da9 on scikit-learn:master.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 30c86ea on AlexanderFabisch:refactor_cv into b3a9da9 on scikit-learn:master.

score = _score(estimator, X_test, y_test, scorer)

scoring_time = time.time() - start_time

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You appear to have dropped the verbose output here. Why not report time and score in _cross_val_score?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 389ed8d on AlexanderFabisch:refactor_cv into b3a9da9 on scikit-learn:master.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 1fa3ec3 on AlexanderFabisch:refactor_cv into b3a9da9 on scikit-learn:master.

@jnothman
Copy link
Member

I think this is looking very good, so I'll just do some nitpicking :P

@@ -198,6 +198,69 @@ def get_scorer(scoring):
return scorer


class _passthrough_scorer(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is how I named it, but I think others will prefer camel-case: _PassthroughScorer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that would be a convention. :) I will change that.

@ogrisel
Copy link
Member

ogrisel commented Jan 15, 2014

Apart from the previous comments on error messages checks in the tests and private helpers usage and naming, this PR LGTM.

return np.array(scores)[:, 0]


def _cross_val_score(estimator, X, y, scorer, train, test,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this function needs renaming. I called it originally "_cross_val_score" because it was the inner loop of the cross_val_score function, but now that it is used in a different setting, I find that the name is confusing, as it doesn't really do a cross-validation. Maybe something like "_fit_and_score"?

PS: Sorry for nitpicking about the names, but they are important as they control the readability of a codebase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said before: I like the nitpicking. It is especially helpful to find meaningful names and naming is the most important thing to create readable software. I think you do a great job. :)

I renamed it to fit_and_score.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 33994f0 on AlexanderFabisch:refactor_cv into b3a9da9 on scikit-learn:master.

return np.array(scores)[:, 0]


def fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GaelVaroquaux, do you think fit_and_score should be public? Personally, I don't think either it or fit_grid_point should be public (and while not _-prefixed, fit_grid_point isn't in classes.rst, nor is it any longer the right name for that function). It makes it much harder to add functionality to grid search etc. since every modification to its output is an API change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I don't understand what the objection is to calling this cross-validation... I think that's what fitting on one chunk of data and scoring on another is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on using private functions

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 8bc79b3 on AlexanderFabisch:refactor_cv into b3a9da9 on scikit-learn:master.

@mblondel
Copy link
Member

I will work on addressing the final comments and merging.

@mblondel
Copy link
Member

I merged by rebase.

I haven't made fit_grid_point private yet. It has been there for a long time so I would prefer to wait for other opinions before changing anything. If we decide to make it private, we can keep the public one for two releases.

@mblondel mblondel closed this Jan 16, 2014
@jnothman
Copy link
Member

Hurrah! Thank you @AlexanderFabisch for some long-needed maintenance, and @mblondel for pushing it through. I look forward to your and other extensions to this to get more returned from _fit_and_score.

@mblondel
Copy link
Member

Indeed, thanks for your recent contributions, @AlexanderFabisch.

@mblondel
Copy link
Member

I removed the _score utility function in my multiple_grid_search branch:
mblondel@a756083

I think this makes the code better.


Returns
-------
test_score : float
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I messed this up. The order of test_score and train_score is wrong in the docstring.

@AlexanderFabisch
Copy link
Member Author

Oh, I did not expect the PR to get merged so soon after all the difficulties. Unfortunately I found the bug in the documentation too late. @mblondel could you fix this?

Anyway, thanks for the compliments. :)

@mblondel
Copy link
Member

I just realized that fit_grid_point is no longer used anywhere in the code base. Shall we remove it then?

@jnothman
Copy link
Member

+1 to deprecate in case it's in private use

On 16 January 2014 19:20, Mathieu Blondel notifications@github.com wrote:

I just realized that fit_grid_point is no longer used anywhere in the
code base. Shall we remove it then?


Reply to this email directly or view it on GitHubhttps://github.com//pull/2736#issuecomment-32449619
.

@ogrisel
Copy link
Member

ogrisel commented Jan 16, 2014

Please deprecate it as this method had a public name (no leading underscore).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants