Rename grid_search module #1848

Closed · jnothman opened this issue Apr 9, 2013 · 34 comments · Fixed by #4294

@jnothman
Member

jnothman commented Apr 9, 2013

The grid_search module now supports a list of grids and a randomly sampled parameter space, and may in the future support other search algorithms. The shared purpose is tuning (or exploring) hyper-parameters under cross-validation. So perhaps the grid_search name should be deprecated and replaced with something like:

  • cv_search (or search_cv)
  • hyperparams
  • model_selection (thanks @amueller)
@amueller
Member

Generally I agree. Maybe also model_selection? That could be confusing with the cross_validation module, though.

I am always in favor of renaming sooner rather than later. Only when necessary, though ;)

Any thoughts by the others?

@larsmans
Member

+1 for model_selection. If that's confusing, what about param_search?

@jaquesgrobler
Member

Also like model_selection. 👍

@GaelVaroquaux
Member

+1 for model_selection. If that's confusing, what about param_search?

I like param_search: it's understandable by everybody.

However, we want a really long deprecation procedure with this. I was
giving a course to teachers with scikit-learn yesterday, and we had a
small fight with interface changes (in the test module).

@jnothman
Member Author

The only problem with param_search is that arguably that's what coordinate descent, etc. are doing... But I guess just as scikit-learn's set_params doesn't deal with coefs and other such learnt parameters, so too with param_search.

@jnothman
Member Author

The nice thing about model_selection is its resemblance to other modules/packages like cross_validation, feature_selection, feature_extraction.

@jnothman
Member Author

So do we want to see sklearn.model_selection in the next release? [If we move to support heuristic searches (e.g. scipy.optimize) and other spaces (e.g. hyperopt), I also think this should move from being a module to a package, but API decisions for that will probably wait until the following release.]

@amueller
Member

param_search doesn't really roll off the tongue, though ;) Also, it is a shortening of nouns, which we generally avoid.

+1 for having it in the next release.
+1 for long deprecation cycle.

@jnothman
Member Author

So let's say we rename the module to model_selection for 0.14 (with the introduction of RandomizedSearchCV). Should we actually make it a package including modules cross_validation and search and perhaps pipeline? (and potentially metrics.scorer belongs here, too, but I'm uneasy about that.)

@larsmans
Member

Pipeline shouldn't go into model_selection. It's more broadly applicable.

@larsmans
Member

Moving to 0.15 because we need to pick a name, and 0.14 is due in less than a day.

@jnothman
Member Author

Agreed; I think model_selection was favoured, but I wonder if other stuff belongs under that heading.

@GaelVaroquaux
Member

Agreed; I think model_selection was favoured, but I wonder if other stuff belongs under that heading.

Yes, those were exactly our thoughts: we want to change a few more things, and we need a bit more time for this. But we want to head in this direction.

@amueller
Member

amueller commented Jan 5, 2014

I would retag this. We want to fix the return of multiple scores, but I don't see that happening for this release.

@GaelVaroquaux
Member

Agreed

@GaelVaroquaux
Member

Retagged

@jnothman
Member Author

How about:

sklearn/
  model_selection/
    partition.py / sample.py -- KFold, train_test_split, check_cv
    validate.py -- cross_val_score, permutation_test_score, learning_curve, validation_curve
    search.py -- GridSearchCV, RandomizedSearchCV
    scoring.py -- make_scorer, get_scorer, check_scorer, etc.
    utils.py -- ParameterGrid (may be used by validation_curve), ParameterSampler

@larsmans
Member

Sounds like a good module structure, but what would the imports look like? I.e. what's the public interface, the whole package or separate modules?

@jnothman
Member Author

I think most objects will be available at the top level, i.e. model_selection.__all__. IMO, it would be useful (and novel) to be able to enter:

from sklearn.model_selection import GridSearchCV, Bootstrap, make_scorer

for instance.
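
Concretely, the package __init__ could simply re-export from the submodules. A rough sketch (the module and symbol names follow the layout proposed above and are nothing final):

# sklearn/model_selection/__init__.py -- hypothetical layout
from .partition import KFold, train_test_split, check_cv
from .validate import (cross_val_score, permutation_test_score,
                       learning_curve, validation_curve)
from .search import GridSearchCV, RandomizedSearchCV
from .scoring import make_scorer, get_scorer, check_scorer
from .utils import ParameterGrid, ParameterSampler

__all__ = ['KFold', 'train_test_split', 'check_cv',
           'cross_val_score', 'permutation_test_score', 'learning_curve',
           'validation_curve', 'GridSearchCV', 'RandomizedSearchCV',
           'make_scorer', 'get_scorer', 'check_scorer',
           'ParameterGrid', 'ParameterSampler']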

@jnothman
Member Author

From a usage perspective, it would be nice to have pipeline in here as well, but it might be harder to justify from a naming perspective.

@GaelVaroquaux
Member

I think that this would be an interesting refactor. I think that the new
module structure would be better, because it would group together related
functionality.

The downside is that it breaks compatibility for cosmetic reasons, and that puts a burden on our users.

I would like to suggest that if we are going to break compatibility, we
should jump on the occasion to fix something that is actually a real
problem in the cross-validation objects. The problem is the following.

The CV objects need the data in order to be created. For instance, they need n_samples, or 'y' (as in the stratified objects). This means that they cannot be 'cloned' and reused in e.g. a nested cross-validation object.

A change of API could fix it in the following way:

  • CV objects would be instantiated only with data-independent parameters
    (say n_folds)
  • CV objects would expose a method that takes X and optionally y and
    returns an iterator with the current behavior. We could call it
    something like 'make_splits', as sketched below.
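
A minimal sketch of what that could look like (the names, including 'make_splits', are placeholders and not a settled API):

import numpy as np

class KFold(object):
    # Only data-independent parameters are given at construction time,
    # so the object can be cloned and reused, e.g. inside a nested CV.
    def __init__(self, n_folds=3, shuffle=False, random_state=None):
        self.n_folds = n_folds
        self.shuffle = shuffle
        self.random_state = random_state

    def make_splits(self, X, y=None):
        # Yield (train_indices, test_indices) pairs for this particular X.
        n_samples = X.shape[0]
        indices = np.arange(n_samples)
        if self.shuffle:
            np.random.RandomState(self.random_state).shuffle(indices)
        fold_sizes = np.full(self.n_folds, n_samples // self.n_folds, dtype=int)
        fold_sizes[:n_samples % self.n_folds] += 1
        current = 0
        for fold_size in fold_sizes:
            test = indices[current:current + fold_size]
            train = np.concatenate([indices[:current],
                                    indices[current + fold_size:]])
            current += fold_size
            yield train, test

Because nothing data-dependent is stored on the object, a grid search could call make_splits on whatever training fold it is handed.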

The refactor that you suggest would change the import path and thus
enable us to have a transition period between the current API, and the
new one.

What do people think?

@jnothman
Member Author

I don't think there's that great a problem moving things around at this stage in development; we'd leave placeholder modules where they were with deprecation warnings for a few releases.

But I have separately disliked the inconsistency of interface for cross validation generators. I would prefer it if data-dependent parameters weren't part of the constructor; and the generator parameters are too variable and un-memorable. For another example, train_test_split has a different return value format, so you need to wrap it in [] to use it as a cv parameter to cross_val_score, etc.

CV objects would expose a method that takes X and optionally y

One option is to just make GridSearchCV, cross_val_score, etc accept cv as a callable cv(X, y) -> iterable of pairs. check_cv already receives all necessary parameters to make that call.
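
For illustration only (KFold here is the existing sklearn.cross_validation.KFold, so the data-dependent construction is simply deferred until the data is available):

from sklearn.cross_validation import KFold

def cv(X, y=None):
    # Build the splitter only once X is known; the returned KFold object
    # is itself an iterable of (train_indices, test_indices) pairs.
    return KFold(len(X), n_folds=5)

# GridSearchCV / cross_val_score would then call cv(X, y) internally,
# e.g. (hypothetically) GridSearchCV(SVC(), param_grid, cv=cv).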

One shortcoming is the handling of CV generators with grouping or sampling information such as labels or sample weights. Are these constructor or call parameters?

@GaelVaroquaux
Member

I don't think there's that great a problem moving things around at this
stage in development; we'd leave placeholder modules where they were
with deprecation warnings for a few releases.

scikit-learn has been in intensive development for 4 years. It is used
massively in many places, including in production on some pretty
important services. We should not make backward incompatible changes
lightly.

CV objects would expose a method that takes X and optionally y

One option is to just make GridSearchCV, cross_val_score, etc accept cv
as a callable cv(X, y) -> iterable of pairs. check_cv already receives
all necessary parameters to make that call.

I tend to prefer methods to callable objects: it makes for more explicit
names and the code ends up being more readable.

One shortcoming is the handling of CV generators with grouping or sampling
information such as labels or sample weights. Are these constructor or call
parameters?

Certainly not call parameters, as in a subsampling they would vary. I
would think that they are arguments of the method, but if the method
returns indices, these can be applied to new arrays, which gives
additional freedom.

@jnothman
Member Author

We should not make backward incompatible changes lightly.

It's a deprecation, not a backward-incompatible change, for as long as it
needs to be...?

Certainly not call parameters, as in a subsampling they would vary. I would
think that they are arguments of the method, but if the method returns
indices, these can be applied to new arrays, which gives additional freedom.

Which means check_cv needs to take additional arguments, I guess.

@larsmans
Member

I would like to suggest that if we are going to break compatibility, we should jump on the occasion to fix something that is actually a real problem in the cross-validation objects.

How serious is this problem? It sounds like you have a very advanced use case in mind. While I sometimes do nested model selection, I typically do so in contexts where the pipeline/grid search workflow breaks down anyway (e.g. custom semi-supervised learning) so I personally wouldn't care much for in-library support.

Also, would this break custom CV objects such as my repeated CV for sequence data?

@glouppe
Contributor

glouppe commented Feb 11, 2014

Nested model selection is not an advanced use case. It is the only proper way to both select and evaluate your model without bias, which is what basically every ML paper should do when reporting results for an algorithm whose hyper-parameters have been tuned.

@larsmans
Member

It's the nesting that I consider advanced, not the idea of model selection by cross-validation. I guess I'm missing something here...

@glouppe
Contributor

glouppe commented Feb 11, 2014

It's the nesting that I consider advanced, not the idea of model selection by cross-validation. I guess I'm missing something here...

The CV scores that you optimize (e.g., with a grid-search procedure) are not an unbiased estimate of the generalization error of your model. If, as is often the case in papers, you want to both i) find the best parameters and ii) estimate the generalization error without bias, then you should run a train/valid/test protocol or do nested model selection.
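
For concreteness, nested model selection with the current objects can look like this (the toy data and parameter values are arbitrary):

import numpy as np
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score

X = np.random.randn(100, 5)
y = np.random.randint(0, 2, 100)

# Inner loop: choose hyper-parameters by cross-validated grid search.
search = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=3)

# Outer loop: estimate the generalization error of the *whole* selection
# procedure; these scores, not search.best_score_, are the unbiased estimate.
print(cross_val_score(search, X, y, cv=5).mean())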

@larsmans
Member

Ok, that's what you mean. I tend to use held-out test sets for final evaluation. I was thinking of nesting e.g. RFECV inside GridSearchCV.

Coming back to the discussion then, do any of you have an example of something that is not possible in the current API?

@jnothman
Member Author

You can't currently use: GridSearchCV(GridSearchCV(SVC(), cv=KFold(...)))
for arbitrary data because KFold needs to be constructed with the number of
samples known.
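
A sketch of what goes wrong (module and constructor names as in the current API; the last call is illustrative only):

from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold, cross_val_score

inner_cv = KFold(100, n_folds=3)   # n_samples has to be fixed here, up front
search = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=inner_cv)

# In an outer cross-validation each training fold has fewer than 100 samples,
# so the indices baked into inner_cv no longer match the data it is handed:
# cross_val_score(search, X, y, cv=5)   # breaks or silently misbehaves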

@larsmans
Member

Alright, +1 for a careful API change. It would be nicest if the old API just keeps working.

@GaelVaroquaux
Member

I agree that we should strive to have it working for quite a while.

I just want to sort out these issues before 1.0.

@larsmans
Member

Actually I'd like it most if the old API just kept working forever...

@GaelVaroquaux
Member

Actually I'd like it most if the old API just kept working forever...

It's broken. Really. It was a bad design. It makes it impossible to do
nested cross-validation. Sorry, I made a mistake.

We will need to stop supporting it at some point.
