Rename grid_search module #1848

Closed · jnothman opened this issue Apr 9, 2013 · 34 comments · Fixed by #4294

@jnothman
Member

jnothman commented Apr 9, 2013

The grid_search module now supports a list of grids and a randomly sampled parameter space, and may in the future support other search algorithms. The shared purpose is tuning (or exploring) hyper-parameters under cross-validation. So perhaps the grid_search name should be deprecated and replaced with something like:

  • cv_search (or search_cv)
  • hyperparams
  • model_selection (thanks @amueller)
@amueller
Member

Generally I agree. Maybe also model_selection? That could be confusing with the cross_validation module, though.

I am always in favor of renaming sooner rather than later. Only when necessary, though ;)

Any thoughts by the others?

@larsmans
Member

+1 for model_selection. If that's confusing, what about param_search?

@jaquesgrobler
Member

Also like model_selection. 👍

@GaelVaroquaux
Member

+1 for model_selection. If that's confusing, what about param_search?

I like param_search: it's understandable by everybody.

However, we want a really long deprecation procedure with this. I was
giving a course to teachers with scikit-learn yesterday, and we had a
small fight with interface changes (in the test module).

@jnothman
Member Author

The only problem with param_search is that arguably that's what coordinate descent, etc. are doing... But I guess just as scikit-learn's set_params doesn't deal with coefs and other such learnt parameters, so too with param_search.

@jnothman
Member Author

The nice thing about model_selection is its resemblance to other modules/packages like cross_validation, feature_selection, feature_extraction.

@jnothman
Member Author

So do we want to see sklearn.model_selection in the next release? [If we move to support heuristic searches (e.g. scipy.optimize) and other spaces (e.g. hyperopt), I also think this should move from being a module to a package, but API decisions for that will probably wait until the following release.]

@amueller
Member

param_search doesn't really roll off the tongue, though ;) Also, it is a shortening of nouns, which we generally avoid.

+1 for having it in the next release.
+1 for long deprecation cycle.

@jnothman
Member Author

So let's say we rename the module to model_selection for 0.14 (with the introduction of RandomizedSearchCV). Should we actually make it a package including modules cross_validation and search and perhaps pipeline? (and potentially metrics.scorer belongs here, too, but I'm uneasy about that.)

@larsmans
Member

Pipeline shouldn't go into model_selection. It's more broadly applicable.

@larsmans
Member

Moving to 0.15 because we need to pick a name, and 0.14 is due in less than a day.

@jnothman
Member Author

Agreed; I think model_selection was favoured, but I wonder if other stuff belongs under that heading.

@GaelVaroquaux
Member

Agreed; I think model_selection was favoured, but I wonder if other stuff belongs under that heading.

Yes, those were exactly our thoughts: we want to change a few more things, and we need a bit more time for this. But we want to head in this direction.

@amueller
Member

amueller commented Jan 5, 2014

I would retag this. We want to fix the return of multiple scores, but I don't see that happening for this release.

@GaelVaroquaux
Member

Agreed

@GaelVaroquaux
Member

Retagged

@jnothman
Member Author

How about:

sklearn/
  model_selection/
    partition.py / sample.py -- KFold, train_test_split, check_cv
    validate.py -- cross_val_score, permutation_test_score, learning_curve, validation_curve
    search.py -- GridSearchCV, RandomizedSearchCV
    scoring.py -- make_scorer, get_scorer, check_scorer, etc.
    utils.py -- ParameterGrid (may be used by validation_curve), ParameterSampler

@larsmans
Member

Sounds like a good module structure, but what would the imports look like? I.e. what's the public interface, the whole package or separate modules?

@jnothman
Member Author

I think most objects will be available at the top level, i.e. model_selection.__all__. IMO, it would be useful (and novel) to be able to enter:

from sklearn.model_selection import GridSearchCV, Bootstrap, make_scorer

for instance.
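
Concretely, the package __init__ could simply re-export from the submodules. A rough sketch (the module and symbol names follow the layout proposed above and are nothing final):

# sklearn/model_selection/__init__.py -- hypothetical layout
from .partition import KFold, train_test_split, check_cv
from .validate import (cross_val_score, permutation_test_score,
                       learning_curve, validation_curve)
from .search import GridSearchCV, RandomizedSearchCV
from .scoring import make_scorer, get_scorer, check_scorer
from .utils import ParameterGrid, ParameterSampler

__all__ = ['KFold', 'train_test_split', 'check_cv',
           'cross_val_score', 'permutation_test_score', 'learning_curve',
           'validation_curve', 'GridSearchCV', 'RandomizedSearchCV',
           'make_scorer', 'get_scorer', 'check_scorer',
           'ParameterGrid', 'ParameterSampler']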

@jnothman
Member Author

From a usage perspective, it would be nice to have pipeline in here as well, but it might be harder to justify from a naming perspective.

@GaelVaroquaux
Member

I think that this would be an interesting refactor. I think that the new
module structure would be better, because it would group together related
functionality.

The downside is that it breaks compatibility for cosmetic reasons, and that puts a burden on our users.

I would like to suggest that if we are going to break compatibility, we
should jump on the occasion to fix something that is actually a real
problem in the cross-validation objects. The problem is the following.

The CV objects need the data in order to be created. For instance, they need n_samples, or 'y' (as in the stratified objects). This means that they cannot be 'cloned' and reused in e.g. a nested cross-validation object.

A change of API could fix it in the following way:

  • CV objects would be instantiated only with data-independent parameters
    (say n_folds)
  • CV objects would expose a method that takes X and optionally y and
    returns an iterator with the current behavior. We could call it
    something like 'make_splits', as sketched below.
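
A minimal sketch of what that could look like (the names, including 'make_splits', are placeholders and not a settled API):

import numpy as np

class KFold(object):
    # Only data-independent parameters are given at construction time,
    # so the object can be cloned and reused, e.g. inside a nested CV.
    def __init__(self, n_folds=3, shuffle=False, random_state=None):
        self.n_folds = n_folds
        self.shuffle = shuffle
        self.random_state = random_state

    def make_splits(self, X, y=None):
        # Yield (train_indices, test_indices) pairs for this particular X.
        n_samples = X.shape[0]
        indices = np.arange(n_samples)
        if self.shuffle:
            np.random.RandomState(self.random_state).shuffle(indices)
        fold_sizes = np.full(self.n_folds, n_samples // self.n_folds, dtype=int)
        fold_sizes[:n_samples % self.n_folds] += 1
        current = 0
        for fold_size in fold_sizes:
            test = indices[current:current + fold_size]
            train = np.concatenate([indices[:current],
                                    indices[current + fold_size:]])
            current += fold_size
            yield train, test

Because nothing data-dependent is stored on the object, a grid search could call make_splits on whatever training fold it is handed.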

The refactor that you suggest would change the import path and thus
enable us to have a transition period between the current API, and the
new one.

What do people think?

@jnothman
Member Author

I don't think there's that great a problem moving things around at this stage in development; we'd leave placeholder modules where they were with deprecation warnings for a few releases.

But I have separately disliked the inconsistency of interface for cross validation generators. I would prefer it if data-dependent parameters weren't part of the constructor; and the generator parameters are too variable and un-memorable. For another example, train_test_split has a different return value format, so you need to wrap it in [] to use it as a cv parameter to cross_val_score, etc.

CV objects would expose a method that takes X and optionally y

One option is to just make GridSearchCV, cross_val_score, etc accept cv as a callable cv(X, y) -> iterable of pairs. check_cv already receives all necessary parameters to make that call.
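
For illustration only (KFold here is the existing sklearn.cross_validation.KFold, so the data-dependent construction is simply deferred until the data is available):

from sklearn.cross_validation import KFold

def cv(X, y=None):
    # Build the splitter only once X is known; the returned KFold object
    # is itself an iterable of (train_indices, test_indices) pairs.
    return KFold(len(X), n_folds=5)

# GridSearchCV / cross_val_score would then call cv(X, y) internally,
# e.g. (hypothetically) GridSearchCV(SVC(), param_grid, cv=cv).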

One shortcoming is the handling of CV generators with grouping or sampling information such as labels or sample weights. Are these constructor or call parameters?

@GaelVaroquaux
Member

I don't think there's that great a problem moving things around at this
stage in development; we'd leave placeholder modules where they were
with deprecation warnings for a few releases.

scikit-learn has been in intensive development for 4 years. It is used
massively in many places, including in production on some pretty
important services. We should not make backward incompatible changes
lightly.

CV objects would expose a method that takes X and optionally y

One option is to just make GridSearchCV, cross_val_score, etc accept cv
as a callable cv(X, y) -> iterable of pairs. check_cv already receives
all necessary parameters to make that call.

I tend to prefer methods to callable objects: it makes for more explicit
names and the code ends up being more readable.

One shortcoming is the handling of CV generators with grouping or sampling
information such as labels or sample weights. Are these constructor or call
parameters?

Certainly not call parameters, as in a subsampling they would vary. I
would think that they are arguments of the method, but if the method
returns indices, these can be applied to new arrays, which gives
additional freedom.

@jnothman
Member Author

We should not make backward incompatible changes lightly.

It's a deprecation, not a backward-incompatible change, for as long as it
needs to be...?

Certainly not call parameters, as in a subsampling they would vary. I would
think that they are arguments of the method, but if the method returns
indices, these can be applied to new arrays, which gives additional freedom.

Which means check_cv needs to take additional arguments, I guess.

@larsmans
Member

I would like to suggest that if we are going to break compatibility, we should jump on the occasion to fix something that is actually a real problem in the cross-validation objects.

How serious is this problem? It sounds like you have a very advanced use case in mind. While I sometimes do nested model selection, I typically do so in contexts where the pipeline/grid search workflow breaks down anyway (e.g. custom semi-supervised learning) so I personally wouldn't care much for in-library support.

Also, would this break custom CV objects such as my repeated CV for sequence data?

@glouppe
Contributor

glouppe commented Feb 11, 2014

Nested model selection is not an advanced use case. It is the only proper way to both select and evaluate your model without bias, which is what basically every ML paper should do when reporting results for an algorithm whose hyper-parameters have been tuned.

@larsmans
Member

It's the nesting that I consider advanced, not the idea of model selection by cross-validation. I guess I'm missing something here...

@glouppe
Contributor

glouppe commented Feb 11, 2014

It's the nesting that I consider advanced, not the idea of model selection by cross-validation. I guess I'm missing something here...

The CV scores that you optimize (e.g., with a grid-search procedure) are not an unbiased estimate of the generalization error of your model. If, as is often the case in papers, you want to both i) find the best parameters and ii) estimate the generalization error without bias, then you should run a train/valid/test protocol or do nested model selection.
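
For concreteness, nested model selection with the current objects can look like this (the toy data and parameter values are arbitrary):

import numpy as np
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score

X = np.random.randn(100, 5)
y = np.random.randint(0, 2, 100)

# Inner loop: choose hyper-parameters by cross-validated grid search.
search = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=3)

# Outer loop: estimate the generalization error of the *whole* selection
# procedure; these scores, not search.best_score_, are the unbiased estimate.
print(cross_val_score(search, X, y, cv=5).mean())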

@larsmans
Member

Ok, that's what you mean. I tend to use held-out test sets for final evaluation. I was thinking of nesting e.g. RFECV inside GridSearchCV.

Coming back to the discussion then, do any of you have an example of something that is not possible in the current API?

@jnothman
Member Author

You can't currently use: GridSearchCV(GridSearchCV(SVC(), cv=KFold(...)))
for arbitrary data because KFold needs to be constructed with the number of
samples known.
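
A sketch of what goes wrong (module and constructor names as in the current API; the last call is illustrative only):

from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold, cross_val_score

inner_cv = KFold(100, n_folds=3)   # n_samples has to be fixed here, up front
search = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=inner_cv)

# In an outer cross-validation each training fold has fewer than 100 samples,
# so the indices baked into inner_cv no longer match the data it is handed:
# cross_val_score(search, X, y, cv=5)   # breaks or silently misbehaves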

@larsmans
Member

Alright, +1 for a careful API change. It would be nicest if the old API just keeps working.

@GaelVaroquaux
Member

I agree that we should strive to have it working for quite a while.

I just want to sort out these issues before 1.0.

@larsmans
Member

Actually I'd like it most if the old API just kept working forever...

@GaelVaroquaux
Member

Actually I'd like it most if the old API just kept working forever...

It's broken. Really. It was a bad design. It makes it impossible to do
nested cross-validation. Sorry, I made a mistake.

We will need to stop supporting it at some point.
