[WIP] Data independent CV and model_selection module #3340

pignacio · 2014-07-03T16:21:37Z

Hi guys,

I'm working on #1848 and #2904, and would like to get some feedback on my changes.

Here's a writeup about the issues/changes: http://pignacio.github.io/jwoc-blog/stories/stuff-about-scikit-learn18482904.html

Any thoughts / suggestions / improvements / rants on the changes are greatly appreciated

jnothman · 2014-07-04T00:13:22Z

Thanks, @pignacio. I can't look right now, but would like to note that we designate PRs with [WIP] in their title as "Work in progress (comments welcome)", as opposed to [MRG]("Ready to review for merge") and [MRG+1]("One reviewer has approved for merge")... So I'm popping [WIP] in front of this PR's title.

jnothman · 2014-07-04T01:13:38Z

I don't think I get the need for _PartitionTestGenerator. You say:

As the same CV iterator can now be used several times (particularly in a nested way), a new object which controls the generation must be created each time, and must be independent from all others.

If def split(...) has yield statements in it, then calling split(...) will automatically create a new generator object, that is independent from other generators, as long as no shared state is modified.

jnothman · 2014-07-04T01:17:35Z

Still not looking in detail, but I think we want to remove data-dependent parameters (e.g. n in KFold) from CV constructors. Even so, we need to maintain the old interface in sklearn.cross_validation

jnothman · 2014-07-04T01:18:53Z

sklearn/grid_search.py


+from sklearn.model_selection.search import GridSearchCV, RandomizedSearchCV


please use relative imports

jnothman · 2014-07-04T01:19:59Z

I note that there will also be a raft of changes to the documentation

pignacio · 2014-07-10T20:51:42Z

Fixed the minor issues
Leave*Out classes work with split now. *Shuffle* ones need some more work

I don't think I get the need for _PartitionTestGenerator [...]

You are right. _PartitionTestGenerator is not needed. At first glance, i thought there was some shared internal state that might cause problems, but it looks good.

Still not looking in detail, but I think we want to remove data-dependent parameters (e.g. n in KFold) from CV constructors. Even so, we need to maintain the old interface in sklearn.cross_validation

Right now I'm using defaults so that the same class can be used both ways:

for train, test in KFold(n, n_folds=3):
    print train, test

for train, test in KFold(n_folds=3).split(target):
    print train, test

Once the backwards compatibility is not needed anymore, we should be able to remove those parameters. Also, __iter__ now falls back to split(None) (Which means 'old behaviour, class members should be used'), which leads to some extra checks here and there, that will also be removed

I note that there will also be a raft of changes to the documentation

Yes, of course, also tests. Once we get the nasty details of this right, I'll get into that

I can't look right now, but would like to note that we designate PRs with [WIP] in their title as "Work in progress (comments welcome)", as opposed to MRG and MRG+1... So I'm popping [WIP] in front of this PR's title.

Sorry about that, I never intended to tag this like ready to merge or anything.

jnothman · 2014-07-10T22:48:29Z

Sorry about that, I never intended to tag this like ready to merge or anything.

No problem, just letting you know why it changed.

jnothman · 2014-07-11T02:05:32Z

Once the backwards compatibility is not needed anymore, we should be able to remove those parameters.

So, presumably, we should be raising a DeprecationWarning when any data-dependent arguments are filled. We would also delete these from the Parameters listing in the documentation, and emphasise in the documentation for n_folds that it must be specified by keyword.

I am worried this will lay too many usability traps, and I think @GaelVaroquaux was hoping that in renaming the path to this module, we could keep two separate copies of the functionality while deprecating, and not worry about backwards compatibility in the new one. Similarly, the new version could ignore any handling of the deprecated indices argument, etc.

GaelVaroquaux · 2014-07-14T14:05:24Z

sklearn/covariance/graph_lasso_.py

@@ -388,7 +389,7 @@ def graph_lasso_path(X, alphas, cov_init=None, X_test=None, mode='cd',


 class GraphLassoCV(GraphLasso):
-    """Sparse inverse covariance w/ cross-validated choice of the l1 penalty
+    """Sparse inverse covariance w/ cross-from .partition import _check_cvvalidated choice of the l1 penalty


Looks like you have a spurious line here.

GaelVaroquaux · 2014-07-14T14:27:53Z

I am worried this will lay too many usability traps, and I think @GaelVaroquaux was hoping that in renaming the path to this module

Yes, this is exactly what I had in mind. It would indeed probably be a bit simpler, although it would temporarily lead to duplicated code.

That said, this approach is also very interesting. I need to think it over a bit more.

I must say that in terms of names, I am not terribly happy with importing the cross_validation objects from model_selection.partition, as they are core objects and it makes them hard to discover. We should import them directly from model_selection (and thus put them in the init of the module).

Thanks a lot @pignacio for tackling this issue. It's a very important one, and you are doing it in a very elegant way.

kastnerkyle · 2014-07-15T13:06:43Z

sklearn/cross_validation.py

-from .externals.six import with_metaclass
-from .externals.six.moves import zip
-from .metrics.scorer import check_scoring
+from .model_selection.partition import LeaveOneOut, LeavePOut, KFold, \


Can we split this into separate lines, rather than continuation? i.e.

from .model_selection.partition import LeaveOneOut, LeavePOut, KFold
from .model_selection.partition import StratifiedKFold, LeaveOneLabelOut

etc.

kastnerkyle · 2014-07-15T13:14:42Z

I made a few small nitpick comments but this looks really great! I also agree on the "have two versions, deprecate old" approach.

vene · 2014-07-16T15:54:47Z

I must say that in terms of names, I am not terribly happy with importing the cross_validation objects from model_selection.partition, as they are core objects and it makes them hard to discover. We should import them directly from model_selection (and thus put them in the init of the module).

+1, I think this is especially possible thanks to our descriptive function and object names. We can keep a nested structure for code organization, but expose a flat interface:
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV

Currently, even for the simplest of realistic examples, you'd need to import from 4-5 different submodules. This would be a step in the right direction.

vene · 2014-07-24T09:25:40Z

This PR is getting hard to review, because the renaming, coupled with the significant change, makes it hard to read the diff. It's a pity, because there is a lot of interest in this enhancement.

I thought for a bit what the best way to proceed would be, and I suggest this:

Have a new PR that just does the renaming + deprecation, but makes copies of the existing code being renamed (e.g., copy sklearn.grid_search.GridSearchCV to sklearn.model_selection.GridSearchCV as it is, and deprecate sklearn.grid_search.GridSearchCV)
Have a PR that implements the functionality by changing the copies (sklearn.model_selection.GridSearchCV).

What do you think?

jnothman · 2014-07-24T09:34:20Z

Yes, I think others have been voting for that deprecation strategy in any
case. Certainly as a way forward, being able to focus only on the new API
should be helpful.

Ping @pignacio, are you still with us?

On 24 July 2014 19:25, Vlad Niculae notifications@github.com wrote:

This PR is getting hard to review, because the renaming, coupled with
the significant change, makes it hard to read the diff. It's a pity,
because there is a lot of interest in this enhancement.

I thought for a bit what the best way to proceed would be, and I suggest
this:

Have a new PR that just does the renaming + deprecation, but makes
copies of the existing code being renamed (e.g., copy
sklearn.grid_search.GridSearchCV to sklearn.model_selection.GridSearchCV
as it is, and deprecate sklearn.grid_search.GridSearchCV)

Have a PR that implements the functionality by changing the copies (
sklearn.model_selection.GridSearchCV).

What do you think?

—
Reply to this email directly or view it on GitHub
#3340 (comment)
.

pignacio · 2014-07-24T13:36:27Z

Yep, right here.

Been a little busy these last days.

If we all agree, we can split this PR like vene suggests. I'll work on 1) for now so we can close that.

@kastnerkyle

I made a few small nitpick comments but this looks really great! I also agree on the "have two versions, deprecate old" approach.

+1 on comments of any kind. The more eyes looking this the better :)

jnothman · 2014-07-25T00:34:11Z

Well, reading it again, I'm not actually sure that there's much point in @vene's two-phase approach. Copying the old versions only means that we're giving users of the development version a false deprecation warning, or rather telling them to use something that doesn't have the right interface yet.

The deprecations are module-level: we can warn when the user imports the module, so it's actually just a two-line header in each module, not a great change.

Rather, I think:

deprecate each module as necessary, but don't do everything at once. Much of the problem is just that github doesn't show file moves very clearly.
the new CV implementations shouldn't try to be backwards compatible.
the new CV implementations are independent of moving scoring, etc., so the API changes should be made without adding noise by moving other files around.

Even if we'd like this move to be released in one version, putting it all in one PR is just too overwhelming. And any/most changes outside of the cross_validation module should be backwards-compatible, as far as I can tell.

raghavrv · 2015-02-16T08:28:08Z

@pignacio Would you be able to find time to continue the PR? Or if its okay by you could I continue with your work by cherry-picking your commits?

jnothman · 2015-02-16T10:51:21Z

@ragv, just to note this is probably not as simple as it looks...

On 16 February 2015 at 19:28, ragv notifications@github.com wrote:

@pignacio https://github.com/pignacio Would you be able to find time
to continue the PR? Or if its okay by you could I continue with your work
by cherry-picking your commits?

—
Reply to this email directly or view it on GitHub
#3340 (comment)
.

pignacio · 2015-02-16T12:37:07Z

@ragv No, i couldn't :). Feel free to take from here whatever you feel will help, and let me know if I can help, but as @jnothman notes, this was a somewhat complex change.

Good luck :)

raghavrv · 2015-02-16T13:35:07Z

@jnothman and @pignacio Thanks for the heads up :) I'll ping you guys if I run into any trouble.... :p

pignacio · 2015-02-19T02:46:21Z

Hi all, since this pull request was mentioned in the mailing list, and @GaelVaroquaux showed interest on seeing this closed, I'd like to share my thoughts respecting this changes.

A brief recap first:

This pull request intended to fix two issues:

Rename grid_search module #1848: Migrate sklearn.grid_search to sklearn.model_selection.
data-independent CV iterators #2904: Data independent CV iterators: Basically changing the CV interface
from:
```
for train, test in KFold(n, n_folds=3):
    pass
```
to:
```
for train, test in KFold(n_folds=3).split(data):
    pass
```
Also, the possibility of allowing a pandas dataframe or a dict of arrays as a value for data is discussed in the comments (See data-independent CV iterators #2904 (comment)).

What's done so far

This pull request handles the migration from grid_search to model_selectionand shows an idea of how we could handle both the old and the new cross validation syntax with the same code, to avoid duplication.

The implementation is not bulletproof, but looks promising so far.

But, as @vene points out, having these two changes implemented in the same pull request makes looking at diffs and reviewing the code difficult.

What do you think about splitting this changeset into three parts, as in:

Module migration (deprecation warnings in old module, new docs + tests)
Data independent CV iterators (deprecation warnings for old methods, docs +
tests)
Extending the allowed data types for the y value.

I think these would make the changes more simple to implement and review.

@ragv has shown interest into finishing this, and I can help too. Splitting the changeset should help us make progress.

Thanks for reading up to here, and feel free to comment your thoughts :)

raghavrv · 2015-03-16T20:53:58Z

sklearn/model_selection/partition.py

+    def __init__(self, n, n_folds, indices, shuffle, random_state):
+        super(_BaseKFold, self).__init__(n, indices)
+
+        if abs(n_folds - int(n_folds)) >= np.finfo('f').eps:


@amueller Wouldn't a simple if type(n_folds) is not int: suffice here?

type(n_folds) is not good, as it could be a numpy int. A isinstance(n_folds, numeric.Integer) would work, though.

Ah! Thanks :)

raghavrv · 2015-10-09T15:01:38Z

@ogrisel @amueller This can be closed since we have #4294 which will be merged very soon(?)

amueller · 2015-10-09T16:58:59Z

when it's merged

jnothman · 2015-11-05T21:29:13Z

yay!

On 6 November 2015 at 07:25, Manoj Kumar notifications@github.com wrote:

Closed #3340 #3340.

—
Reply to this email directly or view it on GitHub
#3340 (comment).

raghavrv · 2015-11-05T21:34:19Z

:D

jnothman changed the title ~~Data independent CV and model_selection module~~ [WIP] Data independent CV and model_selection module Jul 4, 2014

jnothman reviewed Jul 4, 2014
View reviewed changes

sklearn/grid_search.py

from sklearn.model_selection.search import GridSearchCV, RandomizedSearchCV

Copy link

Member

jnothman Jul 4, 2014

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use relative imports

pignacio added 6 commits July 8, 2014 10:32

Move cross_variation module into model_selection

030a1cd

Move learning_curve module into model_selection.validate

f8c4b04

Move grid_search module inside model_selection

9b93e7f

Working split method for KFold and StratifiedKFold

991e001

Removed _PartitionTestGenerator

089f00c

LeaveOut and LeaveLabelOut working with split

ec04270

GaelVaroquaux reviewed Jul 14, 2014
View reviewed changes

kastnerkyle reviewed Jul 15, 2014
View reviewed changes

vene mentioned this pull request Jul 16, 2014

[WIP] sample_weight support #1574

Closed

6 tasks

jnothman mentioned this pull request Aug 2, 2014

[WIP] scorer_params and sample_weight support #3524

Closed

4 tasks

larsmans force-pushed the master branch from 58a55ad to 4b82379 Compare August 25, 2014 21:50

MechCoder force-pushed the master branch from 6deaea0 to 3f49cee Compare November 3, 2014 12:36

raghavrv mentioned this pull request Feb 16, 2015

[WIP] Clean up of cross_validation et al. and model_selection refactoring #4254

Closed

raghavrv mentioned this pull request Feb 25, 2015

[MRG+1] Make cross-validators data independent + Reorganize grid_search, cross_validation and learning_curve into model_selection #4294

Merged

24 tasks

raghavrv reviewed Mar 16, 2015
View reviewed changes

MechCoder closed this Nov 5, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Data independent CV and model_selection module #3340

[WIP] Data independent CV and model_selection module #3340

pignacio commented Jul 3, 2014

jnothman commented Jul 4, 2014

jnothman commented Jul 4, 2014

jnothman commented Jul 4, 2014

jnothman Jul 4, 2014

jnothman commented Jul 4, 2014

pignacio commented Jul 10, 2014

jnothman commented Jul 10, 2014

jnothman commented Jul 11, 2014

GaelVaroquaux Jul 14, 2014

GaelVaroquaux commented Jul 14, 2014

kastnerkyle Jul 15, 2014

kastnerkyle commented Jul 15, 2014

vene commented Jul 16, 2014

vene commented Jul 24, 2014

jnothman commented Jul 24, 2014

pignacio commented Jul 24, 2014

jnothman commented Jul 25, 2014

raghavrv commented Feb 16, 2015

jnothman commented Feb 16, 2015

pignacio commented Feb 16, 2015

raghavrv commented Feb 16, 2015

pignacio commented Feb 19, 2015

raghavrv Mar 16, 2015

amueller Mar 16, 2015

raghavrv Mar 16, 2015

raghavrv commented Oct 9, 2015

amueller commented Oct 9, 2015

jnothman commented Nov 5, 2015

raghavrv commented Nov 5, 2015


		from sklearn.model_selection.search import GridSearchCV, RandomizedSearchCV

[WIP] Data independent CV and model_selection module #3340

[WIP] Data independent CV and model_selection module #3340

Conversation

pignacio commented Jul 3, 2014

jnothman commented Jul 4, 2014

jnothman commented Jul 4, 2014

jnothman commented Jul 4, 2014

jnothman Jul 4, 2014

Choose a reason for hiding this comment

jnothman commented Jul 4, 2014

pignacio commented Jul 10, 2014

jnothman commented Jul 10, 2014

jnothman commented Jul 11, 2014

GaelVaroquaux Jul 14, 2014

Choose a reason for hiding this comment

GaelVaroquaux commented Jul 14, 2014

kastnerkyle Jul 15, 2014

Choose a reason for hiding this comment

kastnerkyle commented Jul 15, 2014

vene commented Jul 16, 2014

vene commented Jul 24, 2014

jnothman commented Jul 24, 2014

pignacio commented Jul 24, 2014

jnothman commented Jul 25, 2014

raghavrv commented Feb 16, 2015

jnothman commented Feb 16, 2015

pignacio commented Feb 16, 2015

raghavrv commented Feb 16, 2015

pignacio commented Feb 19, 2015

A brief recap first:

What's done so far

raghavrv Mar 16, 2015

Choose a reason for hiding this comment

amueller Mar 16, 2015

Choose a reason for hiding this comment

raghavrv Mar 16, 2015

Choose a reason for hiding this comment

raghavrv commented Oct 9, 2015

amueller commented Oct 9, 2015

jnothman commented Nov 5, 2015

raghavrv commented Nov 5, 2015