[MRG+1] Make cross-validators data independent + Reorganize grid_search, cross_validation and learning_curve into model_selection #4294
Conversation
I'm -1 for moving cross_validation generators if they're getting a rewrite and we want to be able to use the same names.
Force-pushed c652e0f to 758cdc2
It's fine to do so as long as you ensure that they're released at the same time.

On 26 February 2015 at 00:10, ragv notifications@github.com wrote:
Sure! I'll make sure all these go into the
I agree, but that would make reviewing a tad easier, with fewer diffs to look at, I feel :)
Or wait... like @amueller said in an earlier comment, we could put this in a branch and send the next two PRs to that... would it be better?
Really, as I said in other places, I thought the whole point of doing the move now is having a deprecation path for the cross-validation objects.
Agreed.

Sent from my phone. Please forgive brevity and misspelling.

On Feb 25, 2015, 19:34, Andreas Mueller notifications@github.com wrote:
+1
Force-pushed 758cdc2 to 5d28b86
As said earlier, I am converting the PR into a full fix for #2904 alone... Though I am not even 20% done with that goal, it would be helpful to know if there are any objections to this early on :) Will this fix (without clubbing together the fix for #1848) be considered for merge? Andreas has expressed his +1 towards the same... Any more suggestions / critiques?
I think it is pretty safe to go ahead with that. Bundling multiple changes in a PR is rarely a good idea if they can be separated in a sensible way.
Thanks for the comment :)
Force-pushed 6c4edb4 to e42cc57
    @property
    def n_unique_labels(self):
        _deprecate_label_attribute(self, "n_unique_labels")
        return len(np.unique(self.labels))
@amueller `unique_labels` and `n_unique_labels` were previously defined in the `__init__` and hence were publicly available as attributes... Since we deprecate initializing data in the `__init__`, I've deprecated them this way... Could you kindly let me know if this seems sane? (Or should we just remove those attributes without any deprecation?)
You should just use the `@deprecated` decorator on the properties and not define a new function.
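For illustration, a minimal sketch of the deprecate-on-access pattern being suggested; the toy class, warning text, and version wording below are placeholders, not the PR's actual code:

```python
import warnings

import numpy as np


class LeaveOneLabelOut:
    """Toy stand-in for a label-based cross-validator (illustration only)."""

    def __init__(self, labels):
        self.labels = np.asarray(labels)

    @property
    def n_unique_labels(self):
        # Warn only when the attribute is actually accessed,
        # then fall back to the old behaviour.
        warnings.warn("The n_unique_labels attribute is deprecated "
                      "and will be removed in a future release.",
                      DeprecationWarning)
        return len(np.unique(self.labels))
```

Accessing `LeaveOneLabelOut([1, 1, 2]).n_unique_labels` still returns 2 but emits a `DeprecationWarning`, which is also what decorating the getter with `@deprecated` boils down to.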
Otherwise this is the right way.
This is not so much about initializing in `__init__` (which wouldn't be a problem here, since these are not estimators) as about warning only if someone actually uses them. But you did the right thing.
Does this seem like a good docstring for all the split methods? [Sorry for re-commenting]
@ogrisel Will this be included for 0.16? (if not, is it safe to remove `Bootstrap`?)
Force-pushed d7ea617 to 322ee69
Sorry I am just piling one comment upon another :/ How would you feel if
Force-pushed 322ee69 to 37dd8f5
@@ -80,16 +92,36 @@ def indices(self):
        return self._indices

    def __iter__(self):
        return self.split(None)

    def split(self, y):
Why is this called `y`? Maybe "array" or something would be better?
How about `data`?
Data is `X`. `y` isn't data, is it? But this would also be applied to `y`, right? Or either? Sorry, I haven't read enough of the rest of the PR.
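For context, a small sketch of where this naming discussion ends up: in the released `model_selection` API the splitter takes no data at construction and the arrays are passed to `split` instead (the data below is illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

# No data in the constructor...
cv = KFold(n_splits=5)

# ...the data is only seen at split time, so one instance
# can be reused across datasets.
for train_index, test_index in cv.split(X, y):
    print("train:", train_index, "test:", test_index)
```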
This will not be included in 0.16, but `Bootstrap` will only be removed in 0.17.
You are right, `Bootstrap` can safely be removed, and will be in #4370 (I'll fix that in a second).
@amueller Thanks a lot for the comments!! :) That cleared up a few things... :) Just one more issue - how would you feel if
             LeavePOut, ShuffleSplit, LabelShuffleSplit, StratifiedKFold,
             StratifiedShuffleSplit, PredefinedSplit)

LABEL_CVS = (LabelKFold, LeaveOneLabelOut, LeavePLabelOut, LabelShuffleSplit,)
Those constants shouldn't be defined here.
Can we have a separate file, `_constants.py`, inside `model_selection`? (This will also hold the best param dict that will be added by @zermelozf soon?)

Also @amueller
Have moved this to `_validation.py` as advised by Arnaud and Andy IRL.
Force-pushed 7972165 to ca9517b
Main Commits - Major
--------------------
* ENH Reorganize classes/fn from grid_search into search.py
* ENH Reorganize classes/fn from cross_validation into split.py
* ENH Reorganize cls/fn from cross_validation/learning_curve into validate.py
* MAINT Merge _check_cv into check_cv inside the model_selection module
* MAINT Update all the imports to point to the model_selection module
* FIX use iter_cv to iterate through the new style/old style cv objs
* TST Add tests for the new model_selection members
* ENH Wrap the old-style cv obj/iterables instead of using iter_cv
* ENH Use scipy's binomial coefficient function comb for calculation of nCk
* ENH Few enhancements to the split module
* ENH Improve check_cv input validation and docstring
* MAINT _get_test_folds(X, y, labels) --> _get_test_folds(labels)
* TST if 1d arrays for X introduce any errors
* ENH use 1d X arrays for all tests
* ENH X_10 --> X (global var)

Minor
-----
* ENH _PartitionIterator --> _BaseCrossValidator
* ENH CVIterator --> CVIterableWrapper
* TST Import the old SKF locally
* FIX/TST Clean up the split module's tests
* DOC Improve documentation of the cv parameter
* COSMIT consistently hyphenate cross-validation/cross-validator
* TST Calculate n_samples from X
* COSMIT Use separate lines for each import
* COSMIT cross_validation_generator --> cross_validator

Commits merged manually
-----------------------
* FIX Document the random_state attribute in RandomSearchCV
* MAINT Use check_cv instead of _check_cv
* ENH refactor OVO decision function, use it in SVC for sklearn-like decision_function shape
* FIX avoid memory cost when sampling from large parameter grids
Squashed commit messages - (For reference)

Major
-----
* ENH p --> n_labels
* FIX *ShuffleSplit: all float/invalid type errors at init and int error at split
* FIX make PredefinedSplit accept test_folds in constructor; Cleanup docstrings
* ENH+TST KFold: make rng to be generated at every split call for reproducibility
* FIX/MAINT KFold: make shuffle a public attr
* FIX Make CVIterableWrapper private
* FIX reuse len_cv instead of recalculating it
* FIX Prevent adding *SearchCV estimators from the old grid_search module
* re-FIX In all_estimators: the sorting to use only the 1st item (name), to avoid collision between the old and the new GridSearch classes
* FIX test_validate.py: Use 2D X (1D X is being detected as a single sample)
* MAINT validate.py --> validation.py
* MAINT make the submodules private
* MAINT Support old cv/gs/lc until 0.19
* FIX/MAINT n_splits --> get_n_splits
* FIX/TST test_logistic.py/test_ovr_multinomial_iris: pass predefined folds as an iterable
* MAINT expose BaseCrossValidator
* Update the model_selection module with changes from master:
  - From scikit-learn#5161 - MAINT remove redundant p variable; Add check for sparse prediction in cross_val_predict
  - From scikit-learn#5201 - DOC improve random_state param doc
  - From scikit-learn#5190 - LabelKFold and test
  - From scikit-learn#4583 - LabelShuffleSplit and tests
  - From scikit-learn#5300 - shuffle the `labels` not the `indxs` in LabelKFold + tests
  - From scikit-learn#5378 - Make the GridSearchCV docs more accurate
  - From scikit-learn#5458 - Remove shuffle from LabelKFold
  - From scikit-learn#5466 (scikit-learn#4270) - Gaussian Process by Jan Metzen
  - From scikit-learn#4826 - Move custom error / warnings into sklearn.exception

Minor
-----
* ENH Make the KFold shuffling test stronger
* FIX/DOC Use the higher level model_selection module as ref
* DOC in check_cv "y : array-like, optional"
* DOC a supervised learning problem --> supervised learning problems
* DOC cross-validators --> cross-validation strategies
* DOC Correct Olivier Grisel's name ;)
* MINOR/FIX cv_indices --> kfold
* FIX/DOC Align the 'See also' section of the new KFold, LeaveOneOut
* TST/FIX imports on separate lines
* FIX use __class__ instead of classmethod
* TST/FIX import directly from model_selection
* COSMIT Relocate the random_state documentation
* COSMIT remove pass
* MAINT Remove deprecation warnings from old tests
* FIX correct import at test_split
* FIX/MAINT Move P_sparse, X, y defns to top; rm unused W_sparse, X_sparse
* FIX random state to avoid doctest failure
* TST n_splits and split wrapping of _CVIterableWrapper
* FIX/MAINT Use multilabel indicator matrix directly
* TST/DOC clarify why we conflate classes 0 and 1
* DOC add comment that this was taken from BaseEstimator
* FIX use of labels is not needed in stratified k fold
* FIX cross_validation reference
* FIX the labels param doc
COSMIT Sort the members alphabetically
COSMIT len_cv --> n_splits
COSMIT Merge 2 if; FIX Use kwargs
DOC Add my name to the authors :D
DOC make labels parameter consistent
FIX Remove hack for boolean indices; + COSMIT idx --> indices; DOC Add Returns
COSMIT preds --> predictions
DOC Add Returns and neatly arrange X, y, labels
FIX idx(s)/ind(s) --> indice(s)
COSMIT Merge if and else to elif
COSMIT n --> n_samples
COSMIT Use bincount only once
COSMIT cls --> class_i / class_i (ith class indices) --> perm_indices_class_i
COSMIT c --> count
FIX/TST make check_cv raise ValueError for string cv value
TST nested cv (gs inside cross_val_score) works for diff cvs
FIX/ENH Raise ValueError when labels is None for label based cvs
TST if labels is being passed correctly to the cv and that the ValueError is being propagated to the cross_val_score/predict and grid search
FIX pass labels to cross_val_score
FIX use make_classification
DOC Add Returns; COSMIT Remove scaffolding
TST add a test to check the _build_repr helper
REVERT the old GS/RS should also be tested by the common tests
ENH Add a tuple of all/label based CVs
FIX raise VE even at get_n_splits if labels is None
FIX Fabian's comments
PEP8
Force-pushed 74ec175 to 6a0d2fc
[MRG+1] Reorganize grid_search, cross_validation and learning_curve into model_selection
Blame me if anything breaks.
🍻

🍻
Thanks a lot Andy, Vlad, Joel and Arnaud for the reviews and merge 🍻 :)
So how to use now, starting from 0?

On Fri, Oct 23, 2015 at 11:50 AM, Raghav R V notifications@github.com wrote:
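For readers landing here with the same question, a short hedged summary of the migration (the support window for the old modules is taken from the commit log above):

```python
# Old imports (deprecated after this PR, supported until 0.19):
#   from sklearn.cross_validation import KFold, cross_val_score
#   from sklearn.grid_search import GridSearchCV
#   from sklearn.learning_curve import learning_curve

# New imports:
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_score,
                                     learning_curve)
```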
fixes #1848, #2904
Things to be done after this PR - Issue at #5053 (PR at #5569)

TODO

- Move everything into the `model_selection` module.
- `check_cv` (so both old style and new style classes can be used) - see the sketch after this list.
- Make the `model_selection` tests pass.
- [MRG+1] FIX all the examples to use the new cv classes (raghavrv/scikit-learn#3) - merged into [MRG] Model selection documentation (raghavrv/scikit-learn#4)!
- MINOR Rename `p` to a better name - Moved to [RFC] Changes to model_selection? (#5053)
- `_check_is_partition` --> `_check_is_permutation`?
- `_empty_mask`
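A rough sketch (not the PR's code) of what the merged `check_cv` enables: an integer, a new-style splitter, and a plain iterable of (train, test) index pairs are all normalized to an object exposing `split` and `get_n_splits`; the arrays are illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold, check_cv

X = np.zeros((6, 2))
y = np.array([0, 0, 1, 1, 0, 1])

cv_int = check_cv(3, y, classifier=True)    # int -> (Stratified)KFold
cv_new = check_cv(KFold(n_splits=2), y)     # new-style splitter passes through
cv_old = check_cv([(np.array([0, 1, 2]),    # old-style iterable gets wrapped
                    np.array([3, 4, 5]))], y)

for cv in (cv_int, cv_new, cv_old):
    print(cv.get_n_splits(X, y), list(cv.split(X, y)))
```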
Open Discussions

- The `labels` arg - Refer discussion here (#4294 (comment)) and here (#4294 (comment)) - `labels` added as the last argument.
- Data in the `__init__`... - Now successive split calls return similar results (when `random_state` is set).
- Passing `labels` to the inner cv in `permutation_test_score` - Refer #4294 (comment) - ping @agramfort - For now yes. (#4294 (comment))
- Should `CVIteratorWrapper` be private? - Refer #4294 (comment) - Made private.

NOTE: The current implementation will still not support nesting `EstimatorCV` inside `GridSearchCV`... This will become possible only after allowing `sample_properties` to pass from `GridSearchCV` to `EstimatorCV`...
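By contrast, nesting a `GridSearchCV` inside `cross_val_score` does work (the commit log above includes a "nested cv" test for exactly this); a sketch with illustrative parameters:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, random_state=0)

# Inner loop tunes C; the outer loop estimates the tuned model's score.
inner_search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]},
                            cv=KFold(n_splits=3))
scores = cross_val_score(inner_search, X, y, cv=KFold(n_splits=4))
print(scores.mean())
```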
PRs whose changes to g_s / c_v / l_c have been manually incorporated into this:

- #4714 - SVM decision function shape - 1 commit
- #4829 - merge _check_cv into check_cv ... - 1 commit
- #4857 - Document missing attribute `random_state` in `RandomizedSearchCV` - 1 commit
- #4840 - FIX avoid memory cost when sampling from large parameter grids - 1 commit
- #5194 (Refer #5238) - Consistent CV docstring
- #5161 - check for sparse pred in `cross_val_predict`
- #5201 - clarify random state in `KFold`
- #5190 - `LabelKFold`
- #4583 - `LabelShuffleSplit`
- #5283 - Remove some warnings in grid search tests
- #5300 - shuffle `labels` not `idxs` and tests to ensure it

This PR is slightly based upon @pignacio's work in #3340.
@amueller's hack:

If you want to align diffs you can do this (in an IPython notebook)