[MRG+1] Make cross-validators data independent + Reorganize grid_search, cross_validation and learning_curve into model_selection #4294
Conversation
I'm -1 for moving cross_validation generators if they're getting a rewrite and we want to be able to use the same names.
Force-pushed c652e0f to 758cdc2
It's fine to do so as long as you ensure that they're released at the same time.

On 26 February 2015 at 00:10, ragv notifications@github.com wrote:
Sure! I'll make sure all these go into the
I agree, but that would make reviewing a tad easier, with fewer diffs to look at, I feel :)
Or wait... like @amueller said in an earlier comment, we could put this in a branch and send the next two PRs to that... would it be better?
Really, as I said in other places, I thought the whole point of doing the move now is having a deprecation path for the cross-validation objects.
Agreed.

Sent from my phone. Please forgive brevity and misspelling.

On Feb 25, 2015, 19:34, Andreas Mueller notifications@github.com wrote:
+1
Force-pushed 758cdc2 to 5d28b86
As said earlier, I am converting the PR into a full fix for #2904 alone... Though I am not even 20% done with that goal, it would be helpful to know if there are any objections to this early on :) Will this fix (without clubbing together the fix for #1848) be considered for merge? Andreas has expressed his +1 towards the same... Any more suggestions / critiques?
I think it is pretty safe to go ahead with that. Bundling multiple changes in a PR is rarely a good idea if they can be separated in a sensible way.
Thanks for the comment :)
Force-pushed 6c4edb4 to e42cc57
    @property
    def n_unique_labels(self):
        _deprecate_label_attribute(self, "n_unique_labels")
        return len(np.unique(self.labels))
@amueller `unique_labels` and `n_unique_labels` were previously defined in the `__init__` and hence were publicly available as attributes... Since we deprecate initializing data in the `__init__`, I've deprecated them this way... Could you kindly let me know if this seems sane? (Or should we just remove those attributes without any deprecation?)
You should just use the `@deprecated` decorator on the properties and not define a new function.
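For illustration, a minimal sketch of the deprecate-on-access pattern being suggested; the toy class, warning text, and version wording below are placeholders, not the PR's actual code:

```python
import warnings

import numpy as np


class LeaveOneLabelOut:
    """Toy stand-in for a label-based cross-validator (illustration only)."""

    def __init__(self, labels):
        self.labels = np.asarray(labels)

    @property
    def n_unique_labels(self):
        # Warn only when the attribute is actually accessed,
        # then fall back to the old behaviour.
        warnings.warn("The n_unique_labels attribute is deprecated "
                      "and will be removed in a future release.",
                      DeprecationWarning)
        return len(np.unique(self.labels))
```

Accessing `LeaveOneLabelOut([1, 1, 2]).n_unique_labels` still returns 2 but emits a `DeprecationWarning`, which is also what decorating the getter with `@deprecated` boils down to.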
Otherwise this is the right way.
This is not so much about initializing in `__init__` (which wouldn't be a problem here, since these are not estimators) as about warning only if someone actually uses them. But you did the right thing.
Does this seem like a good docstring for all the split methods? [Sorry for re-commenting]
@ogrisel Will this be included for 0.16? (if not, is it safe to remove `Bootstrap`?)
Force-pushed d7ea617 to 322ee69
Sorry I am just piling one comment upon another :/ How would you feel if
Force-pushed 322ee69 to 37dd8f5
@@ -80,16 +92,36 @@ def indices(self):
        return self._indices

    def __iter__(self):
        return self.split(None)

    def split(self, y):
Why is this called `y`? Maybe "array" or something would be better?
How about `data`?
Data is `X`. `y` isn't data, is it? But this would also be applied to `y`, right? Or either? Sorry, I haven't read enough of the rest of the PR.
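For context, a small sketch of where this naming discussion ends up: in the released `model_selection` API the splitter takes no data at construction and the arrays are passed to `split` instead (the data below is illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

# No data in the constructor...
cv = KFold(n_splits=5)

# ...the data is only seen at split time, so one instance
# can be reused across datasets.
for train_index, test_index in cv.split(X, y):
    print("train:", train_index, "test:", test_index)
```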
This will not be included in 0.16, but `Bootstrap` will only be removed in 0.17.
You are right, `Bootstrap` can safely be removed, and will be in #4370 (I'll fix that in a second).
@amueller Thanks a lot for the comments!! :) That cleared up a few things... :) Just one more issue - how would you feel if
             LeavePOut, ShuffleSplit, LabelShuffleSplit, StratifiedKFold,
             StratifiedShuffleSplit, PredefinedSplit)

LABEL_CVS = (LabelKFold, LeaveOneLabelOut, LeavePLabelOut, LabelShuffleSplit,)
Those constants shouldn't be defined here.
Can we have a separate file, `_constants.py`, inside `model_selection`? (This will also hold the best param dict that will be added by @zermelozf soon?)

Also @amueller
Have moved this to `_validation.py` as advised by Arnaud and Andy IRL.
Force-pushed 7972165 to ca9517b
Main Commits - Major
--------------------
* ENH Reorganize classes/fn from grid_search into search.py
* ENH Reorganize classes/fn from cross_validation into split.py
* ENH Reorganize cls/fn from cross_validation/learning_curve into validate.py
* MAINT Merge _check_cv into check_cv inside the model_selection module
* MAINT Update all the imports to point to the model_selection module
* FIX use iter_cv to iterate through the new style/old style cv objs
* TST Add tests for the new model_selection members
* ENH Wrap the old-style cv obj/iterables instead of using iter_cv
* ENH Use scipy's binomial coefficient function comb for calculation of nCk
* ENH Few enhancements to the split module
* ENH Improve check_cv input validation and docstring
* MAINT _get_test_folds(X, y, labels) --> _get_test_folds(labels)
* TST if 1d arrays for X introduce any errors
* ENH use 1d X arrays for all tests
* ENH X_10 --> X (global var)

Minor
-----
* ENH _PartitionIterator --> _BaseCrossValidator
* ENH CVIterator --> CVIterableWrapper
* TST Import the old SKF locally
* FIX/TST Clean up the split module's tests
* DOC Improve documentation of the cv parameter
* COSMIT consistently hyphenate cross-validation/cross-validator
* TST Calculate n_samples from X
* COSMIT Use separate lines for each import
* COSMIT cross_validation_generator --> cross_validator

Commits merged manually
-----------------------
* FIX Document the random_state attribute in RandomSearchCV
* MAINT Use check_cv instead of _check_cv
* ENH refactor OVO decision function, use it in SVC for sklearn-like decision_function shape
* FIX avoid memory cost when sampling from large parameter grids
Squashed commit messages - (For reference)

Major
-----
* ENH p --> n_labels
* FIX *ShuffleSplit: all float/invalid type errors at init and int error at split
* FIX make PredefinedSplit accept test_folds in constructor; Cleanup docstrings
* ENH+TST KFold: make rng to be generated at every split call for reproducibility
* FIX/MAINT KFold: make shuffle a public attr
* FIX Make CVIterableWrapper private
* FIX reuse len_cv instead of recalculating it
* FIX Prevent adding *SearchCV estimators from the old grid_search module
* re-FIX In all_estimators: the sorting to use only the 1st item (name), to avoid collision between the old and the new GridSearch classes
* FIX test_validate.py: Use 2D X (1D X is being detected as a single sample)
* MAINT validate.py --> validation.py
* MAINT make the submodules private
* MAINT Support old cv/gs/lc until 0.19
* FIX/MAINT n_splits --> get_n_splits
* FIX/TST test_logistic.py/test_ovr_multinomial_iris: pass predefined folds as an iterable
* MAINT expose BaseCrossValidator
* Update the model_selection module with changes from master:
  - From scikit-learn#5161 - MAINT remove redundant p variable; Add check for sparse prediction in cross_val_predict
  - From scikit-learn#5201 - DOC improve random_state param doc
  - From scikit-learn#5190 - LabelKFold and test
  - From scikit-learn#4583 - LabelShuffleSplit and tests
  - From scikit-learn#5300 - shuffle the `labels` not the `indxs` in LabelKFold + tests
  - From scikit-learn#5378 - Make the GridSearchCV docs more accurate
  - From scikit-learn#5458 - Remove shuffle from LabelKFold
  - From scikit-learn#5466 (scikit-learn#4270) - Gaussian Process by Jan Metzen
  - From scikit-learn#4826 - Move custom error / warnings into sklearn.exception

Minor
-----
* ENH Make the KFold shuffling test stronger
* FIX/DOC Use the higher level model_selection module as ref
* DOC in check_cv "y : array-like, optional"
* DOC a supervised learning problem --> supervised learning problems
* DOC cross-validators --> cross-validation strategies
* DOC Correct Olivier Grisel's name ;)
* MINOR/FIX cv_indices --> kfold
* FIX/DOC Align the 'See also' section of the new KFold, LeaveOneOut
* TST/FIX imports on separate lines
* FIX use __class__ instead of classmethod
* TST/FIX import directly from model_selection
* COSMIT Relocate the random_state documentation
* COSMIT remove pass
* MAINT Remove deprecation warnings from old tests
* FIX correct import at test_split
* FIX/MAINT Move P_sparse, X, y defns to top; rm unused W_sparse, X_sparse
* FIX random state to avoid doctest failure
* TST n_splits and split wrapping of _CVIterableWrapper
* FIX/MAINT Use multilabel indicator matrix directly
* TST/DOC clarify why we conflate classes 0 and 1
* DOC add comment that this was taken from BaseEstimator
* FIX use of labels is not needed in stratified k fold
* FIX cross_validation reference
* FIX the labels param doc
COSMIT Sort the members alphabetically
COSMIT len_cv --> n_splits
COSMIT Merge 2 if; FIX Use kwargs
DOC Add my name to the authors :D
DOC make labels parameter consistent
FIX Remove hack for boolean indices; + COSMIT idx --> indices; DOC Add Returns
COSMIT preds --> predictions
DOC Add Returns and neatly arrange X, y, labels
FIX idx(s)/ind(s) --> indice(s)
COSMIT Merge if and else to elif
COSMIT n --> n_samples
COSMIT Use bincount only once
COSMIT cls --> class_i / class_i (ith class indices) --> perm_indices_class_i
COSMIT c --> count
FIX/TST make check_cv raise ValueError for string cv value
TST nested cv (gs inside cross_val_score) works for diff cvs
FIX/ENH Raise ValueError when labels is None for label based cvs
TST if labels is being passed correctly to the cv and that the ValueError is being propagated to the cross_val_score/predict and grid search
FIX pass labels to cross_val_score
FIX use make_classification
DOC Add Returns; COSMIT Remove scaffolding
TST add a test to check the _build_repr helper
REVERT the old GS/RS should also be tested by the common tests
ENH Add a tuple of all/label based CVs
FIX raise VE even at get_n_splits if labels is None
FIX Fabian's comments
PEP8
Force-pushed 74ec175 to 6a0d2fc
[MRG+1] Reorganize grid_search, cross_validation and learning_curve into model_selection
Blame me if anything breaks.
🍻

🍻
Thanks a lot Andy, Vlad, Joel and Arnaud for the reviews and merge 🍻 :)
So how to use now, starting from 0?

On Fri, Oct 23, 2015 at 11:50 AM, Raghav R V notifications@github.com wrote:
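For readers landing here with the same question, a short hedged summary of the migration (the support window for the old modules is taken from the commit log above):

```python
# Old imports (deprecated after this PR, supported until 0.19):
#   from sklearn.cross_validation import KFold, cross_val_score
#   from sklearn.grid_search import GridSearchCV
#   from sklearn.learning_curve import learning_curve

# New imports:
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_score,
                                     learning_curve)
```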
fixes #1848, #2904
Things to be done after this PR - Issue at #5053 (PR at #5569)

TODO

- Move everything into the `model_selection` module.
- `check_cv` (so both old style and new style classes can be used) - see the sketch after this list.
- Make the `model_selection` tests pass.
- [MRG+1] FIX all the examples to use the new cv classes (raghavrv/scikit-learn#3) - merged into [MRG] Model selection documentation (raghavrv/scikit-learn#4)!
- MINOR Rename `p` to a better name - Moved to [RFC] Changes to model_selection? (#5053)
- `_check_is_partition` --> `_check_is_permutation`?
- `_empty_mask`
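A rough sketch (not the PR's code) of what the merged `check_cv` enables: an integer, a new-style splitter, and a plain iterable of (train, test) index pairs are all normalized to an object exposing `split` and `get_n_splits`; the arrays are illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold, check_cv

X = np.zeros((6, 2))
y = np.array([0, 0, 1, 1, 0, 1])

cv_int = check_cv(3, y, classifier=True)    # int -> (Stratified)KFold
cv_new = check_cv(KFold(n_splits=2), y)     # new-style splitter passes through
cv_old = check_cv([(np.array([0, 1, 2]),    # old-style iterable gets wrapped
                    np.array([3, 4, 5]))], y)

for cv in (cv_int, cv_new, cv_old):
    print(cv.get_n_splits(X, y), list(cv.split(X, y)))
```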
Open Discussions

- The `labels` arg - Refer discussion here (#4294 (comment)) and here (#4294 (comment)) - `labels` added as the last argument.
- Data in the `__init__`... - Now successive split calls return similar results (when `random_state` is set).
- Passing `labels` to the inner cv in `permutation_test_score` - Refer #4294 (comment) - ping @agramfort - For now yes. (#4294 (comment))
- Should `CVIteratorWrapper` be private? - Refer #4294 (comment) - Made private.

NOTE: The current implementation will still not support nesting `EstimatorCV` inside `GridSearchCV`... This will become possible only after allowing `sample_properties` to pass from `GridSearchCV` to `EstimatorCV`...
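By contrast, nesting a `GridSearchCV` inside `cross_val_score` does work (the commit log above includes a "nested cv" test for exactly this); a sketch with illustrative parameters:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, random_state=0)

# Inner loop tunes C; the outer loop estimates the tuned model's score.
inner_search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]},
                            cv=KFold(n_splits=3))
scores = cross_val_score(inner_search, X, y, cv=KFold(n_splits=4))
print(scores.mean())
```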
PRs whose changes to g_s / c_v / l_c have been manually incorporated into this:

- #4714 - SVM decision function shape - 1 commit
- #4829 - merge _check_cv into check_cv ... - 1 commit
- #4857 - Document missing attribute `random_state` in `RandomizedSearchCV` - 1 commit
- #4840 - FIX avoid memory cost when sampling from large parameter grids - 1 commit
- #5194 (Refer #5238) - Consistent CV docstring
- #5161 - check for sparse pred in `cross_val_predict`
- #5201 - clarify random state in `KFold`
- #5190 - `LabelKFold`
- #4583 - `LabelShuffleSplit`
- #5283 - Remove some warnings in grid search tests
- #5300 - shuffle `labels` not `idxs` and tests to ensure it

This PR is slightly based upon @pignacio's work in #3340.
@amueller's hack:

If you want to align diffs you can do this (in an IPython notebook)