
[MRG + 1] ENH Ensure consistency in splits and in parameters (without causing memory blowup because of materializing the iterator) #7941


Merged
6 commits merged on Dec 7, 2016

Conversation

raghavrv (Member)

A better patch over my previous #7660 for ensuring consistency of splits (and of parameters) without the additional memory required to store the cv indices (as was done in #7660). This should hopefully address the concerns raised there regarding memory consumption.
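To illustrate the memory concern (a minimal sketch, not code from this PR): materializing all CV splits holds every train/test index array in memory at once, whereas the lazy generator yields one pair at a time but can only be consumed once.

import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((100000, 1))   # imagine a large dataset
cv = KFold(n_splits=10)

splits = list(cv.split(X))  # all 10 (train, test) index-array pairs in memory
lazy = cv.split(X)          # one pair at a time, but a one-shot generator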

@jnothman @GaelVaroquaux @amueller All the tests seem to pass cleanly.

@raghavrv raghavrv added this to the 0.18.2 milestone Nov 25, 2016
-        if self.verbose > 0 and isinstance(parameter_iterable, Sized):
-            n_candidates = len(parameter_iterable)
+        # Regenerate parameter iterable for each fit
+        candidate_params = list(self._param_iterable)
raghavrv (Member Author)

This stores all the dicts as a list, but this should not cause any (additional) memory consumption (compared to 0.18) as it is anyway being stored a couple of lines later. In fact, since we no longer return the parameters and store them in the out list, it should essentially reduce memory consumption.

@raghavrv (Member Author)

Also, I couldn't find a (non-hacky) way to safely avoid using lists in learning_curve, which uses the cv splits multiple times.

One thing I think would be appropriate there (and elsewhere) would be to try iterating the cv twice to check whether it is reentrant, and if not, raise an error asking the user to either pass a reentrant one or materialize it outside and pass it to us as a list, making things more explicit. Or, if you feel it is fine to have it as such for learning_curve alone, then we are good to go here, I think...
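A minimal sketch of such a reentrancy check (is_reentrant is a hypothetical helper, not code from this PR): iterate the splits twice and compare.

import numpy as np
from sklearn.model_selection import KFold

def is_reentrant(cv, X, y=None, groups=None):
    # Consume the split generator twice; a one-shot iterator would yield
    # nothing (or different splits) the second time around.
    first = [(train.copy(), test.copy())
             for train, test in cv.split(X, y, groups)]
    second = list(cv.split(X, y, groups))
    return (len(first) == len(second) and
            all(np.array_equal(tr1, tr2) and np.array_equal(te1, te2)
                for (tr1, te1), (tr2, te2) in zip(first, second)))

X = np.arange(20).reshape(10, 2)
print(is_reentrant(KFold(n_splits=5), X))  # True: KFold regenerates its splits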

This and #7935 should fix most/all of the concerns raised previously while being backward compatible with 0.17...

@@ -532,25 +532,51 @@ def inverse_transform(self, Xt):
         self._check_is_fitted('inverse_transform')
         return self.best_estimator_.transform(Xt)

-    def _fit(self, X, y, groups, parameter_iterable):
-        """Actual fitting, performing the search over parameters."""
+    @property
Member

Why not use polymorphism? This looks like reimplementing polymorphism based on attribute names...

cv_iter = list(cv.split(X, y, groups))

Member

why are we still materializing here?

raghavrv (Member Author)

I don't think we can do away with materializing for learning_curve. (Also, it was like that pre-0.18.)
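For context, a self-contained toy version (not the actual learning_curve code) of the pattern that forces materialization there: the same (train, test) pairs are reused once per training-set size, so a one-shot generator would not do.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20, dtype=float)

cv_iter = list(KFold(n_splits=5).split(X, y))  # materialized for re-use
scores = []
for n_train in (4, 8, 12):                     # training-set sizes
    for train, test in cv_iter:                # re-iterated for every size
        est = LinearRegression().fit(X[train[:n_train]], y[train[:n_train]])
        scores.append(est.score(X[test], y[test]))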

@@ -532,25 +532,51 @@ def inverse_transform(self, Xt):
         self._check_is_fitted('inverse_transform')
         return self.best_estimator_.transform(Xt)

-    def _fit(self, X, y, groups, parameter_iterable):
-        """Actual fitting, performing the search over parameters."""
+    @property
Member

I don't see how this helps anything, and perhaps the previous version was easier to read...?

@raghavrv raghavrv removed this from the 0.18.2 milestone Nov 30, 2016
@amueller amueller added this to the 0.19 milestone Nov 30, 2016
@raghavrv (Member Author)

raghavrv commented Dec 2, 2016

@jnothman It combines the two fit functions and saves us some lines. Could you check whether, after the recent changes, you still feel having separate fit functions is a better option?

@raghavrv (Member Author)

raghavrv commented Dec 2, 2016

@amueller Have used proper polymorphism... One more pass?

@jnothman jnothman left a comment (Member)

Tests aren't running due to a syntax error; otherwise the code looks functionally fine.

I'm still not sure _param_iterable is the right design. It is very anti-functional to store something on the object that then somewhat secretly gets accessed by fit.

    @property
    def _param_iterable(self):
        """To generate parameter iterables for multiple iterations"""
        return ParameterGrid(self.param_grid)
Member

IndentationError

@raghavrv (Member Author)

raghavrv commented Dec 5, 2016

@jnothman I've used an explicit getter function instead of a property... Let me know if this is better. If not, we could revert to the previous setup of separate fit calls per class...

@jnothman (Member)

jnothman commented Dec 5, 2016 via email

@raghavrv (Member Author)

raghavrv commented Dec 5, 2016

> the latest version

By "the latest version", do you mean the PR in its current state? If so, I need to make an abstract @get_parameter_iterable in BaseSearchCV...

@amueller amueller left a comment (Member)

LGTM

-               for parameters in parameter_iterable
-               for train, test in cv_iter)
+               for train, test in cv.split(X, y, groups)
+               for parameters in candidate_params)
Member

return_parameters=False is unrelated, right? But it seems fine.
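The reordering in the diff above matters because cv.split(X, y, groups) returns a one-shot generator: it has to drive the outer loop, while the inner loop runs over a re-iterable list. A toy illustration (not the PR's code):

def one_shot_splits():
    # Stand-in for a cv.split(...) generator that can only be consumed once.
    yield "split-0"
    yield "split-1"

candidate_params = [{"C": 1}, {"C": 10}]  # a plain list can be re-iterated

jobs = [(params, split)
        for split in one_shot_splits()    # outer loop: generator, consumed once
        for params in candidate_params]   # inner loop: list, re-read per split
print(jobs)
# [({'C': 1}, 'split-0'), ({'C': 10}, 'split-0'),
#  ({'C': 1}, 'split-1'), ({'C': 10}, 'split-1')]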

@@ -128,7 +128,6 @@ def cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None,
     X, y, groups = indexable(X, y, groups)

     cv = check_cv(cv, y, classifier=is_classifier(estimator))
-    cv_iter = list(cv.split(X, y, groups))
Member

why did we put this here at all?

@raghavrv (Member Author) Dec 5, 2016

Paranoia on my part, I guess, when I learnt that model_selection doesn't handle one-time cv splitters, and I rushed to push in a fix... :/

@raghavrv raghavrv changed the title [MRG] ENH Ensure consistency in splits and in parameters (without causing memory blowup because of materializing the iterator) [MRG + 1] ENH Ensure consistency in splits and in parameters (without causing memory blowup because of materializing the iterator) Dec 5, 2016
@raghavrv (Member Author)

raghavrv commented Dec 5, 2016

I've added your +1. Could you also approve on GitHub?

@raghavrv (Member Author)

raghavrv commented Dec 5, 2016

Thanks :)

@jnothman (Member)

jnothman commented Dec 5, 2016 via email

@amueller (Member)

amueller commented Dec 6, 2016

I would not make it public, and would not add an abstract method. If someone wants to implement their own, let them. But I don't think there is a lot of worth in adding this to the public interface.

@raghavrv (Member Author)

raghavrv commented Dec 6, 2016

Okay with that. @jnothman, are you +1 for merge then?

@jnothman (Member)

jnothman commented Dec 6, 2016

Without further resolve, we need to make it _get_param_iterator (private)
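For reference, the private-method design being agreed on here would look roughly like this (a sketch consistent with the discussion, not a verbatim excerpt of the final diff; the base class below is a simplified stand-in for the real BaseSearchCV):

from sklearn.model_selection import ParameterGrid, ParameterSampler

class BaseSearchCV:
    # Stand-in: the real fit() would call self._get_param_iterator()
    # each time the candidate parameter settings are needed.
    def fit(self, X, y=None):
        candidate_params = list(self._get_param_iterator())
        ...

class GridSearchCV(BaseSearchCV):
    def __init__(self, param_grid):
        self.param_grid = param_grid

    def _get_param_iterator(self):
        """Return ParameterGrid instance for the given param_grid."""
        return ParameterGrid(self.param_grid)

class RandomizedSearchCV(BaseSearchCV):
    def __init__(self, param_distributions, n_iter=10, random_state=None):
        self.param_distributions = param_distributions
        self.n_iter = n_iter
        self.random_state = random_state

    def _get_param_iterator(self):
        """Return ParameterSampler instance for the given distributions."""
        return ParameterSampler(self.param_distributions, self.n_iter,
                                random_state=self.random_state)

The shared fit then stays in the base class, and each subclass only has to say how its candidate parameters are generated.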

@raghavrv (Member Author)

raghavrv commented Dec 6, 2016

Done... ;)

@jnothman (Member)

jnothman commented Dec 7, 2016

LGTM

@jnothman jnothman merged commit 3dcb873 into scikit-learn:master Dec 7, 2016
@yogeveran

yogeveran commented Dec 12, 2016

cross_val_predict does not work with TimeSeriesSplit.
It raises the error here because some records are never predicted on:

if not _check_is_permutation(test_indices, _num_samples(X)):
    raise ValueError('cross_val_predict only works for partitions')
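A minimal reproduction of this report (a sketch based on the description above, not the reporter's code): TimeSeriesSplit never places the earliest samples in a test fold, so the test folds do not form a partition of X.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_predict

X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.arange(20, dtype=float)

# With 20 samples and n_splits=4, samples 0-3 only ever appear in training
# sets, so the union of test folds is not a permutation of range(20) and
# the call raises: ValueError: cross_val_predict only works for partitions
cross_val_predict(LinearRegression(), X, y, cv=TimeSeriesSplit(n_splits=4))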

@raghavrv (Member Author)

@yogeveran Could you raise a new issue at https://github.com/scikit-learn/scikit-learn/issues? Make sure to include a sample snippet that highlights your issue...

@raghavrv raghavrv deleted the consistency_in_splits branch December 12, 2016 15:34
@ogrisel ogrisel modified the milestones: 0.18.2, 0.19 Jan 7, 2017
@ogrisel (Member)

ogrisel commented Jan 7, 2017

I think we should backport this fix to the 0.18.X branch if we plan to release 0.18.2.

@jnothman (Member)

jnothman commented Jan 7, 2017 via email

The following forks referenced this pull request with the same squashed commit (scikit-learn#7941: "ENH Parallelize by candidates first then by splits." / "ENH do not materialize a cv iterator to avoid memory blow ups."):

sergeyf pushed a commit to sergeyf/scikit-learn on Feb 28, 2017
Sundrique pushed a commit to Sundrique/scikit-learn on Jun 14, 2017
NelleV pushed a commit to NelleV/scikit-learn on Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn on Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn on Nov 15, 2017