[MRG + 1] ENH Ensure consistency in splits and in parameters (without causing memory blowup because of materializing the iterator) #7941
Conversation
-        if self.verbose > 0 and isinstance(parameter_iterable, Sized):
-            n_candidates = len(parameter_iterable)
+        # Regenerate parameter iterable for each fit
+        candidate_params = list(self._param_iterable)
This stores all the dicts as a list, but this should not cause any (additional) memory consumption (compared to 0.18) as it is anyway being stored a couple of lines later. In fact, since we no longer return the parameters and store them at `out`, it should essentially reduce the memory consumption.
Also, I couldn't find a (non-hacky) way to safely avoid using lists here. One thing I think would be appropriate there (and elsewhere) would be to try iterating the cv twice to check if it is reentrant, and raise an error asking users to pass a reentrant one, or to materialize it outside and pass it to us as a list, making things more explicit. Or if you feel it is fine to have it as such for now... This and #7935 should fix most/all of the concerns raised previously while being backward compatible with 0.17...
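For illustration, a minimal sketch of the reentrancy check suggested above (`_check_reentrant` is a hypothetical helper, not part of this PR): iterate the splits twice and compare.

```python
def _check_reentrant(cv_splits):
    """Hypothetical helper: raise if an iterable of (train, test) splits
    cannot be consumed twice, i.e. if it is a one-shot generator."""
    first = [(train.tolist(), test.tolist()) for train, test in cv_splits]
    second = [(train.tolist(), test.tolist()) for train, test in cv_splits]
    if first != second:
        raise ValueError(
            "cv yields different (or no) splits when iterated twice; "
            "pass a re-entrant splitter, or materialize the splits "
            "yourself with list(cv.split(X, y, groups)).")
```

Note that this costs two full passes over the splits, which is exactly the work this PR tries to avoid by only ever iterating them once.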
@@ -532,25 +532,51 @@ def inverse_transform(self, Xt):
         self._check_is_fitted('inverse_transform')
         return self.best_estimator_.transform(Xt)

-    def _fit(self, X, y, groups, parameter_iterable):
-        """Actual fitting, performing the search over parameters."""
+    @property
Why not use polymorphism? This looks like reimplementing polymorphism based on attribute names...
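For concreteness, a sketch of the polymorphic alternative (class names simplified and illustrative; `ParameterGrid` and `ParameterSampler` are the real scikit-learn classes): each subclass overrides a method that returns a fresh, re-entrant parameter iterable.

```python
from abc import ABCMeta, abstractmethod

from sklearn.model_selection import ParameterGrid, ParameterSampler


class BaseSearch(metaclass=ABCMeta):
    @abstractmethod
    def _get_param_iterator(self):
        """Return a fresh iterable over candidate parameter settings."""

    def fit(self, X, y=None):
        # Materializing is cheap here: candidate dicts are small, and a
        # list can safely be re-iterated once per CV split.
        candidate_params = list(self._get_param_iterator())
        ...


class GridSearch(BaseSearch):
    def __init__(self, param_grid):
        self.param_grid = param_grid

    def _get_param_iterator(self):
        # A ParameterGrid is re-entrant: each iteration starts from scratch.
        return ParameterGrid(self.param_grid)


class RandomizedSearch(BaseSearch):
    def __init__(self, param_distributions, n_iter=10, random_state=None):
        self.param_distributions = param_distributions
        self.n_iter = n_iter
        self.random_state = random_state

    def _get_param_iterator(self):
        return ParameterSampler(self.param_distributions, self.n_iter,
                                random_state=self.random_state)
```

This is close to where the discussion below ends up: an overridable getter rather than an attribute that `fit` accesses implicitly.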
    cv_iter = list(cv.split(X, y, groups))
Why are we still materializing here?
I don't think we can do away with materializing for `learning_curve`. (Also, it was like that pre-0.18.)
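To see why: `learning_curve` reuses every (train, test) split once per training-set size, so the splits must be re-iterable. A simplified sketch (`scores_per_size` is hypothetical, not the actual implementation):

```python
from sklearn.base import clone


def scores_per_size(estimator, X, y, cv, train_sizes_abs, groups=None):
    # Materialize once: each split is reused for every training-set size,
    # so a one-shot generator would be exhausted after the first size.
    cv_iter = list(cv.split(X, y, groups))
    scores = []
    for n_train in train_sizes_abs:
        for train, test in cv_iter:
            sub_train = train[:n_train]  # grow the training set gradually
            est = clone(estimator).fit(X[sub_train], y[sub_train])
            scores.append((n_train, est.score(X[test], y[test])))
    return scores
```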
I don't see how this helps anything and perhaps the previous version was easier to read...?
@jnothman It combines the two fit functions and saves us some lines. Could you check if, after the recent changes, you still feel having separate fit functions is a better option?
@amueller Have used proper polymorphism... One more pass?
Tests aren't running due to a syntax error; otherwise the code looks functionally fine.
I'm still not sure `_param_iterable` is the right design. It is very anti-functional to store something on the object that then somewhat-secretly gets accessed by `fit`.
    @property
    def _param_iterable(self):
        """To generate parameter iterables for multiple iterations"""
        return ParameterGrid(self.param_grid)
`IndentationError`
@jnothman I've used an explicit getter function instead of `property`... Let me know if this is better. If not, we could revert back to the previous setup of separate fit calls per class...
If we want to encourage users to make their own `BaseSearchCV` extensions, I suppose the latest version is best...?
By latest version, you mean the PR in its current state? If yes, I need to make an abstract `get_parameter_iterable`?
LGTM
-                for parameters in parameter_iterable
-                for train, test in cv_iter)
+                for train, test in cv.split(X, y, groups)
+                for parameters in candidate_params)
`return_parameters=False` is unrelated, right? But it seems fine.
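The reordering above is the heart of the fix: with the splits in the outer loop, the possibly one-shot `cv.split(...)` generator is consumed exactly once, while only the small, already-materialized `candidate_params` list is re-iterated. A plain-loop sketch of the pattern (`fit_all_candidates` is hypothetical; the real code dispatches `_fit_and_score` calls through `joblib.Parallel`):

```python
from sklearn.base import clone
from sklearn.model_selection import ParameterGrid


def fit_all_candidates(estimator, X, y, cv, param_grid):
    candidate_params = list(ParameterGrid(param_grid))  # small; safe to store
    results = []
    # Outer loop over splits: cv.split() is iterated exactly once, so no
    # (train, test) index arrays need to be stored up front.
    for train, test in cv.split(X, y):
        # Inner loop over candidates: a list, so re-iterating is free.
        for parameters in candidate_params:
            est = clone(estimator).set_params(**parameters)
            est.fit(X[train], y[train])
            results.append((parameters, est.score(X[test], y[test])))
    return results
```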
@@ -128,7 +128,6 @@ def cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None,
     X, y, groups = indexable(X, y, groups)

     cv = check_cv(cv, y, classifier=is_classifier(estimator))
-    cv_iter = list(cv.split(X, y, groups))
Why did we put this here at all?
Paranoia on my part, I guess, when I learnt `model_selection` doesn't handle one-time cv splitters and rushed to push in a fix... :/
Have added your +1. Could you also approve on GitHub?
Thanks :)
Yes, make an abstract `get_parameter_iterable`. But actually, it should be `get_parameter_iterator`, no? Whether we then make `BaseSearchCV` public (!), I'm not sure...
I would not make it public, and would not add an abstract method. If someone wants to implement their own, let them. But I don't think there is a lot of worth in adding this to the public interface.
Okay with that. @jnothman, are you +1 for merge then?
Without further resolve, we need to make it private.
Done... ;)
LGTM
`cross_val_predict` does not work with `TimeSeriesSplit`.
@yogeveran Could you raise a new issue at https://github.com/scikit-learn/scikit-learn/issues? Make sure to include a sample snippet which highlights your issue...
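For reference, a minimal snippet of the kind being requested (our reading, stated as an assumption pending that issue: `cross_val_predict` needs every sample to land in exactly one test fold, which `TimeSeriesSplit` does not provide):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_predict

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = X.sum(axis=1)

# TimeSeriesSplit's test folds never include the earliest samples, so
# cross_val_predict cannot build one out-of-fold prediction per sample
# and is expected to raise a ValueError here.
cross_val_predict(Ridge(), X, y, cv=TimeSeriesSplit(n_splits=4))
```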
I think we should backport this fix to the 0.18.X branch if we plan to release 0.18.2.
Perhaps, although the only time we've seen the issue raised, this was the wrong diagnosis.
… (scikit-learn#7941): ENH Parallelize by candidates first, then by splits; ENH do not materialize a cv iterator to avoid memory blow-ups.
A better patch over my previous #7660 for ensuring consistency of splits (and in parameters) without the additional memory required to store the cv indices (as was done in #7660). This should hopefully address the concerns raised there regarding memory consumption.
@jnothman @GaelVaroquaux @amueller All the tests seem to pass cleanly.