[MRG] Refactor pipeline namespace to make it more reusable #11446
Conversation
Thanks for your work on stacking, @caioaao!
A few comments are below, but generally this LGTM. Note that if you create a new package that depends on this, there is no guarantee that these private functions won't change in the future.
```python
    return _apply_weight(res, weight)


def _fit_transform_one(transformer, X, y, weight, **fit_params):
```
I suppose it doesn't work if you leave `_fit_transform_one` in the same place as where it was before?
sklearn/pipeline.py (outdated)

```python
    @property
    def _fit_transform_one(self):
        # property needed because `memory.cache` doesn't accept class methods
        return _fit_transform_one
```
Maybe `_fit_transform_one = property(_fit_transform_one)` above the `__init__`, to be more concise.
No sorry, I meant `_fit_transform_one = staticmethod(_fit_transform_one)`. Would that work?
I'm not sure, but I don't think so, since I think it needs to return the `_fit_transform_one` function for the parallel code to work. I'll give it a try anyway.

EDIT: it works :)
sklearn/pipeline.py (outdated)

```python
    @property
    def _transform_one(self):
        return _transform_one
```
Same comment as above. Add a comment that these are defined to facilitate customizing behavior by subclassing.
sklearn/pipeline.py (outdated)

```python
            for _, trans, _ in self._iter())
        self._update_transformer_list(transformers)
        return self

    def _stack_results(self, Xs):
```
Decorate with `@staticmethod`?
```python
    # if we have a weight for this transformer, multiply output
    if weight is None:
        return Xt
    return Xt * weight
```
I'm not sure there is really a need to factor the code out into this additional function. Here we only have two repetitions with little code, and a rule of thumb (source forgotten) is to factorize only after three.
Looking at the code base right now I'd agree with you, but it won't hurt to keep it as a static method (and I use this as well 😬).
Yes, but if it's only used in `_transform_one` and `_fit_transform_one`, you could always overwrite those two functions and define `_apply_weight` as needed, can't you? (Or copy this function into your package.)
I could, but what would be the benefit of doing so? Another benefit of wrapping snippets in functions, besides reusability, is readability. This way you don't need a comment describing the purpose of a snippet.
@rth thanks! That's something I hadn't considered, but I guess I'll have to cross that bridge when we come to it. Hopefully we'll have stacking in scikit-learn at that point 🤞
sklearn/pipeline.py (outdated)

```python
            Xs = np.hstack(Xs)
        return Xs

        return self._stack_results(Xs)
```
Maybe call this `_stack_arrays`?
> Yes, but if it's only used in `_transform_one` and `_fit_transform_one`, you could always overwrite those two functions and define `_apply_weight` as needed, can't you? (Or copy this function into your package.)

Is this still relevant?
Now that stacking is merged in #11047, I do not think this PR is required anymore. With that in mind, I am closing this PR.
Related: #8960
Explanation: #8960 (comment)
Basically, the `Pipeline` and `FeatureUnion` classes are really useful for combining models, and making the code more extensible may provide a nice starting point for other features. My personal use case is building a stacked generalization framework off of it.

Summary of changes:

- extracted private helper functions (such as `_stack_results`);
- exposed the helpers as methods/properties on `FeatureUnion` and `Pipeline` so they can be overridden when needed.
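The extensibility goal can be illustrated with a toy sketch: once the helpers are exposed as overridable class attributes, a subclass can change how outputs are combined without copying the rest of the class. `FeatureUnion` below is a hypothetical minimal stand-in, not the real sklearn class, and plain lists stand in for arrays.

```python
class FeatureUnion:
    @staticmethod
    def _stack_results(Xs):
        # Default: concatenate the per-transformer outputs feature-wise.
        return [x for X in Xs for x in X]

    def transform(self, Xs):
        # `Xs` models the list of per-transformer outputs.
        return self._stack_results(Xs)


class AveragingUnion(FeatureUnion):
    # A stacking-style variant: average the base outputs element-wise
    # instead of concatenating them.
    @staticmethod
    def _stack_results(Xs):
        n = len(Xs)
        return [sum(col) / n for col in zip(*Xs)]


print(FeatureUnion().transform([[1, 2], [3, 4]]))    # [1, 2, 3, 4]
print(AveragingUnion().transform([[1, 2], [3, 4]]))  # [2.0, 3.0]
```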