[MRG] Refactor pipeline namespace to make it more reusable #11446

Closed
wants to merge 3 commits

Conversation

caioaao
Contributor

@caioaao caioaao commented Jul 5, 2018

Related: #8960
Explanation: #8960 (comment)

Basically, the Pipeline and FeatureUnion classes are really useful for combining models, and making the code more extensible may provide a nice starting point for other features. My personal use case is building a stacked generalization framework on top of it.

Summary of changes

  • Extracted the code that stacks estimators' outputs into a private method (_stack_results);
  • Extracted the snippet responsible for applying weights to results (_apply_weight);
  • Added the functions that were extracted for parallel computation as attributes of FeatureUnion and Pipeline so they can be overridden when needed (see the sketch after this list).
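
As a rough illustration of that last point, a subclass could rebind one of these attributes to reuse the parallel plumbing with its own per-transformer logic. The names StackingUnion and _stacked_transform_one below are made-up examples, not part of this PR:

from sklearn.pipeline import FeatureUnion


def _stacked_transform_one(transformer, X, y, weight, **fit_params):
    # e.g. return out-of-fold predictions instead of a plain transform,
    # which is what a stacked-generalization framework would need
    ...


class StackingUnion(FeatureUnion):
    # _transform_one is exposed as a class attribute by this PR, so a
    # subclass only needs to rebind it; the Parallel machinery is reused.
    _transform_one = staticmethod(_stacked_transform_one)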

@caioaao caioaao changed the title Refactor pipeline namespace to make it more reusable [MRG] Refactor pipeline namespace to make it more reusable Jul 9, 2018
Member

@rth rth left a comment

Thanks for your work on stacking @caioaao !

A few comments are below, but generally this LGTM. Note that if you create a new package that depends on this, there is no guarantee that these private functions won't change in the future.

    return _apply_weight(res, weight)


def _fit_transform_one(transformer, X, y, weight, **fit_params):
Member

I suppose it doesn't work if you leave _fit_transform_one in the same place as it was before?

@property
def _fit_transform_one(self):
    # property needed because `memory.cache` doesn't accept class methods
    return _fit_transform_one
Member

maybe,

_fit_transform_one = property(_fit_transform_one)

above the __init__, to be more concise.

Member

No sorry, I meant,

_fit_transform_one = staticmethod(_fit_transform_one)

Would that work?

Contributor Author

@caioaao caioaao Jul 10, 2018

I'm not sure, but I don't think so, since I think it needs to return the _fit_transform_one function for parallel to work. I'll give it a try anyway.

EDIT: it works :)
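
For clarity, a minimal self-contained sketch of the suggestion that ended up working (the helper body is simplified and the class is a stand-in, not the actual Pipeline code):

# Module-level helper kept as a plain function so that joblib's
# memory.cache and Parallel can work with it (simplified body).
def _fit_transform_one(transformer, X, y, weight, **fit_params):
    res = transformer.fit_transform(X, y, **fit_params)
    return res, transformer


class Pipeline:  # heavily abbreviated stand-in for sklearn.pipeline.Pipeline
    # Rebinding the helper as a staticmethod keeps `self._fit_transform_one`
    # a plain function, which memory.cache and Parallel accept, while still
    # letting subclasses override it.
    _fit_transform_one = staticmethod(_fit_transform_one)


# Accessing the attribute on an instance returns the undecorated function:
assert Pipeline()._fit_transform_one is _fit_transform_one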


@property
def _transform_one(self):
    return _transform_one
Member

Same comment as above. Add a comment that these are defined to facilitate customizing behavior by subclassing.

            for _, trans, _ in self._iter())
        self._update_transformer_list(transformers)
        return self

    def _stack_results(self, Xs):
Member

Decorate with @staticmethod ?

# if we have a weight for this transformer, multiply output
if weight is None:
    return Xt
return Xt * weight
Member

I'm not sure there is really a need to factor the code out into this additional function. Here we only have 2 repetitions with little code, and a rule of thumb (source forgotten) is to factor out after 3.

Contributor Author

looking at the code base right now I'd agree with you, but it won't hurt to keep it as a static method (and I use this as well 😬 )

Member

Yes, but if it's only used in _transform_one and _fit_transform_one you could always overwrite those two functions and define _apply_weight as needed, can't you? (Or copy this function into your package.)

Contributor Author

@caioaao caioaao Jul 11, 2018

I could, but what would be the benefit of doing so? Another benefit of wrapping snippets in functions, besides reusability, is readability. This way you don't need a comment describing the purpose of a snippet.
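
To make the readability point concrete, the extracted helper and its call site might look roughly like this (simplified sketch; the real functions carry extra return values):

def _apply_weight(Xt, weight):
    # Replaces the repeated "if we have a weight for this transformer,
    # multiply output" snippet quoted above.
    return Xt if weight is None else Xt * weight


def _transform_one(transformer, X, y, weight, **fit_params):
    return _apply_weight(transformer.transform(X), weight)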

@caioaao
Contributor Author

caioaao commented Jul 10, 2018

@rth thanks! That's something I hadn't considered, but I guess I'll have to cross that bridge when we come to it. Hopefully we'll have stacking in scikit-learn at that point 🤞

            Xs = np.hstack(Xs)
        return Xs

        return self._stack_results(Xs)
Member

Maybe call this _stack_arrays ?
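
For reference, a minimal sketch of what this helper wraps, assuming the usual dense/sparse stacking in FeatureUnion (a stand-in class, not the exact code of this PR):

import numpy as np
from scipy import sparse


class FeatureUnion:  # abbreviated stand-in; only the helper under discussion is shown
    @staticmethod
    def _stack_results(Xs):
        # Stack per-transformer outputs horizontally, switching to the
        # sparse hstack when any output is a sparse matrix.
        if any(sparse.issparse(X) for X in Xs):
            return sparse.hstack(Xs).tocsr()
        return np.hstack(Xs)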

@amueller
Member

amueller commented Aug 5, 2019

Is this still relevant?

Base automatically changed from master to main January 22, 2021 10:50
@thomasjpfan
Member

Now that stacking is merged in #11047, I do not think this PR is required anymore. With that in mind, I am closing this PR.
