Skip to content

FIX avoid parameter validation in init in TweedieRegressor #22124

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

marenwestermann
Copy link
Member

Reference Issues/PRs

Towards #21406

What does this implement/fix? Explain your changes.

I removed the parameter validation of the attribute family in TweedieRegressor.__init__ and changed the tests accordingly.

Any other comments?

The test failure described in #21406 was due to the attribute family being set to TweedieDistribution in init (family=TweedieDistribution(power=power)).

In the module sklearn/linear_model/_glm/glm.py the PoissonRegressor, GammaRegressor and TweedieRegressor have the class GeneralizedLinearRegressor as a shared parent class. In the first two estimators the family parameter is set as a string. I therefore changed the family in TweedieRegressor to be a string as well, however my implementation is not backward compatible.
I introduced the method _get_family_instance in GeneralizedLinearRegressor and TweedieRegressor in order to get the correct distribution for the regressors and to set the power correctly in TweedieRegressor.

ping @adrinjalali @rth

@adrinjalali
Copy link
Member

I find it odd that family is both an attribute and also a property. I'm not sure why we're making it immutable by having those properties really. If the user sets an invalid value, it should raise in fit where validation happens, like other estimators. Maybe @rth remembers why this is done this way?

I'm happy with the change, but as you say, it's backward incompatible. I wonder what @jnothman thinks about it.

This PR also needs a changelog entry.

@ogrisel
Copy link
Member

ogrisel commented Jan 5, 2022

I find it odd that family is both an attribute and also a property. I'm not sure why we're making it immutable by having those properties really. If the user sets an invalid value, it should raise in fit where validation happens, like other estimators.

I share the same feeling. Maybe @lorentzenchr can comment on this PR.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @adrinjalali and to be more specific, I think the following should work although I am not 100% sure.

family_instance : TweedieDistribution
Returns an instance of TweedieDistribution.
"""
return TweedieDistribution(power=self.power)
Copy link
Member

@ogrisel ogrisel Jan 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would be in favor of removing the TweedieRegressor.family property getter and setter and instead validate the consistency of self.family in _get_family_instance which should be only called at fit time instead of __init__ or set_params time:

Suggested change
return TweedieDistribution(power=self.power)
if isinstance(self.family, TweedieDistribution):
# ignore self.power if the user passed a distribution instance directly
# for backward compat
return self.family
elif self.family == "tweedie":
return TweedieDistribution(power=self.power)
else:
raise ValueError(
"TweedieRegressor.family must always be 'tweedie', "
f"got: {self.family!r}"
)

disclaimer: untested, there might be typos.

@rth
Copy link
Member

rth commented Jan 5, 2022

I find it odd that family is both an attribute and also a property. I'm not sure why we're making it immutable by having those properties really. If the user sets an invalid value, it should raise in fit where validation happens, like other estimators.

The issue was that PoissonRegressor and GammaRegressor are subclasses of TweedieRegression which has this public family attribute. So if we don't make this attribute read-only it means that one can inadvertently do,

est = PoissonRegressor()
est.family = "gamma"

and get an estimator that is not actually a PoissonRegressor. We could enforce this in fit, but then it means that TweedieRegressor.fit needs to be aware of its subclasses PoissonRegressor which is a bit contrary to how class inheritance works.

Generally, +1 to have a method _get_family_instance and validate what we can for TweedieRegressor.

I was really convinced we already had this common test for not validating in init when this PR was made. And it's the first time I hear about validation in set_params being inconsistent with the API though it makes sense in retrospective.

@adrinjalali
Copy link
Member

The issue was that PoissonRegressor and GammaRegressor are subclasses of TweedieRegression which has this public family attribute. So if we don't make this attribute read-only it means that one can inadvertently do,

est = PoissonRegressor()
est.family = "gamma"

and get an estimator that is not actually a PoissonRegressor. We could enforce this in fit, but then it means that TweedieRegressor.fit needs to be aware of its subclasses PoissonRegressor which is a bit contrary to how class inheritance works.

I think we do have other estimators which can be tampered with to put them in an odd state. I'd say we don't have to worry about users doing odd things like the one you mention. If we do, we can add a _validate_family which is overwritten by child classes and called in the parent class to make sure the family is in tune with the class itself.

So it seems to me that we have a consensus to remove the family properties, validate only in _get_family_instance, and maybe add a _validate_family?

@rth
Copy link
Member

rth commented Jan 5, 2022

I'm OK with it, maybe let's also wait for @lorentzenchr's feedback.

@lorentzenchr
Copy link
Member

I'm +0, so no objection. Is it possible to rename the family parameter in the subclasses, e.g. to _family to make it "private"?

@adrinjalali
Copy link
Member

I'm not sure if it'd make sense to have a private property to check the validity of the value. If it's private, then people won't be changing it, I guess?

@marenwestermann I think you can go ahead with the changes now.

@lorentzenchr
Copy link
Member

I opened #22548 for very different reasons. But the refactoring there also solves the issue of this PR.

@marenwestermann marenwestermann deleted the tweedie-regressor branch March 29, 2022 12:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants