[MRG] fix: avoid overflow in Yeo-Johnson power transform #26188
Conversation
This needs a changelog entry, otherwise LGTM.
(force-pushed from b19ab96 to 321f145)
Thanks for the review @adrinjalali. I updated this PR with the following changes:
Hmm, this is changing some values in a docstring we have (hence the CI failure).
(force-pushed from 321f145 to 2261ad3)
@lsorber please avoid force pushing so that we can see the history of changes. At the end we squash and merge anyway.
@adrinjalali Apologies, I thought I had rebased on the latest upstream. It seems that the doctests fail because they expect specific lambda values for the given example. However, because of the new regularisation term on the exponent of the transformed variables, the fitted lambdas differ slightly from the expected ones (see below). How should we proceed from here?
Maybe @lorentzenchr would have a better idea how to proceed here.
Can we use …
I reviewed scipy's implementation; it is very similar to sklearn's and prone to both of the sources of overflow that this PR addresses. I am open to applying this PR's improvements to scipy and then updating this PR to use scipy's implementation. Are you asking me to do this @lorentzenchr? Should we check with the scipy maintainers whether they are open to such a change? The new regularisation term is non-standard and so may meet some resistance.
I think the idea is to not have our own implementation, to use scipy's when available, and to suggest this fix on the scipy side.
I opened #26308 to rely on scipy.stats.yeojohnson. |
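For context, scipy.stats.yeojohnson can both fit the lambda parameter by maximum likelihood and apply the transform. A minimal usage sketch (the data values here are illustrative, not from the PR):

```python
import numpy as np
from scipy import stats

# Illustrative real-valued sample; Yeo-Johnson accepts negative values too.
x = np.array([-2.0, -1.0, 0.0, 1.0, 5.0, 10.0])

# With lmbda unspecified, scipy estimates lambda by maximum likelihood
# and returns (transformed_data, fitted_lambda).
x_trans, lmbda = stats.yeojohnson(x)

# With an explicit lmbda, only the transformed array is returned.
x_trans_again = stats.yeojohnson(x, lmbda)
```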
Reference Issues/PRs
Fixes #23319
What does this implement/fix? Explain your changes.
This PR fixes two sources of overflow in the Yeo-Johnson power transform:
x_trans_var = x_trans.var()
out[pos] = (np.power(x[pos] + 1, lmbda) - 1) / lmbda
The first type of overflow is caused by np.power. This PR mitigates it by replacing all instances of np.power with a numerically more robust formulation based on np.exp. The second type of overflow occurs when the exponents blow up for a marginal gain in log likelihood. This PR mitigates it by adding a small regularization term on the exponents.
Non-regression tests for both types of overflow have been added.
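To illustrate the first fix, here is a minimal sketch (not the PR's exact code) of rewriting the positive branch of the Yeo-Johnson transform with np.log1p/np.expm1, assuming lmbda != 0. The identity used is (x + 1)**lmbda - 1 == expm1(lmbda * log1p(x)); besides working in log space, expm1 avoids the loss of significance in the "power minus one" subtraction when lmbda is small:

```python
import numpy as np

def yj_positive_power(x, lmbda):
    # Formulation quoted in the PR description: np.power can overflow,
    # and subtracting 1 loses precision when the power is close to 1.
    return (np.power(x + 1, lmbda) - 1) / lmbda

def yj_positive_exp(x, lmbda):
    # Equivalent exp/log formulation: expm1 stays accurate when
    # lmbda * log1p(x) is near zero, and intermediates stay small.
    return np.expm1(lmbda * np.log1p(x)) / lmbda

x = np.array([0.5, 10.0, 1e4])
# Both formulations agree for moderate inputs.
print(yj_positive_power(x, 0.5))
print(yj_positive_exp(x, 0.5))
```

For very small lmbda, yj_positive_exp converges smoothly to log1p(x), the lmbda = 0 branch of the transform, which the power-based formulation does not do reliably in floating point.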