
[MRG] fix: avoid overflow in Yeo-Johnson power transform #26188


Closed

Conversation


@lsorber lsorber commented Apr 15, 2023

Reference Issues/PRs

Fixes #23319

What does this implement/fix? Explain your changes.

This PR fixes two sources of overflow in the Yeo-Johnson power transform:

  1. RuntimeWarning: overflow encountered in multiply from x_trans_var = x_trans.var()
  2. RuntimeWarning: overflow encountered in power from out[pos] = (np.power(x[pos] + 1, lmbda) - 1) / lmbda

The first type of overflow is caused by np.power. This PR mitigates this type of overflow by replacing all instances of np.power with a numerically more robust formulation based on np.exp.
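As a sketch of the reformulation described above (the exact expression in the PR diff may differ), the positive branch of the Yeo-Johnson transform can be rewritten in log space, since (x + 1)**lmbda == exp(lmbda * log1p(x)). Using log1p and expm1 also preserves precision when the exponent is near zero:

```python
import numpy as np

def yeo_johnson_positive_branch(x, lmbda):
    """Yeo-Johnson transform for x >= 0 and lmbda != 0, in log space.

    The direct form (np.power(x + 1, lmbda) - 1) / lmbda overflows for
    large x and lmbda; the equivalent log-space form below is more robust.
    """
    # (x + 1)**lmbda - 1 == expm1(lmbda * log1p(x))
    return np.expm1(lmbda * np.log1p(x)) / lmbda

x = np.array([0.5, 10.0, 1e3])
print(yeo_johnson_positive_branch(x, 2.0))
```

Both forms agree in regions where the direct formula does not overflow; the log-space form simply degrades more gracefully at the extremes.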

The second type of overflow occurs when the exponents blow up for a marginal gain in log likelihood. This PR mitigates this type of overflow by adding a small regularization term on the exponents.
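One way to picture the second mitigation is as a penalty added to the negative log likelihood that is minimized when fitting lambda. The quadratic penalty and the value of alpha below are illustrative assumptions, not the PR's actual term:

```python
import numpy as np
from scipy import optimize, stats

def fit_lambda_regularized(x, alpha=1e-4):
    """Fit the Yeo-Johnson lambda with a small penalty on its magnitude.

    alpha and the quadratic penalty are hypothetical; they stand in for
    whatever regularization term the PR adds to keep the exponent from
    blowing up for a marginal gain in log likelihood.
    """
    def neg_llf(lmbda):
        # scipy's Yeo-Johnson log likelihood, plus a small penalty on lmbda
        return -stats.yeojohnson_llf(lmbda, x) + alpha * lmbda ** 2

    res = optimize.minimize_scalar(neg_llf, bounds=(-10, 10), method="bounded")
    return res.x
```

With the penalty in place, the fitted lambda trades a tiny amount of likelihood for a bounded exponent, which is exactly why the doctest lambdas shift slightly (as discussed later in the thread).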

Non-regression tests for both types of overflow have been added.

Member

@adrinjalali adrinjalali left a comment


This needs a changelog entry, otherwise LGTM.

@lsorber lsorber force-pushed the fix-power-transformer-overflows branch 2 times, most recently from b19ab96 to 321f145 Compare April 24, 2023 07:04
@lsorber
Author

lsorber commented Apr 24, 2023

Thanks for the review @adrinjalali. I updated this PR with the following changes:

  1. I changed x_trans_exp = np.log(np.abs(x_trans)) to x_trans_exp = np.log(np.abs(x_trans[x_trans != 0])) to avoid RuntimeWarning: divide by zero encountered in log.
  2. I added a changelog entry as requested.
  3. I rebased the PR on the latest upstream main.
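The guard in change 1 can be illustrated in isolation:

```python
import numpy as np

x_trans = np.array([-2.0, 0.0, 3.0])

# log(|0|) emits "RuntimeWarning: divide by zero encountered in log" and
# yields -inf; filtering out exact zeros first avoids both.
x_trans_exp = np.log(np.abs(x_trans[x_trans != 0]))
print(x_trans_exp)
```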

@adrinjalali
Member

Hmm, this is changing some values in a docstring we have (hence the CI failure)

@lsorber lsorber force-pushed the fix-power-transformer-overflows branch from 321f145 to 2261ad3 Compare April 24, 2023 09:19
@adrinjalali
Member

@lsorber please avoid force pushing so that we can see the history of changes. At the end we squash and merge anyway.

@lsorber
Author

lsorber commented Apr 25, 2023

@adrinjalali Apologies, I thought I had rebased on the latest upstream main, but found that wasn't the case, so I did another rebase and force push. I'll avoid doing that from now on.

It seems that the doctests fail because they expect specific lambda values for the given example. However, because of the new regularisation term on the exponent of the transformed variables, the fitted lambdas differ slightly from the expected ones (see below).

How should we proceed from here?

3076 
3077     Examples
3078     --------
3079     >>> import numpy as np
3080     >>> from sklearn.preprocessing import PowerTransformer
3081     >>> pt = PowerTransformer()
3082     >>> data = [[1, 2], [3, 2], [4, 5]]
3083     >>> print(pt.fit(data))
3084     PowerTransformer()
3085     >>> print(pt.lambdas_)
Expected:
    [ 1.386... -3.100...]
Got:
    [ 1.3761854 -3.091368 ]

/home/vsts/work/1/s/sklearn/preprocessing/_data.py:3085: DocTestFailure

@adrinjalali
Member

Maybe @lorentzenchr would have a better idea how to proceed here.

@lorentzenchr
Member

Can we use scipy.stats.yeojohnson (requiring scipy >= 1.2 should be fine) and contribute such fixes upstream in scipy?

@lsorber
Author

lsorber commented Apr 27, 2023

I reviewed scipy's implementation; it is very similar to sklearn's and prone to both of the sources of overflow that this PR addresses.

I am open to applying the improvements of this PR to scipy and then updating this PR to use scipy's implementation. Are you asking me to do this @lorentzenchr? Should we check with the scipy maintainers whether they are open to such a change? The new regularisation term is non-standard and so may meet some resistance.

@adrinjalali
Member

I think the idea is to not have our own implementation, and use scipy's when available, and suggest this fix on the scipy side.

@lorentzenchr
Member

I opened #26308 to rely on scipy.stats.yeojohnson.
I think the improvement of this PR should be contributed upstream in scipy. Therefore, I'm closing this PR.
