
[MRG] fix: avoid overflow in Yeo-Johnson power transform #26188


Closed

Conversation


@lsorber lsorber commented Apr 15, 2023

Reference Issues/PRs

Fixes #23319

What does this implement/fix? Explain your changes.

This PR fixes two sources of overflow in the Yeo-Johnson power transform:

  1. RuntimeWarning: overflow encountered in multiply from x_trans_var = x_trans.var()
  2. RuntimeWarning: overflow encountered in power from out[pos] = (np.power(x[pos] + 1, lmbda) - 1) / lmbda

The first type of overflow is caused by np.power. This PR mitigates this type of overflow by replacing all instances of np.power with a numerically more robust formulation based on np.exp.
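As a sketch of the reformulation described above (the exact expression in the PR diff may differ), the positive branch of the Yeo-Johnson transform can be rewritten in log space, since (x + 1)**lmbda == exp(lmbda * log1p(x)). Using log1p and expm1 also preserves precision when the exponent is near zero:

```python
import numpy as np

def yeo_johnson_positive_branch(x, lmbda):
    """Yeo-Johnson transform for x >= 0 and lmbda != 0, in log space.

    The direct form (np.power(x + 1, lmbda) - 1) / lmbda overflows for
    large x and lmbda; the equivalent log-space form below is more robust.
    """
    # (x + 1)**lmbda - 1 == expm1(lmbda * log1p(x))
    return np.expm1(lmbda * np.log1p(x)) / lmbda

x = np.array([0.5, 10.0, 1e3])
print(yeo_johnson_positive_branch(x, 2.0))
```

Both forms agree in regions where the direct formula does not overflow; the log-space form simply degrades more gracefully at the extremes.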

The second type of overflow occurs when the exponents blow up for a marginal gain in log likelihood. This PR mitigates this type of overflow by adding a small regularization term on the exponents.
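One way to picture the second mitigation is as a penalty added to the negative log likelihood that is minimized when fitting lambda. The quadratic penalty and the value of alpha below are illustrative assumptions, not the PR's actual term:

```python
import numpy as np
from scipy import optimize, stats

def fit_lambda_regularized(x, alpha=1e-4):
    """Fit the Yeo-Johnson lambda with a small penalty on its magnitude.

    alpha and the quadratic penalty are hypothetical; they stand in for
    whatever regularization term the PR adds to keep the exponent from
    blowing up for a marginal gain in log likelihood.
    """
    def neg_llf(lmbda):
        # scipy's Yeo-Johnson log likelihood, plus a small penalty on lmbda
        return -stats.yeojohnson_llf(lmbda, x) + alpha * lmbda ** 2

    res = optimize.minimize_scalar(neg_llf, bounds=(-10, 10), method="bounded")
    return res.x
```

With the penalty in place, the fitted lambda trades a tiny amount of likelihood for a bounded exponent, which is exactly why the doctest lambdas shift slightly (as discussed later in the thread).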

Non-regression tests for both types of overflow have been added.

Member

@adrinjalali adrinjalali left a comment


This needs a changelog entry, otherwise LGTM.

@lsorber lsorber force-pushed the fix-power-transformer-overflows branch 2 times, most recently from b19ab96 to 321f145 Compare April 24, 2023 07:04
@lsorber
Author

lsorber commented Apr 24, 2023

Thanks for the review @adrinjalali. I updated this PR with the following changes:

  1. I changed x_trans_exp = np.log(np.abs(x_trans)) to x_trans_exp = np.log(np.abs(x_trans[x_trans != 0])) to avoid RuntimeWarning: divide by zero encountered in log.
  2. I added a changelog entry as requested.
  3. I rebased the PR on the latest upstream main.
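The guard in change 1 can be illustrated in isolation:

```python
import numpy as np

x_trans = np.array([-2.0, 0.0, 3.0])

# log(|0|) emits "RuntimeWarning: divide by zero encountered in log" and
# yields -inf; filtering out exact zeros first avoids both.
x_trans_exp = np.log(np.abs(x_trans[x_trans != 0]))
print(x_trans_exp)
```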

@adrinjalali
Member

Hmm, this is changing some values in a docstring we have (hence the CI failure)

@lsorber lsorber force-pushed the fix-power-transformer-overflows branch from 321f145 to 2261ad3 Compare April 24, 2023 09:19
@adrinjalali
Member

@lsorber please avoid force pushing so that we can see the history of changes. At the end we squash and merge anyway.

@lsorber
Author

lsorber commented Apr 25, 2023

@adrinjalali Apologies, I thought I had rebased on the latest upstream main, but found that wasn't the case, so I did another rebase and force push. I'll avoid doing that from now on.

It seems that the doctests fail because they expect specific lambda values for the given example. However, because of the new regularisation term on the exponent of the transformed variables, the fitted lambdas differ slightly from the expected ones (see below).

How should we proceed from here?

3076 
3077     Examples
3078     --------
3079     >>> import numpy as np
3080     >>> from sklearn.preprocessing import PowerTransformer
3081     >>> pt = PowerTransformer()
3082     >>> data = [[1, 2], [3, 2], [4, 5]]
3083     >>> print(pt.fit(data))
3084     PowerTransformer()
3085     >>> print(pt.lambdas_)
Expected:
    [ 1.386... -3.100...]
Got:
    [ 1.3761854 -3.091368 ]

/home/vsts/work/1/s/sklearn/preprocessing/_data.py:3085: DocTestFailure

@adrinjalali
Member

Maybe @lorentzenchr would have a better idea how to proceed here.

@lorentzenchr
Member

Can we use scipy.stats.yeojohnson (requiring scipy >= 1.2 should be fine) and contribute such fixes upstream in scipy?

@lsorber
Author

lsorber commented Apr 27, 2023

I reviewed scipy's implementation; it is very similar to sklearn's and prone to both of the sources of overflow that this PR addresses.

I am open to applying the improvements of this PR to scipy and then updating this PR to use scipy's implementation. Are you asking me to do this @lorentzenchr? Should we check with the scipy maintainers whether they are open to such a change? The new regularisation term is non-standard and so may meet some resistance.

@adrinjalali
Member

I think the idea is to not have our own implementation, and use scipy's when available, and suggest this fix on the scipy side.

@lorentzenchr
Member

I opened #26308 to rely on scipy.stats.yeojohnson.
I think the improvement of this PR should be contributed upstream in scipy. Therefore, I'm closing this PR.
