MRG Add Warning for NaNs in Yeo-Johnson Inverse Transform with Extremely Skewed Data #29307

maf-rnmourao · 2024-06-19T18:29:14Z

Reference Issues/PRs

Fixes #28946

This update triggers a RuntimeWarning when the issue described occurs.

What does this implement/fix? Explain your changes.

Modifies np.errstate(invalid='ignore') to np.errstate(invalid='warn') in the inverse_transform method of the PowerTransformer class (preprocessing/_data.py).
Adds a test case for extremely skewed data in preprocessing/tests/test_data.py.
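
As a minimal illustration of what switching the error state changes (a sketch using plain NumPy, not the scikit-learn code itself; the helper name `count_runtime_warnings` is made up for this example): raising a negative base to a fractional power is an invalid floating-point operation. `np.errstate(invalid="ignore")` silences it, while `np.errstate(invalid="warn")` reports it as a RuntimeWarning.

```python
import warnings
import numpy as np

# A negative base with a fractional exponent yields NaN and sets the "invalid" flag.
base = np.array([-0.5, 2.0])


def count_runtime_warnings(mode):
    """Run np.power under the given errstate mode and count RuntimeWarnings."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        with np.errstate(invalid=mode):
            result = np.power(base, 0.5)
    n_warnings = sum(issubclass(w.category, RuntimeWarning) for w in caught)
    return result, n_warnings


ignored_result, ignored_count = count_runtime_warnings("ignore")
warned_result, warned_count = count_runtime_warnings("warn")
print(ignored_count, warned_count)
```

In both modes the first entry of the result is NaN; only the reporting differs.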

Any other comments?

github-actions · 2024-06-19T18:30:39Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 85b2484. Link to the linter CI: here

maf-rnmourao · 2024-06-21T08:22:06Z

Hi @ogrisel, Could you review the solution? Thanks!

maf-rnmourao · 2024-07-11T08:41:39Z

I believe this change doesn't need an update in the Changelog.

rnmourao · 2024-11-23T09:12:36Z

Hi @ogrisel, @glemaitre, and @thomasjpfan,

Could someone please review this? I believe this is an easy win.

Best Regards,
Roberto

sklearn/preprocessing/tests/test_data.py

thomasjpfan

I'm okay with showing a warning here.

sklearn/preprocessing/tests/test_data.py

thomasjpfan · 2024-11-23T17:38:20Z

sklearn/preprocessing/_data.py

@@ -3428,7 +3428,8 @@ def inverse_transform(self, X):
            "yeo-johnson": self._yeo_johnson_inverse_transform,
        }[self.method]
        for i, lmbda in enumerate(self.lambdas_):
-            with np.errstate(invalid="ignore"):  # hide NaN warnings
+            # raise RuntimeWarning if return NaNs


Do you think it'll be useful to show a more informative warning for this context?

Yes, we can.

The current message (from NumPy, I believe) is:

"invalid value encountered in power"

I was thinking of something like this:

"Some values in the inverse-transformed data are NaN (column 0), which may be due to extreme skewness in the data. Consider addressing outliers before applying the transformation."

The code could be like this:

for i, lmbda in enumerate(self.lambdas_):
    with np.errstate(invalid="ignore"):
        X[:, i] = inv_fun(X[:, i], lmbda)
    if np.isnan(X[:, i]).any():
        # raise UserWarning if return NaNs
        warnings.warn(
            f"""Some values in the inverse-transformed data are NaN (column {i}), which may be due to extreme skewness in the data. Consider addressing outliers before applying the transformation.""",
            UserWarning
        )

Could we check for NaN values only when the exception is raised, to avoid a scan over the data? A try/except might be better then.

Something like this?

for i, lmbda in enumerate(self.lambdas_):
    with np.errstate(invalid='warn'):
        with warnings.catch_warnings(record=True) as captured_warnings:
            X[:, i] = inv_fun(X[:, i], lmbda)
    if captured_warnings and np.isnan(X[:, i]).any():
        warnings.warn(
            f"""Some values in column {i} of the inverse-transformed data are NaN. This may be due to extreme skewness or outliers in the data for this column. Consider addressing these issues, such as removing or imputing outliers, before applying the transformation.""",
            UserWarning
        )

I believe the catch_warnings would fit better than the try-except because we want the results to be returned. I have also included the column number to help users identify where the issue is.
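
A self-contained sketch of this capture-and-re-emit pattern, under stated assumptions: `inverse_power` is a hypothetical stand-in for `inv_fun`, and `inverse_with_warning` is a made-up wrapper, not the scikit-learn method. The `simplefilter("always")` call is an addition of this sketch so that the NumPy warning is recorded regardless of the filters active in the caller.

```python
import warnings
import numpy as np


def inverse_power(column, exponent):
    """Hypothetical stand-in for inv_fun: NaN for negative inputs and fractional exponents."""
    return np.power(column, exponent)


def inverse_with_warning(X, exponents):
    """Apply inverse_power column by column and warn once per column that ends up with NaN."""
    X = X.copy()
    for i, exponent in enumerate(exponents):
        with warnings.catch_warnings(record=True) as captured:
            warnings.simplefilter("always")
            with np.errstate(invalid="warn"):
                X[:, i] = inverse_power(X[:, i], exponent)
        # The NaN scan only runs when NumPy reported something for this column.
        if captured and np.isnan(X[:, i]).any():
            warnings.warn(
                f"Some values in column {i} of the inverse-transformed data are NaN.",
                UserWarning,
            )
    return X


X = np.array([[1.0, -1.0], [4.0, 4.0]])
with warnings.catch_warnings(record=True) as seen:
    warnings.simplefilter("always")
    out = inverse_with_warning(X, [0.5, 0.5])
print([str(w.message) for w in seen])
```

One caveat of this design: any unrelated warning raised inside the recording block is swallowed as well, so the block should stay as small as possible.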

for i, lmbda in enumerate(self.lambdas_):
    with np.errstate(invalid="raise"):
        try:
            X[:, i] = inv_fun(X[:, i], lmbda)
        except Exception:
            if np.isnan(X[:, i]).any():
                warnings.warn(...)

I think that we could still specialize the type of the Exception depending on what is raised. Also, is there any other "invalid" exception that could be raised? In other words, do we really need the check for NaN, or can we always raise the warning?

@glemaitre, I'm avoiding try-except because the code

X[:, i] = inv_fun(X[:, i], lmbda)

would have to execute twice, such as in a finally block. On the other hand, capturing and addressing the warnings ensures that the inv_fun function runs only once.
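
A small NumPy-only sketch of why that is the case (an illustration, not code from the PR): with `np.errstate(invalid="raise")`, `np.power` raises `FloatingPointError` before the assignment happens, so the column keeps its old values and the computation would have to be repeated to obtain the NaN-containing result.

```python
import numpy as np

X = np.array([[-0.5], [2.0]])
original = X.copy()
raised = False

try:
    with np.errstate(invalid="raise"):
        # The negative entry makes np.power signal "invalid", which now raises.
        X[:, 0] = np.power(X[:, 0], 0.5)
except FloatingPointError:
    raised = True

# X was never overwritten because the exception interrupted the assignment.
print(raised, np.array_equal(X, original))
```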

glemaitre

We will need to document the behaviour by adding a fragment in doc/whats_new/upcoming_changes.

I'm not against adding the warning, but I think we can avoid making a pass over the data to check for NaN when it is not necessary.

lesteve · 2024-12-03T10:59:03Z

Would using scipy Yeo-Johnson implementation avoid this issue, see #23319 (comment)? I think this would be the best solution if that's the case ...

rnmourao · 2024-12-03T11:26:45Z

> Would using scipy Yeo-Johnson implementation avoid this issue, see #23319 (comment)? I think this would be the best solution if that's the case ...

True. I used scipy==1.8.1 and numpy==1.23.3.

import numpy as np
from scipy.stats import yeojohnson

x = np.array([1, 1, 1e10])  # Extreme skewness

x_transformed, lmbda = yeojohnson(x)

if lmbda != 0:
    x_inv = np.power(x_transformed * lmbda + 1, 1 / lmbda) - 1
else:
    x_inv = np.exp(x_transformed) - 1

nan_detected = np.isnan(x_inv).any()

print(f"Original data: {x}")
print(f"Transformed data: {x_transformed}")
print(f"Lambda: {lmbda}")
print(f"Inverse-transformed data: {x_inv}")
print(f"NaN detected in inverse-transformed data: {nan_detected}")

Original data: [1.e+00 1.e+00 1.e+10]
Transformed data: [0.67053519 0.67053519 9.25819719]
Lambda: -0.0962322261004418
Inverse-transformed data: [1.e+00 1.e+00 1.e+10]
NaN detected in inverse-transformed data: False

lesteve · 2024-12-03T16:19:49Z

I thought about it a bit more and I don't think using scipy is relevant, because scipy would be used for the transform whereas this problem happens in inverse_transform ...

maf-rnmourao · 2024-12-03T17:09:34Z

> I thought about it a bit more and I don't think using scipy is relevant, because scipy would be used for the transform whereas this problem happens in inverse_transform ...

@lesteve, It looks like SciPy correctly handles the issue based on the test case. However, there is still a bug, and I am not sure when the current code will be replaced by the SciPy version. This PR would be helpful until that happens.

doc/whats_new/upcoming_changes/sklearn.preprocessing/29307.enhancement.rst

thomasjpfan · 2024-12-04T16:01:57Z

sklearn/preprocessing/_data.py

+            with warnings.catch_warnings(record=True) as captured_warnings:
+                with np.errstate(invalid="warn"):
+                    X[:, i] = inv_fun(X[:, i], lmbda)
+            if captured_warnings and np.isnan(X[:, i]).any():


Running np.isnan(X[:, i]) creates a NumPy array of shape (n_samples,), which adds memory overhead.

I know it's more brittle, but can we check that the NumPy warning message contains "invalid value encountered in power" and then raise our own warning?

I created a warning called TransformationFailedWarning and used it for a more straightforward check, as you suggested.

Sorry for being unclear. I mostly wanted to convert np.isnan(X[:, i]).any() to "invalid value encountered in power" in warning_message and leave everything else the same.

Can you revert the new warning object?
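
A hedged sketch of the message-based check being asked for here, using plain NumPy. The helper `power_with_message_check` and the constant `NUMPY_MESSAGE` are names invented for this example. As noted above, matching on the message text is brittle, since it depends on NumPy's wording.

```python
import warnings
import numpy as np

NUMPY_MESSAGE = "invalid value encountered in power"


def power_with_message_check(values, exponent):
    """Return (result, flag); flag is True if NumPy reported an invalid power."""
    with warnings.catch_warnings(record=True) as captured:
        warnings.simplefilter("always")
        with np.errstate(invalid="warn"):
            result = np.power(values, exponent)
    # Inspect the recorded messages instead of scanning the result with np.isnan.
    flagged = any(NUMPY_MESSAGE in str(w.message) for w in captured)
    return result, flagged


bad_result, bad_flag = power_with_message_check(np.array([-0.5, 2.0]), 0.5)
ok_result, ok_flag = power_with_message_check(np.array([0.5, 2.0]), 0.5)
print(bad_flag, ok_flag)
```

This avoids allocating the boolean array of shape (n_samples,) that `np.isnan` would create.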

sklearn/preprocessing/_data.py

xuefeng-xu · 2025-01-21T10:39:33Z

Let’s consider the transformation y=YeoJohnson(x, lmd). For x, any value is valid, but y is bounded within a specific range (see details here). If the input to the inverse transform falls outside this range, the output will be NaN.

A more informative warning message could be:

Input value for the inverse transformation falls outside the valid range of the PowerTransformer output. As a result, the operation will return NaN. Ensure that input values for the inverse transformation lie within the bounded range to avoid this issue.

This warning is more descriptive than a generic message about numerical issues or extremely skewed data.
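
To make the bounded-range argument concrete, here is a small sketch for one branch only, with assumptions stated: for x >= 0 and lambda < 0, the forward transform y = ((x + 1)^lambda - 1) / lambda approaches -1 / lambda from below as x grows, so any y above that bound has no preimage. The helper `yeo_johnson_inverse_nonneg` is a hypothetical name and covers only this branch, not the full scikit-learn inverse.

```python
import numpy as np


def yeo_johnson_inverse_nonneg(y, lmbda):
    """Inverse of the x >= 0 branch of Yeo-Johnson for lmbda != 0."""
    return np.power(y * lmbda + 1, 1 / lmbda) - 1


lmbda = -0.4
upper_bound = -1 / lmbda  # supremum of the transformed values for x >= 0

# Two inputs inside the valid range and one beyond the bound.
y = np.array([0.5, 2.0, 3.0])
with np.errstate(invalid="ignore"):
    x = yeo_johnson_inverse_nonneg(y, lmbda)

print(upper_bound, x)
```

The value beyond the bound makes `y * lmbda + 1` negative, and a negative base with a fractional exponent is what produces the NaN.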

lorentzenchr

This PR needs a merge with the main branch.

doc/whats_new/upcoming_changes/sklearn.preprocessing/29307.enhancement.rst

sklearn/preprocessing/_data.py

sklearn/preprocessing/tests/test_data.py

lesteve · 2025-09-03T07:21:22Z

doc/whats_new/upcoming_changes/sklearn.preprocessing/29307.enhancement.rst

+- The :class:`preprocessing.PowerTransformer` now returns a warning 
+  when NaN values are encountered in the inverse transform, `inverse_transform`, typically 
+  caused by extremely skewed data.
+  By :user:Roberto Mourao <maf-rnmourao>


@maf-rnmourao the rst syntax is not quite right, would you be kind enough to open a PR with the following change 🙏:

By :user:`Roberto Mourao <maf-rnmourao>`

From the dev website:

Hi @lesteve ,

Here is the PR: #32093

Best Regards,

fix issue 28946

1cf77b0

github-actions bot added the module:preprocessing label Jun 19, 2024

rnmourao and others added 2 commits June 20, 2024 14:21

lint

cbac667

Merge branch 'scikit-learn:main' into issue_28946

8ee0689

maf-rnmourao added 2 commits June 21, 2024 12:22

Merge branch 'main' into issue_28946

26750a6

Merge branch 'main' into issue_28946

019238c

maf-rnmourao changed the title ~~[MRG] Add Warning for NaNs in Yeo-Johnson Inverse Transform with Extremely Skewed Data~~ MRG Add Warning for NaNs in Yeo-Johnson Inverse Transform with Extremely Skewed Data Jun 25, 2024

maf-rnmourao changed the title ~~MRG Add Warning for NaNs in Yeo-Johnson Inverse Transform with Extremely Skewed Data~~ FIX Add Warning for NaNs in Yeo-Johnson Inverse Transform with Extremely Skewed Data Jun 25, 2024

maf-rnmourao changed the title ~~FIX Add Warning for NaNs in Yeo-Johnson Inverse Transform with Extremely Skewed Data~~ MRG Add Warning for NaNs in Yeo-Johnson Inverse Transform with Extremely Skewed Data Jun 25, 2024

maf-rnmourao added 5 commits July 15, 2024 23:31

Merge branch 'main' into issue_28946

0d63296

Merge branch 'main' into issue_28946

9a9b7d8

Merge branch 'main' into issue_28946

0916652

Merge branch 'main' into issue_28946

4919f1a

Merge branch 'scikit-learn:main' into issue_28946

36339e5

thomasjpfan reviewed Nov 23, 2024

sklearn/preprocessing/tests/test_data.py (outdated)

thomasjpfan reviewed Nov 23, 2024

maf-rnmourao and others added 4 commits November 23, 2024 21:48

Update sklearn/preprocessing/tests/test_data.py

47037a8

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Update sklearn/preprocessing/tests/test_data.py

2cb8a3e

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Update sklearn/preprocessing/tests/test_data.py

3d1ded9

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Merge branch 'main' into issue_28946

5b4f6b6

glemaitre reviewed Nov 25, 2024

maf-rnmourao and others added 5 commits November 25, 2024 23:09

Merge branch 'main' into issue_28946

15007e2

refined warning message for NaNs in inverse transform

587eb12

linting fixes

01f6aaf

linting fixes

6a98c89

fix whats new number

2552e2a

thomasjpfan reviewed Dec 4, 2024

maf-rnmourao and others added 4 commits December 4, 2024 23:54

Update doc/whats_new/upcoming_changes/sklearn.preprocessing/29307.enh…

c039ee2

…ancement.rst Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

added TransformationFailedWarning; light check for Yeo-Johnson invers…

cd5d8e4

…e transform

Merge branch 'main' into issue_28946

68dfc84

replaced TransformFailedWarning with UserWarning

5041cf9

thomasjpfan reviewed Dec 5, 2024

sklearn/preprocessing/_data.py (outdated)

rnmourao and others added 3 commits December 6, 2024 06:26

checking all warnings

0b6de10

a more elegant test

5f44202

Merge branch 'main' into issue_28946

4814802

thomasjpfan approved these changes Dec 6, 2024

Merge branch 'main' into issue_28946

1875cb9

lorentzenchr reviewed Sep 2, 2025

lorentzenchr added this to the 1.8 milestone Sep 2, 2025

maf-rnmourao and others added 7 commits September 2, 2025 13:18

Update doc/whats_new/upcoming_changes/sklearn.preprocessing/29307.enh…

264c4c4

…ancement.rst Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

Update doc/whats_new/upcoming_changes/sklearn.preprocessing/29307.enh…

54fcae3

…ancement.rst Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

Update sklearn/preprocessing/_data.py

519abe4

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

Update sklearn/preprocessing/_data.py

18bdbea

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

Update sklearn/preprocessing/tests/test_data.py

fbf5798

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

Merge branch 'main' into issue_28946

5b57c0f

lint fixes

85b2484

lorentzenchr merged commit 96f48da into scikit-learn:main Sep 2, 2025
36 checks passed

lesteve reviewed Sep 3, 2025

lesteve mentioned this pull request Sep 3, 2025

MNT Skip test relying on np.seterr for Pyodide #32089

Merged

maf-rnmourao mentioned this pull request Sep 3, 2025

DOC Adjust contributor name in What's New documentation #32093

Merged

jeremiedbb mentioned this pull request Sep 3, 2025

Release 1.7.2 #32092

Open

13 tasks
