Skip to content

Conversation

maf-rnmourao
Copy link
Contributor

@maf-rnmourao maf-rnmourao commented Jun 19, 2024

Reference Issues/PRs

Fixes #28946

This update triggers a RuntimeWarning when the issue described occurs.

What does this implement/fix? Explain your changes.

  • Modifies np.errstate(invalid='ignore') to np.errstate(invalid='warn') in the inverse_transform method of the PowerTransformer class (preprocessing/_data.py).
  • Adds a test case for extremely skewed data in preprocessing/tests/test_data.py.

Any other comments?

Copy link

github-actions bot commented Jun 19, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 85b2484. Link to the linter CI: here

@maf-rnmourao
Copy link
Contributor Author

Hi @ogrisel, Could you review the solution? Thanks!

@maf-rnmourao maf-rnmourao changed the title [MRG] Add Warning for NaNs in Yeo-Johnson Inverse Transform with Extremely Skewed Data MRG Add Warning for NaNs in Yeo-Johnson Inverse Transform with Extremely Skewed Data Jun 25, 2024
@maf-rnmourao maf-rnmourao changed the title MRG Add Warning for NaNs in Yeo-Johnson Inverse Transform with Extremely Skewed Data FIX Add Warning for NaNs in Yeo-Johnson Inverse Transform with Extremely Skewed Data Jun 25, 2024
@maf-rnmourao maf-rnmourao changed the title FIX Add Warning for NaNs in Yeo-Johnson Inverse Transform with Extremely Skewed Data MRG Add Warning for NaNs in Yeo-Johnson Inverse Transform with Extremely Skewed Data Jun 25, 2024
@maf-rnmourao
Copy link
Contributor Author

I believe this change doesn't need an update in the Changelog.

@rnmourao
Copy link
Contributor

rnmourao commented Nov 23, 2024

Hi @ogrisel, @glemaitre, and @thomasjpfan,

Could someone please review this? I believe this is an easy win.

Best Regards,
Roberto

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with showing a warning here.

@@ -3428,7 +3428,8 @@ def inverse_transform(self, X):
"yeo-johnson": self._yeo_johnson_inverse_transform,
}[self.method]
for i, lmbda in enumerate(self.lambdas_):
with np.errstate(invalid="ignore"): # hide NaN warnings
# raise RuntimeWarning if return NaNs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it'll be useful to show a more informative warning for this context?

Copy link
Contributor Author

@maf-rnmourao maf-rnmourao Nov 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can.

The current message (from Numpy, I believe) is:

"invalid value encountered in power"

I was thinking of something like this:

"Some values in the inverse-transformed data are NaN (column 0), which may be due to extreme skewness in the data. Consider addressing outliers before applying the transformation."

The code could be like this:

for i, lmbda in enumerate(self.lambdas_):
    with np.errstate(invalid="ignore"):
        X[:, i] = inv_fun(X[:, i], lmbda)
        if np.isnan(X[:, i]).any():
            # raise UserWarning if return NaNs
            warnings.warn(
                f"""Some values in the inverse-transformed data are NaN (column {i}), which
                may be due to extreme skewness in the data. Consider addressing outliers 
                before applying the transformation.""",
                UserWarning
            )```

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we only check the nan values when the exception is raised to avoid a scan on the data? A try/except might be better then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this?

        for i, lmbda in enumerate(self.lambdas_):
            with np.errstate(invalid='warn'):
                with warnings.catch_warnings(record=True) as captured_warnings:
                    X[:, i] = inv_fun(X[:, i], lmbda)
            if captured_warnings and np.isnan(X[:, i]).any():
                    warnings.warn(
                        f"""Some values in column {i} of the inverse-transformed data are NaN. This may be due to 
                        extreme skewness or outliers in the data for this column. Consider addressing these issues, 
                        such as removing or imputing outliers, before applying the transformation.""",
                        UserWarning
                    )            

I believe the catch_warnings would fit better than the try-except because we want the results to be returned. I have also included the column number to help users identify where the issue is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for i, lmda in enumerate(self.lambda_):
    with np.errstate(invalid="raise"):
        try:
            X[:, i] = inv_fun(X[:, i], lmbda)
        except Exception:
            if np.isnan(X[:, i]).any():
                warning.warn(...)

I think that we could still specialize the type of the Exception depending on what is raised. Also, is there any other "invalid" exception that would be raised or in other words, do we really need the check for the nan or can we just always raise the warning?

Copy link
Contributor Author

@maf-rnmourao maf-rnmourao Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glemaitre, I'm avoiding try-except because the code

X[:, i] = inv_fun(X[:, i], lmbda)

would have to execute twice, such as in a finally block. On the other hand, capturing and addressing the warnings ensures that the inv_fun function runs only once.

maf-rnmourao and others added 4 commits November 23, 2024 21:48
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need to document the behaviour by adding a fragment in doc/whats_new/upcoming_changes.

I'm not against adding the warning but I think that we can avoid to make a pass checking for the nan when it is not necessary.

@lesteve
Copy link
Member

lesteve commented Dec 3, 2024

Would using scipy Yeo-Johnson implementation avoid this issue, see #23319 (comment)? I think this would be the best solution if that's the case ...

@rnmourao
Copy link
Contributor

rnmourao commented Dec 3, 2024

Would using scipy Yeo-Johnson implementation avoid this issue, see #23319 (comment)? I think this would be the best solution if that's the case ...

True. I used scipy==1.8.1 and numpy==1.23.3.

import numpy as np
from scipy.stats import yeojohnson

x = np.array([1, 1, 1e10])  # Extreme skewness

x_transformed, lmbda = yeojohnson(x)

if lmbda != 0:
    x_inv = np.power(x_transformed * lmbda + 1, 1 / lmbda) - 1
else:
    x_inv = np.exp(x_transformed) - 1

nan_detected = np.isnan(x_inv).any()

print(f"Original data: {x}")
print(f"Transformed data: {x_transformed}")
print(f"Lambda: {lmbda}")
print(f"Inverse-transformed data: {x_inv}")
print(f"NaN detected in inverse-transformed data: {nan_detected}")
Original data: [1.e+00 1.e+00 1.e+10]
Transformed data: [0.67053519 0.67053519 9.25819719]
Lambda: -0.0962322261004418
Inverse-transformed data: [1.e+00 1.e+00 1.e+10]
NaN detected in inverse-transformed data: False

@lesteve
Copy link
Member

lesteve commented Dec 3, 2024

I thought about it a bit more and I don't think using scipy is relevant, because scipy would be used for the transform whereas this problem happens in inverse_transform ...

@maf-rnmourao
Copy link
Contributor Author

I thought about it a bit more and I don't think using scipy is relevant, because scipy would be used for the transform whereas this problem happens in inverse_transform ...

@lesteve, It looks like SciPy correctly handles the issue based on the test case. However, there is still a bug, and I am not sure when the current code will be replaced by the SciPy version. This PR would be helpful until that happens.

with warnings.catch_warnings(record=True) as captured_warnings:
with np.errstate(invalid="warn"):
X[:, i] = inv_fun(X[:, i], lmbda)
if captured_warnings and np.isnan(X[:, i]).any():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running np.isnan(X[:, i]) creates a NumPy array of shape (n_sample, ), which adds memory overhead.

I know it's more brittle, but can we check that the NumPy warning message contains "invalid value encountered in power" and then raise our own warning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a warning called TransformationFailedWarning and used it for a more straightforward check, as you suggested.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being unclear. I mostly wanted to convert np.isnan(X[:, i]).any() to "invalid value encountered in power" in warning_message and leave everything else the same.

Can you revert the new warning object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@xuefeng-xu
Copy link
Contributor

Let’s consider the transformation y=YeoJohnson(x, lmd). For x, any value is valid, but y is bounded within a specific range (see details here). If the input to the inverse transform falls outside this range, the output will be NaN.

A more informative warning message could be:

Input value for the inverse transformation falls outside the valid range of the PowerTransformer output. As a result, the operation will return NaN. Ensure that input values for the inverse transformation lie within the bounded range to avoid this issue.

This warning is more descriptive than a generic message about numerical issues or extremely skewed data.

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR needs a to merge with the main branch.

@lorentzenchr lorentzenchr added this to the 1.8 milestone Sep 2, 2025
maf-rnmourao and others added 7 commits September 2, 2025 13:18
…ancement.rst

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
…ancement.rst

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@lorentzenchr lorentzenchr merged commit 96f48da into scikit-learn:main Sep 2, 2025
36 checks passed
- The :class:`preprocessing.PowerTransformer` now returns a warning
when NaN values are encountered in the inverse transform, `inverse_transform`, typically
caused by extremely skewed data.
By :user:Roberto Mourao <maf-rnmourao>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maf-rnmourao the rst syntax is not quite right, would you be kind enough to open a PR with the following change 🙏:

By :user:`Roberto Mourao <maf-rnmourao>`

From the dev website:
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lesteve ,

Here is the PR: #32093

Best Regards,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Yeo-Johnson inverse_transform fails silently on extreme skew data
7 participants