-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
MRG Add Warning for NaNs in Yeo-Johnson Inverse Transform with Extremely Skewed Data #29307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @ogrisel, Could you review the solution? Thanks! |
I believe this change doesn't need an update in the Changelog. |
Hi @ogrisel, @glemaitre, and @thomasjpfan, Could someone please review this? I believe this is an easy win. Best Regards, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with showing a warning here.
sklearn/preprocessing/_data.py
Outdated
@@ -3428,7 +3428,8 @@ def inverse_transform(self, X): | |||
"yeo-johnson": self._yeo_johnson_inverse_transform, | |||
}[self.method] | |||
for i, lmbda in enumerate(self.lambdas_): | |||
with np.errstate(invalid="ignore"): # hide NaN warnings | |||
# raise RuntimeWarning if return NaNs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it'll be useful to show a more informative warning for this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can.
The current message (from Numpy, I believe) is:
"invalid value encountered in power"
I was thinking of something like this:
"Some values in the inverse-transformed data are NaN (column 0), which may be due to extreme skewness in the data. Consider addressing outliers before applying the transformation."
The code could be like this:
for i, lmbda in enumerate(self.lambdas_):
with np.errstate(invalid="ignore"):
X[:, i] = inv_fun(X[:, i], lmbda)
if np.isnan(X[:, i]).any():
# raise UserWarning if return NaNs
warnings.warn(
f"""Some values in the inverse-transformed data are NaN (column {i}), which
may be due to extreme skewness in the data. Consider addressing outliers
before applying the transformation.""",
UserWarning
)```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we only check the nan
values when the exception is raised to avoid a scan on the data? A try/except might be better then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this?
for i, lmbda in enumerate(self.lambdas_):
with np.errstate(invalid='warn'):
with warnings.catch_warnings(record=True) as captured_warnings:
X[:, i] = inv_fun(X[:, i], lmbda)
if captured_warnings and np.isnan(X[:, i]).any():
warnings.warn(
f"""Some values in column {i} of the inverse-transformed data are NaN. This may be due to
extreme skewness or outliers in the data for this column. Consider addressing these issues,
such as removing or imputing outliers, before applying the transformation.""",
UserWarning
)
I believe the catch_warnings would fit better than the try-except because we want the results to be returned. I have also included the column number to help users identify where the issue is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for i, lmda in enumerate(self.lambda_):
with np.errstate(invalid="raise"):
try:
X[:, i] = inv_fun(X[:, i], lmbda)
except Exception:
if np.isnan(X[:, i]).any():
warning.warn(...)
I think that we could still specialize the type of the Exception
depending on what is raised. Also, is there any other "invalid" exception that would be raised or in other words, do we really need the check for the nan
or can we just always raise the warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@glemaitre, I'm avoiding try-except because the code
X[:, i] = inv_fun(X[:, i], lmbda)
would have to execute twice, such as in a finally
block. On the other hand, capturing and addressing the warnings ensures that the inv_fun
function runs only once.
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will need to document the behaviour by adding a fragment in doc/whats_new/upcoming_changes
.
I'm not against adding the warning but I think that we can avoid to make a pass checking for the nan when it is not necessary.
Would using scipy Yeo-Johnson implementation avoid this issue, see #23319 (comment)? I think this would be the best solution if that's the case ... |
True. I used scipy==1.8.1 and numpy==1.23.3. import numpy as np
from scipy.stats import yeojohnson
x = np.array([1, 1, 1e10]) # Extreme skewness
x_transformed, lmbda = yeojohnson(x)
if lmbda != 0:
x_inv = np.power(x_transformed * lmbda + 1, 1 / lmbda) - 1
else:
x_inv = np.exp(x_transformed) - 1
nan_detected = np.isnan(x_inv).any()
print(f"Original data: {x}")
print(f"Transformed data: {x_transformed}")
print(f"Lambda: {lmbda}")
print(f"Inverse-transformed data: {x_inv}")
print(f"NaN detected in inverse-transformed data: {nan_detected}")
|
I thought about it a bit more and I don't think using scipy is relevant, because scipy would be used for the |
@lesteve, It looks like SciPy correctly handles the issue based on the test case. However, there is still a bug, and I am not sure when the current code will be replaced by the SciPy version. This PR would be helpful until that happens. |
doc/whats_new/upcoming_changes/sklearn.preprocessing/29307.enhancement.rst
Outdated
Show resolved
Hide resolved
sklearn/preprocessing/_data.py
Outdated
with warnings.catch_warnings(record=True) as captured_warnings: | ||
with np.errstate(invalid="warn"): | ||
X[:, i] = inv_fun(X[:, i], lmbda) | ||
if captured_warnings and np.isnan(X[:, i]).any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running np.isnan(X[:, i])
creates a NumPy array of shape (n_sample, )
, which adds memory overhead.
I know it's more brittle, but can we check that the NumPy warning message contains "invalid value encountered in power" and then raise our own warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created a warning called TransformationFailedWarning and used it for a more straightforward check, as you suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being unclear. I mostly wanted to convert np.isnan(X[:, i]).any()
to "invalid value encountered in power" in warning_message
and leave everything else the same.
Can you revert the new warning object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
…ancement.rst Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Let’s consider the transformation y=YeoJohnson(x, lmd). For x, any value is valid, but y is bounded within a specific range (see details here). If the input to the inverse transform falls outside this range, the output will be A more informative warning message could be: Input value for the inverse transformation falls outside the valid range of the PowerTransformer output. As a result, the operation will return This warning is more descriptive than a generic message about numerical issues or extremely skewed data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR needs a to merge with the main branch.
doc/whats_new/upcoming_changes/sklearn.preprocessing/29307.enhancement.rst
Outdated
Show resolved
Hide resolved
doc/whats_new/upcoming_changes/sklearn.preprocessing/29307.enhancement.rst
Outdated
Show resolved
Hide resolved
…ancement.rst Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
…ancement.rst Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
- The :class:`preprocessing.PowerTransformer` now returns a warning | ||
when NaN values are encountered in the inverse transform, `inverse_transform`, typically | ||
caused by extremely skewed data. | ||
By :user:Roberto Mourao <maf-rnmourao> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maf-rnmourao the rst syntax is not quite right, would you be kind enough to open a PR with the following change 🙏:
By :user:`Roberto Mourao <maf-rnmourao>`
From the dev website:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reference Issues/PRs
Fixes #28946
This update triggers a RuntimeWarning when the issue described occurs.
What does this implement/fix? Explain your changes.
Any other comments?