FIX PowerTransformer Yeo-Johnson auto-tuning on significantly non-Gaussian data #20653
Conversation
Thanks @thomasjpfan for the PR
With the proposed changes, we'll still get the ZeroDivisionWarnings, right? I'm wondering if erroring would make more sense.
Also, this new logic seems to assume that we can only get an infinite lambda when we have x_trans full of zeros. Are we sure about this?
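To make the failure mode concrete, here is a minimal sketch (the `yeo_johnson` helper below is my own illustration, not scikit-learn's private implementation) of how an extreme lambda can collapse a strongly skewed column into an exactly constant one, which makes `np.log(var)` blow up in the negative log-likelihood:

```python
import numpy as np

def yeo_johnson(x, lmbda):
    # Yeo-Johnson transform, x >= 0 branch only (enough for this sketch)
    if lmbda == 0:
        return np.log1p(x)
    return (np.power(x + 1.0, lmbda) - 1.0) / lmbda

# Strongly non-Gaussian (heavily right-skewed) positive data
x = np.array([1e3, 1e4, 1e5, 1e6])

# For a very negative lambda, (x + 1) ** lmbda is far below the float64
# resolution of 1.0, so (pow - 1.0) rounds to exactly -1.0 for every
# sample: the transformed column is the constant -1.0 / lmbda.
out = yeo_johnson(x, -60.0)
print(out)          # every entry is exactly 1/60
print(np.var(out))  # exactly 0.0 -> np.log(var) is -inf in the NLL
```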
Here are a few comments.
In retrospect, I wonder if we should set `lmbda = np.nan` when the Brent optimization finds an infinite nll, instead of using an arbitrary lambda value that depends on optimizer details. For those columns with `np.nan` lambdas we could then skip the Yeo-Johnson transformation (but keep the subsequent `StandardScaler` when `standardize=True`). `StandardScaler` should be able to deal with near-constant features in a numerically principled way.

We could also have a constructor parameter to silence the warning since, in practice, many users might find it useful to just center constant or near-constant features.
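The proposed fallback could look roughly like this (a sketch under my own naming, not the actual scikit-learn code): columns whose fitted lambda is NaN pass through untransformed, and the standardization step handles the constant column by centering only:

```python
import numpy as np

def transform_column(x, lmbda):
    # Skip Yeo-Johnson when the fitted lambda is NaN (x >= 0 branch only,
    # for brevity); names here are illustrative.
    if np.isnan(lmbda):
        return x.copy()  # pass the column through untouched
    if lmbda == 0:
        return np.log1p(x)
    return (np.power(x + 1.0, lmbda) - 1.0) / lmbda

def standardize(x):
    # StandardScaler-style centering/scaling: a near-zero scale is replaced
    # by 1.0, so constant columns are only centered (no division blow-up).
    mean, std = x.mean(), x.std()
    if std < np.finfo(x.dtype).eps * max(abs(mean), 1.0):
        std = 1.0
    return (x - mean) / std

col = np.full(10, 3.0)  # constant feature
out = standardize(transform_column(col, np.nan))
print(out)  # all zeros: centered only, numerically well behaved
```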
Actually there are two different cases to handle:
Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
Since there are two cases here, I updated this PR to resolve the first case: significantly non-Gaussian data where some lambdas result in constant transformed data. I will follow up with a PR for the second case.
LGTM, thanks @thomasjpfan. @NicolasHug do you agree with the analysis and this 2-step strategy?
Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
Still +1 for this PR. Maybe ping @adrinjalali @glemaitre @jnothman for a second review?
Looks great, apart from wanting confidence that try/except is the best way to do it.
Force-pushed from 9d7ca4e to f128af9
sklearn/preprocessing/_data.py (outdated)

```python
# Reject transformed data that is constant
if x_trans_var < x_tiny:
    return np.inf
```
`tiny` is really small and will probably not catch what should be considered constant data in some cases. What do you think about using `_is_constant_feature` (maybe changing the name), which is designed for that?
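For illustration (the helper below is my own sketch, not scikit-learn's actual `_is_constant_feature`), a relative check catches large-scale data that is constant up to rounding error, which an absolute `tiny` threshold never flags:

```python
import numpy as np

def is_constant_relative(x):
    # Sketch of a relative check: is the variance within the rounding-error
    # bounds implied by the mean's magnitude and float64 precision?
    eps = np.finfo(x.dtype).eps
    return x.var() <= x.size * (eps * abs(x.mean())) ** 2

rng = np.random.default_rng(0)
# Data at scale 1e12 whose variation is comparable to float64 rounding error
x = 1e12 + rng.normal(0.0, 1e-4, size=1000)

print(x.var() < np.finfo(np.float64).tiny)  # False: absolute check misses it
print(is_constant_relative(x))              # True: relative check flags it
```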
I updated the comment. This is more for detecting the runtime warning in `np.log` in the line below. As long as `np.log(variance)` can be computed, the likelihood can be computed as well. It turns out that `np.log` can handle values below `tiny`:

```python
import numpy as np

np.log(np.finfo(np.float64).tiny * 1e-15)
# -742.83
```

`np.log` even works down to the smallest subnormal (`smallest_subnormal` was introduced in NumPy 1.22):

```python
import numpy as np

finfo = np.finfo(np.float64)
finfo.smallest_subnormal
# 5e-324
np.log(finfo.smallest_subnormal)
# -744.44
np.log(finfo.smallest_subnormal * 0.5)
# -inf
```

The `x_trans_var < x_tiny` check is used because of a valid threading concern raised here: #20653 (comment). Originally, I caught the runtime warning, but `warnings.catch_warnings` is not thread-safe.
> I updated the comment. This is more for detecting the runtime warning in `np.log` in the line below. As long as the `np.log(variance)` can be computed the likelihood can be computed as well.

What I meant is that even if computable, its value would be meaningless. The variance is so small that it lies within the theoretical error bounds, meaning it's indistinguishable from a zero variance.

However, this situation should not appear very often, and even if it does, this lambda would not be the argmin anyway (unless all lambdas lead to constant x), so I'm ok with the `tiny` solution as well.
LGTM
This fix is not the same as the initial one. @ogrisel you might want to take another look
Still +1. We might want to add proper support for float32 input data later.
`numpy.var` always uses float64 accumulator
Ok, but we will probably need to increase the test coverage with the
Thanks @thomasjpfan!
…ssian data (scikit-learn#20653) Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
@thomasjpfan would you be willing to submit this to SciPy, too, to resolve scipy/scipy#10821?
Reference Issues/PRs
Fixes #14959
Closes #15385 (supersedes)
What does this implement/fix? Explain your changes.
Checks the value of the negative log-likelihood and issues a warning. We have common tests and other tests that input this type of data, so I think a warning is okay. EDIT: this PR now rejects the problematic lambda that would lead to the constant transformed data causing the problem.
Any other comments?
The original issue has been bumped a few times; let's see if we can resolve it for 1.0.
CC @NicolasHug