
FIX PowerTransformer Yeo-Johnson auto-tuning on significantly non-Gaussian data #20653


Merged 24 commits into scikit-learn:main on Mar 24, 2022

Conversation

@thomasjpfan (Member) commented Aug 1, 2021

Reference Issues/PRs

Fixes #14959
Closes #15385 (supersedes it)

What does this implement/fix? Explain your changes.

Checks the value of the negative log-likelihood and issues a warning. We have common tests and other tests that input this type of data, so I think a warning is okay.

EDIT: this PR now rejects the problematic lambdas that would lead to constant transformed data, which is what caused the problem.
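
For illustration, here is a simplified sketch of the rejection idea (not the exact scikit-learn code; the merged change itself is shown in the review hunk on lines 3248 to 3250 further down). It evaluates the Yeo-Johnson negative log-likelihood for one candidate lambda, using scipy.stats.yeojohnson for the transform, and rejects any lambda whose transform collapses the column:

import numpy as np
from scipy import stats

def neg_log_likelihood(x, lmbda):
    # Sketch only: evaluate the Yeo-Johnson negative log-likelihood for one
    # candidate lambda; the optimizer (brent) minimizes this function.
    x_trans = stats.yeojohnson(x, lmbda=lmbda)
    x_trans_var = x_trans.var()

    # If the transformed column is (numerically) constant, np.log(x_trans_var)
    # would be -inf and the minimizer would pick this degenerate lambda.
    # Returning +inf rejects it instead.
    if x_trans_var < np.finfo(np.float64).tiny:
        return np.inf

    n_samples = x.shape[0]
    loglike = -n_samples / 2 * np.log(x_trans_var)
    loglike += (lmbda - 1) * (np.sign(x) * np.log1p(np.abs(x))).sum()
    return -loglike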

Any other comments?

The original issue has been bumped a few times; let's see if we can resolve it for 1.0.
CC @NicolasHug

@NicolasHug (Member) left a comment

Thanks @thomasjpfan for the PR

With the proposed changes, we'll still get the ZeroDivisionWarnings, right? I'm wondering if erroring would make more sense.

Also, this new logic seems to assume that we can only get an infinite lambda when we have x_trans full of zeros. Are we sure about this?

@ogrisel (Member) left a comment

Here are a few comments.

In retrospect, I wonder if we should set lmbda to np.nan when the brent optimization finds an infinite nll, instead of using an arbitrary lambda value that depends on optimizer details.

For those columns with np.nan lambdas we could then skip the Yeo-Johnson transformation (but keep the subsequent StandardScaler when standardize=True). StandardScaler should be able to deal with near-constant features in a numerically principled way.

We could also have a constructor parameter to silence the warning since, in practice, many users might find it useful to just center constant or near-constant features.
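
A rough sketch of what that alternative might look like per column (purely illustrative, not the merged behaviour; transform_column is a hypothetical helper and scipy.stats.yeojohnson stands in for the actual transform):

import numpy as np
from scipy import stats

def transform_column(x, lmbda):
    # Hypothetical: a column whose lambda could not be estimated is tagged
    # with np.nan and passed through unchanged; the subsequent StandardScaler
    # step (standardize=True) still centers/scales the (near) constant feature.
    if np.isnan(lmbda):
        return x
    return stats.yeojohnson(x, lmbda=lmbda)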

@ogrisel (Member) commented Aug 4, 2021

Actually there are two different cases to handle:

  • First case: the feature is really not constant, but it is significantly non-Gaussian before the transform and becomes constant for some values of lambda explored by the optimizer. This is the case for the original dataset reported by the OP in #14959 (comment), and the proposed solution is probably good in this case: reject those solutions by returning np.inf instead of -np.inf (see the small illustration after this list). Based on the histograms reported in the description of #14959, it seems to yield valid, non-constant results that look approximately Gaussian after the transform.

  • Second case: features where x_trans has zero variance for all possible values of lambda explored by brent (most probably because the input feature is constant or near-constant anyway). Then we could set lambda to np.nan and only standardize those columns, probably with a silenceable warning message.
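
To make the first case concrete, here is a small hypothetical illustration (not the actual data from #14959): a clearly non-constant but heavily skewed feature that becomes exactly constant in float64 under an extreme lambda, so the log of its variance blows up:

import numpy as np
from scipy import stats

# Hypothetical, heavily skewed positive feature spanning several orders of magnitude.
x = np.array([3.0, 40.0, 500.0, 7_000.0, 90_000.0, 1.2e6])

# For a very negative lambda, (x + 1) ** lmbda is vanishingly small, so the
# transform ((x + 1) ** lmbda - 1) / lmbda rounds to exactly -1 / lmbda for
# every sample in float64.
x_trans = stats.yeojohnson(x, lmbda=-50.0)
print(x_trans)        # every entry is exactly 0.02
print(x_trans.var())  # 0.0 -> np.log(0.0) is -inf ("divide by zero encountered in log")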

@thomasjpfan changed the title from "FIX Adds a warning for PowerTransform and pathological data" to "FIX Adds a warning for PowerTransformer and significantly non-Gaussian data" on Aug 16, 2021
@thomasjpfan (Member, Author) commented Aug 16, 2021

Since there are two cases here, I updated this PR to resolve the first case: significantly non-Gaussian data where some lambdas result in constant transformed data.

I will follow up with a PR for the second case:

features where x_trans has zero variance for all possible values of lambda explored by brent,

ogrisel previously approved these changes Aug 20, 2021

@ogrisel (Member) left a comment

LGTM, thanks @thomasjpfan. @NicolasHug do you agree with the analysis and this 2-step strategy?

@ogrisel changed the title from "FIX Adds a warning for PowerTransformer and significantly non-Gaussian data" to "FIX PowerTransformer Yeo-Johnson auto-tuning on significantly non-Gaussian data" on Dec 5, 2021
@ogrisel (Member) commented Dec 5, 2021

Still +1 for this PR. Maybe ping @adrinjalali @glemaitre @jnothman for a second review?

@jnothman (Member) left a comment

Looks great, apart from wanting confidence that try-except is the best way to do it.

@thomasjpfan thomasjpfan added this to the 1.1 milestone Mar 10, 2022
Comment on lines 3248 to 3250
# Reject transformed data that is constant
if x_trans_var < x_tiny:
return np.inf
@jeremiedbb (Member) left a comment

tiny is really small and will probably not catch what should be considered constant data in some cases.
What do you think about using _is_constant_feature (maybe changing the name), which is designed for that?

@thomasjpfan (Member, Author) commented Mar 17, 2022

I updated the comment. This is more for detecting the runtime warning from np.log in the line below. As long as np.log(variance) can be computed, the likelihood can be computed as well.

It turns out that np.log can handle values below tiny as well:

import numpy as np

np.log(np.finfo(np.float64).tiny * 1e-15)
# -742.83

np.log even works down to the smallest subnormal (smallest_subnormal was introduced in NumPy 1.22)

import numpy as np

finfo = np.finfo(np.float64)
finfo.smallest_subnormal
# 5e-324

np.log(finfo.smallest_subnormal)
# -744.44

np.log(finfo.smallest_subnormal * 0.5)
# -inf

This x_trans_var < x_tiny check is used because of a valid threading concern raised here: #20653 (comment). Originally, I caught the runtime warning, but warnings.catch_warnings is not thread-safe.
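
For reference, the runtime warning being guarded against is NumPy's standard divide-by-zero warning from taking the log of a zero variance:

import numpy as np

np.log(np.float64(0.0))
# -inf, with "RuntimeWarning: divide by zero encountered in log"

# Catching it with warnings.catch_warnings() works in a single thread, but
# catch_warnings mutates interpreter-global state, hence the explicit check
# of the variance against a tiny threshold instead.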

@jeremiedbb (Member) commented Mar 17, 2022

I updated the comment. This is more for detecting the runtime warning from np.log in the line below. As long as np.log(variance) can be computed, the likelihood can be computed as well.

What I meant is that even if it is computable, its value would be meaningless. The variance is so small that it lies within the theoretical error bounds, meaning it's indistinguishable from a zero variance.

However, this situation should not appear very often, and even if it does, this lambda would not be the argmin anyway (unless all lambdas lead to constant x), so I'm ok with the tiny solution as well.

@jeremiedbb (Member) left a comment

LGTM

@jeremiedbb jeremiedbb dismissed ogrisel’s stale review March 18, 2022 17:15

This fix is not the same as the initial one. @ogrisel, you might want to take another look.

@ogrisel (Member) left a comment

Still +1. We might want to add proper support for float32 input data later.

@jeremiedbb (Member) commented:

We might want to add proper support for float32 input data later.

numpy.var always uses float64 accumulator

@ogrisel (Member) commented Mar 24, 2022

OK, but we will probably need to increase the test coverage with the global_dtype fixture.

@jeremiedbb jeremiedbb merged commit c3f81c1 into scikit-learn:main Mar 24, 2022
@jeremiedbb (Member) commented:

Thanks @thomasjpfan !

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022
…ssian data (scikit-learn#20653)

Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
@mdhaber (Contributor) commented Apr 10, 2022

@thomasjpfan would you be willing to submit this to SciPy, too, to resolve scipy/scipy#10821?

Development

Successfully merging this pull request may close these issues.

PowerTransformer 'divide by zero encountered in log' + proposed fix
6 participants