
FIX PowerTransformer Yeo-Johnson auto-tuning on significantly non-Gaussian data #20653


Merged 24 commits into scikit-learn:main on Mar 24, 2022

Conversation

@thomasjpfan (Member) commented Aug 1, 2021

Reference Issues/PRs

Fixes #14959
Closes #15385 (supersedes it)

What does this implement/fix? Explain your changes.

Checks the value of the negative log-likelihood and issues a warning. We have common tests and other tests that input this type of data, so I think a warning is okay.

EDIT: this PR now rejects the problematic lambdas that would lead to constant transformed data, which is what caused the problem.
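
For illustration, here is a simplified sketch of the rejection idea (not the exact scikit-learn code; the merged change itself is shown in the review hunk on lines 3248 to 3250 further down). It evaluates the Yeo-Johnson negative log-likelihood for one candidate lambda, using scipy.stats.yeojohnson for the transform, and rejects any lambda whose transform collapses the column:

import numpy as np
from scipy import stats

def neg_log_likelihood(x, lmbda):
    # Sketch only: evaluate the Yeo-Johnson negative log-likelihood for one
    # candidate lambda; the optimizer (brent) minimizes this function.
    x_trans = stats.yeojohnson(x, lmbda=lmbda)
    x_trans_var = x_trans.var()

    # If the transformed column is (numerically) constant, np.log(x_trans_var)
    # would be -inf and the minimizer would pick this degenerate lambda.
    # Returning +inf rejects it instead.
    if x_trans_var < np.finfo(np.float64).tiny:
        return np.inf

    n_samples = x.shape[0]
    loglike = -n_samples / 2 * np.log(x_trans_var)
    loglike += (lmbda - 1) * (np.sign(x) * np.log1p(np.abs(x))).sum()
    return -loglike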

Any other comments?

The original issue has been bumped a few times; let's see if we can resolve it for 1.0.
CC @NicolasHug

@NicolasHug (Member) left a comment

Thanks @thomasjpfan for the PR

With the proposed changes, we'll still get the ZeroDivisionWarnings, right? I'm wondering if erroring would make more sense.

Also, this new logic seems to assume that we can only get an infinite lambda when we have x_trans full of zeros. Are we sure about this?

@ogrisel (Member) left a comment

Here are a few comments.

In retrospect, I wonder if we should set lmbda to np.nan when the brent optimization finds an infinite nll, instead of using an arbitrary lambda value that depends on optimizer details.

For those columns with np.nan lambdas we could then skip the Yeo-Johnson transformation (but keep the subsequent StandardScaler when standardize=True). StandardScaler should be able to deal with near-constant features in a numerically principled way.

We could also have a constructor parameter to silence the warning since, in practice, many users might find it useful to just center constant or near-constant features.
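
A rough sketch of what that alternative might look like per column (purely illustrative, not the merged behaviour; transform_column is a hypothetical helper and scipy.stats.yeojohnson stands in for the actual transform):

import numpy as np
from scipy import stats

def transform_column(x, lmbda):
    # Hypothetical: a column whose lambda could not be estimated is tagged
    # with np.nan and passed through unchanged; the subsequent StandardScaler
    # step (standardize=True) still centers/scales the (near) constant feature.
    if np.isnan(lmbda):
        return x
    return stats.yeojohnson(x, lmbda=lmbda)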

@ogrisel (Member) commented Aug 4, 2021

Actually there are two different cases to handle:

  • First case: the feature is really not constant, but it is significantly non-Gaussian before the transform and becomes constant for some values of lambda explored by the optimizer. This is the case for the original dataset reported by the OP in #14959 (comment), and the proposed solution is probably good in this case: reject those solutions by returning np.inf instead of -np.inf (see the small illustration after this list). Based on the histograms reported in the description of #14959, it seems to yield valid, non-constant results that look approximately Gaussian after the transform.

  • Second case: features where x_trans has zero variance for all possible values of lambda explored by brent (most probably because the input feature is constant or near-constant anyway). Then we could set lambda to np.nan and only standardize those columns, probably with a silenceable warning message.
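
To make the first case concrete, here is a small hypothetical illustration (not the actual data from #14959): a clearly non-constant but heavily skewed feature that becomes exactly constant in float64 under an extreme lambda, so the log of its variance blows up:

import numpy as np
from scipy import stats

# Hypothetical, heavily skewed positive feature spanning several orders of magnitude.
x = np.array([3.0, 40.0, 500.0, 7_000.0, 90_000.0, 1.2e6])

# For a very negative lambda, (x + 1) ** lmbda is vanishingly small, so the
# transform ((x + 1) ** lmbda - 1) / lmbda rounds to exactly -1 / lmbda for
# every sample in float64.
x_trans = stats.yeojohnson(x, lmbda=-50.0)
print(x_trans)        # every entry is exactly 0.02
print(x_trans.var())  # 0.0 -> np.log(0.0) is -inf ("divide by zero encountered in log")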

@thomasjpfan changed the title from "FIX Adds a warning for PowerTransform and pathological data" to "FIX Adds a warning for PowerTransformer and significantly non-Gaussian data" on Aug 16, 2021
@thomasjpfan (Member, Author) commented Aug 16, 2021

Since there are two cases here, I updated this PR to resolve the first case: significantly non-Gaussian data where some lambdas result in constant transformed data.

I will follow up with a PR for the second case:

features where x_trans has zero variance for all possible values of lambda explored by brent,

ogrisel previously approved these changes Aug 20, 2021

@ogrisel (Member) left a comment

LGTM, thanks @thomasjpfan. @NicolasHug do you agree with the analysis and this 2-step strategy?

@ogrisel changed the title from "FIX Adds a warning for PowerTransformer and significantly non-Gaussian data" to "FIX PowerTransformer Yeo-Johnson auto-tuning on significantly non-Gaussian data" on Dec 5, 2021
@ogrisel (Member) commented Dec 5, 2021

Still +1 for this PR. Maybe ping @adrinjalali @glemaitre @jnothman for a second review?

@jnothman (Member) left a comment

Looks great, apart from wanting confidence that try-except is the best way to do it.

@thomasjpfan thomasjpfan added this to the 1.1 milestone Mar 10, 2022
Comment on lines 3248 to 3250
# Reject transformed data that is constant
if x_trans_var < x_tiny:
return np.inf
@jeremiedbb (Member) left a comment

tiny is really small and will probably not catch what should be considered constant data in some cases.
What do you think about using _is_constant_feature (maybe changing the name), which is designed for that?

@thomasjpfan (Member, Author) commented Mar 17, 2022

I updated the comment. This is more for detecting the runtime warning from np.log in the line below. As long as np.log(variance) can be computed, the likelihood can be computed as well.

It turns out that np.log can handle values below tiny as well:

import numpy as np

np.log(np.finfo(np.float64).tiny * 1e-15)
# -742.83

np.log even works down to the smallest subnormal (smallest_subnormal was introduced in NumPy 1.22)

import numpy as np

finfo = np.finfo(np.float64)
finfo.smallest_subnormal
# 5e-324

np.log(finfo.smallest_subnormal)
# -744.44

np.log(finfo.smallest_subnormal * 0.5)
# -inf

This x_trans_var < x_tiny check is used because of a valid threading concern raised here: #20653 (comment). Originally, I caught the runtime warning, but warnings.catch_warnings is not thread-safe.
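
For reference, the runtime warning being guarded against is NumPy's standard divide-by-zero warning from taking the log of a zero variance:

import numpy as np

np.log(np.float64(0.0))
# -inf, with "RuntimeWarning: divide by zero encountered in log"

# Catching it with warnings.catch_warnings() works in a single thread, but
# catch_warnings mutates interpreter-global state, hence the explicit check
# of the variance against a tiny threshold instead.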

@jeremiedbb (Member) commented Mar 17, 2022

I updated the comment. This is more for detecting the runtime warning from np.log in the line below. As long as np.log(variance) can be computed, the likelihood can be computed as well.

What I meant is that even if it is computable, its value would be meaningless. The variance is so small that it lies within the theoretical error bounds, meaning it's indistinguishable from a zero variance.

However, this situation should not appear very often, and even if it does, this lambda would not be the argmin anyway (unless all lambdas lead to constant x), so I'm ok with the tiny solution as well.

@jeremiedbb (Member) left a comment

LGTM

@jeremiedbb jeremiedbb dismissed ogrisel’s stale review March 18, 2022 17:15

This fix is not the same as the initial one. @ogrisel, you might want to take another look.

@ogrisel (Member) left a comment

Still +1. We might want to add proper support for float32 input data later.

@jeremiedbb (Member) commented:

We might want to add proper support for float32 input data later.

numpy.var always uses float64 accumulator

@ogrisel (Member) commented Mar 24, 2022

OK, but we will probably need to increase the test coverage with the global_dtype fixture.

@jeremiedbb jeremiedbb merged commit c3f81c1 into scikit-learn:main Mar 24, 2022
@jeremiedbb (Member) commented:

Thanks @thomasjpfan !

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022
…ssian data (scikit-learn#20653)

Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
@mdhaber (Contributor) commented Apr 10, 2022

@thomasjpfan would you be willing to submit this to SciPy, too, to resolve scipy/scipy#10821?

Development

Successfully merging this pull request may close these issues.

PowerTransformer 'divide by zero encountered in log' + proposed fix
6 participants