ENH Use scipy.special.inv_boxcox in PowerTransformer #27875


Merged (7 commits) on Aug 29, 2024

Conversation

xuefeng-xu
Contributor

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Since we already use scipy.special.boxcox, I think we can use scipy.special.inv_boxcox for the inverse too.
https://github.com/scipy/scipy/blob/fcf7b652bc27e47d215557bda61c84d19adc3aae/scipy/special/_boxcox.pxd#L30-L34
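For reference, a minimal round-trip sketch with these two scipy.special functions (illustrative only, not code from this PR):

```python
import numpy as np
from scipy.special import boxcox, inv_boxcox

x = np.array([0.5, 1.0, 2.0])  # Box-Cox requires strictly positive input
lmbda = 0.5
y = boxcox(x, lmbda)           # (x**lmbda - 1) / lmbda
x_back = inv_boxcox(y, lmbda)  # inverse transform recovers the original values
print(np.allclose(x, x_back))  # True
```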

Any other comments?

The Box-Cox transformation:

[Image: $y = (x^\lambda - 1)/\lambda$ for $\lambda \neq 0$, and $y = \ln x$ for $\lambda = 0$]


github-actions bot commented Nov 30, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 072028d.

@xuefeng-xu
Contributor Author

@pytest.mark.parametrize(
    "method, lmbda",
    [
        ("box-cox", 0.1),
        ("box-cox", 0.5),
        ("yeo-johnson", 0.1),
        ("yeo-johnson", 0.5),
        ("yeo-johnson", 1.0),
    ],
)
def test_optimization_power_transformer(method, lmbda):
    # Test the optimization procedure:
    # - set a predefined value for lambda
    # - apply inverse_transform to a normal dist (we get X_inv)
    # - apply fit_transform to X_inv (we get X_inv_trans)
    # - check that X_inv_trans is roughly equal to X

    rng = np.random.RandomState(0)
    n_samples = 20000
    X = rng.normal(loc=0, scale=1, size=(n_samples, 1))

    pt = PowerTransformer(method=method, standardize=False)
    pt.lambdas_ = [lmbda]
    X_inv = pt.inverse_transform(X)

    pt = PowerTransformer(method=method, standardize=False)
    X_inv_trans = pt.fit_transform(X_inv)

    assert_almost_equal(0, np.linalg.norm(X - X_inv_trans) / n_samples, decimal=2)
    assert_almost_equal(0, X_inv_trans.mean(), decimal=1)
    assert_almost_equal(1, X_inv_trans.std(), decimal=1)

I think we need to reconsider the test above, since some values of X are not attainable with the preset lambda.

Recall that the Box-Cox transformation requires $x>0$, and therefore $\lambda y + 1 = x^\lambda > 0$, where $y=(x^\lambda-1)/\lambda$. However, randomly generated data can contain values $y<-2$, and for $\lambda=0.5$ this gives $\lambda y + 1<0$.

The build error occurs while computing the inverse:

import numpy as np

lmbda = 0.5
x = -2.1
print((x * lmbda + 1) ** (1 / lmbda))      # 0.0025000000000000044 (scikit-learn's inverse of Box-Cox)
print(np.exp(np.log1p(lmbda * x) / lmbda)) # nan (scipy's inverse of Box-Cox)

@adrinjalali
Member

adrinjalali commented Aug 13, 2024

In the original PR #10210, inv_boxcox was used (780baf7) and then later removed (df2b372), and I'm not sure why.

Maybe @glemaitre would remember? @jjerphan might also have an idea here.

@jjerphan
Member

Sorry, I don't have any idea nor time to have a look at it at the moment. 😕

@adrinjalali
Member

@xuefeng-xu you're right regarding the test. It would be nice if you could fix the test for this change and improve it. We could also add a test to make sure that if the input is not reasonable, we output nan, and add a bug-fix entry in the changelog.
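A sketch of what such a nan check could assert, exercised against scipy's inv_boxcox directly (illustrative, not the test that was eventually merged):

```python
import numpy as np
from scipy.special import inv_boxcox

lmbda = 0.5
y = -3.0  # lmbda * y + 1 = -0.5 < 0, outside the valid Box-Cox range
assert np.isnan(inv_boxcox(y, lmbda))  # scipy signals invalid input with nan
```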

@xuefeng-xu
Contributor Author

In the original PR #10210, inv_boxcox was used (780baf7) and then later removed (df2b372)

It seems that the minimum supported scipy version didn't have special.inv_boxcox at the time, which is why we implemented it ourselves.
See the discussion between these two commits: #6781 (comment) and #6781 (comment)

I have fixed the broken test and added a new one.

Member

@adrinjalali adrinjalali left a comment

LGTM. This might require a changelog entry, though, since it slightly changes the behavior.

cc @glemaitre

Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR @xuefeng-xu !

@@ -2312,7 +2312,7 @@ def test_power_transformer_lambda_one():
     "method, lmbda",
     [
         ("box-cox", 0.1),
-        ("box-cox", 0.5),
+        ("box-cox", 0.2),
Member

I think the test is not generating valid data for the given value of lambda. Setting lambda=0.2 happens to resolve the issue only because the generated data is all greater than -5 (= -1/0.2).

Can we work around this by clipping X at -1/lambda when the method is box-cox?

X = rng.normal(loc=0, scale=1, size=(n_samples, 1))

if method == "box-cox":
    # Box-Cox validity requires lmbda * y + 1 > 0, i.e. y > -1 / lmbda.
    # Clip the data here to make sure the inequality holds.
    X = np.clip(X, - 1 / lmbda, None)

Contributor Author

Thank you! Clipping the data is a good idea, but we may need to add a small safety margin, e.g., 1e-5:

X = np.clip(X, -1 / lmbda + 1e-5, None)

from scipy.special import inv_boxcox

lmbda = 0.5
print(inv_boxcox(-1 / lmbda, lmbda))         # 0.0 (Box-Cox requires strictly positive data)
print(inv_boxcox(-1 / lmbda + 1e-5, lmbda))  # 2.5e-11
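A quick sanity check (a sketch, not code from the PR) that data clipped with this margin stays inside inv_boxcox's domain:

```python
import numpy as np
from scipy.special import inv_boxcox

rng = np.random.RandomState(0)
lmbda = 0.5
y = rng.normal(size=1000)
y = np.clip(y, -1 / lmbda + 1e-5, None)  # keep lmbda * y + 1 strictly positive
x = inv_boxcox(y, lmbda)
print(np.isnan(x).any())  # False: every clipped value has a finite inverse
print((x > 0).all())      # True: inverses are strictly positive, as Box-Cox requires
```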

Member

@thomasjpfan thomasjpfan Aug 28, 2024

Yup, adding 1e-5 makes sense to me.

Member

@thomasjpfan thomasjpfan left a comment

LGTM

@thomasjpfan thomasjpfan merged commit 6b3f9bd into scikit-learn:main Aug 29, 2024
29 checks passed
@xuefeng-xu xuefeng-xu deleted the inv_boxcox branch August 29, 2024 12:29
MarcBresson pushed a commit to MarcBresson/scikit-learn that referenced this pull request Sep 2, 2024