ENH Use scipy.special.inv_boxcox in PowerTransformer #27875


Merged (7 commits) on Aug 29, 2024

Conversation

xuefeng-xu
Contributor

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Since we already use scipy.special.boxcox, I think we can use scipy.special.inv_boxcox for the inverse too.
https://github.com/scipy/scipy/blob/fcf7b652bc27e47d215557bda61c84d19adc3aae/scipy/special/_boxcox.pxd#L30-L34
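For reference, a minimal round-trip sketch with these two scipy.special functions (illustrative only, not code from this PR):

```python
import numpy as np
from scipy.special import boxcox, inv_boxcox

x = np.array([0.5, 1.0, 2.0])  # Box-Cox requires strictly positive input
lmbda = 0.5
y = boxcox(x, lmbda)           # (x**lmbda - 1) / lmbda
x_back = inv_boxcox(y, lmbda)  # inverse transform recovers the original values
print(np.allclose(x, x_back))  # True
```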

Any other comments?

The Box-Cox transformation:

[Image: $y = (x^\lambda - 1)/\lambda$ for $\lambda \neq 0$, and $y = \ln x$ for $\lambda = 0$]


github-actions bot commented Nov 30, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 072028d.

@xuefeng-xu
Contributor Author

@pytest.mark.parametrize(
    "method, lmbda",
    [
        ("box-cox", 0.1),
        ("box-cox", 0.5),
        ("yeo-johnson", 0.1),
        ("yeo-johnson", 0.5),
        ("yeo-johnson", 1.0),
    ],
)
def test_optimization_power_transformer(method, lmbda):
    # Test the optimization procedure:
    # - set a predefined value for lambda
    # - apply inverse_transform to a normal dist (we get X_inv)
    # - apply fit_transform to X_inv (we get X_inv_trans)
    # - check that X_inv_trans is roughly equal to X

    rng = np.random.RandomState(0)
    n_samples = 20000
    X = rng.normal(loc=0, scale=1, size=(n_samples, 1))

    pt = PowerTransformer(method=method, standardize=False)
    pt.lambdas_ = [lmbda]
    X_inv = pt.inverse_transform(X)

    pt = PowerTransformer(method=method, standardize=False)
    X_inv_trans = pt.fit_transform(X_inv)

    assert_almost_equal(0, np.linalg.norm(X - X_inv_trans) / n_samples, decimal=2)
    assert_almost_equal(0, X_inv_trans.mean(), decimal=1)
    assert_almost_equal(1, X_inv_trans.std(), decimal=1)

I think we need to reconsider the test above, since some values of X are not attainable with the preset lambda.

Recall that the Box-Cox transformation requires $x>0$, and therefore $\lambda y + 1 = x^\lambda > 0$, where $y=(x^\lambda-1)/\lambda$. However, randomly generated data can contain values $y<-2$, and for $\lambda=0.5$ this gives $\lambda y + 1<0$.

The build error occurs while computing the inverse:

import numpy as np

lmbda = 0.5
x = -2.1
print((x * lmbda + 1) ** (1 / lmbda))      # 0.0025000000000000044 (scikit-learn's inverse of Box-Cox)
print(np.exp(np.log1p(lmbda * x) / lmbda)) # nan (scipy's inverse of Box-Cox)

@adrinjalali
Member

adrinjalali commented Aug 13, 2024

In the original PR #10210, inv_boxcox was used (780baf7) and then later removed (df2b372), and I'm not sure why.

Maybe @glemaitre would remember? @jjerphan might also have an idea here.

@jjerphan
Member

Sorry, I don't have any idea nor time to have a look at it at the moment. 😕

@adrinjalali
Member

@xuefeng-xu you're right regarding the test. It would be nice if you could fix the test for this change and improve it. We could also add a test to make sure that if the input is not reasonable, we output nan, and add a bug-fix entry in the changelog.
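A sketch of what such a nan check could assert, exercised against scipy's inv_boxcox directly (illustrative, not the test that was eventually merged):

```python
import numpy as np
from scipy.special import inv_boxcox

lmbda = 0.5
y = -3.0  # lmbda * y + 1 = -0.5 < 0, outside the valid Box-Cox range
assert np.isnan(inv_boxcox(y, lmbda))  # scipy signals invalid input with nan
```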

@xuefeng-xu
Contributor Author

In the original PR #10210, inv_boxcox was used (780baf7) and then later removed (df2b372)

It seems that the minimum supported scipy version didn't have special.inv_boxcox at the time, which is why we implemented it ourselves.
See the discussion between these two commits: #6781 (comment) and #6781 (comment)

I have fixed the broken test and added a new one.

Member

@adrinjalali adrinjalali left a comment

LGTM. This might require a changelog entry, though, since it slightly changes the behavior.

cc @glemaitre

Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR @xuefeng-xu !

@@ -2312,7 +2312,7 @@ def test_power_transformer_lambda_one():
     "method, lmbda",
     [
         ("box-cox", 0.1),
-        ("box-cox", 0.5),
+        ("box-cox", 0.2),
Member

I think the test is not generating valid data for the given value of lambda. Setting lambda=0.2 happens to resolve the issue only because the generated data is all greater than -5 (= -1/0.2).

Can we work around this by clipping X at -1/lambda when the method is box-cox?

X = rng.normal(loc=0, scale=1, size=(n_samples, 1))

if method == "box-cox":
    # Box-Cox validity requires lmbda * y + 1 > 0, i.e. y > -1 / lmbda.
    # Clip the data here to make sure the inequality holds.
    X = np.clip(X, - 1 / lmbda, None)

Contributor Author

Thank you! Clipping the data is a good idea, but we may need to add a small safety margin, e.g., 1e-5:

X = np.clip(X, -1 / lmbda + 1e-5, None)

from scipy.special import inv_boxcox

lmbda = 0.5
print(inv_boxcox(-1 / lmbda, lmbda))         # 0.0 (Box-Cox requires strictly positive data)
print(inv_boxcox(-1 / lmbda + 1e-5, lmbda))  # 2.5e-11
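A quick sanity check (a sketch, not code from the PR) that data clipped with this margin stays inside inv_boxcox's domain:

```python
import numpy as np
from scipy.special import inv_boxcox

rng = np.random.RandomState(0)
lmbda = 0.5
y = rng.normal(size=1000)
y = np.clip(y, -1 / lmbda + 1e-5, None)  # keep lmbda * y + 1 strictly positive
x = inv_boxcox(y, lmbda)
print(np.isnan(x).any())  # False: every clipped value has a finite inverse
print((x > 0).all())      # True: inverses are strictly positive, as Box-Cox requires
```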

Member

@thomasjpfan thomasjpfan Aug 28, 2024

Yup, adding 1e-5 makes sense to me.

Member

@thomasjpfan thomasjpfan left a comment

LGTM

@thomasjpfan thomasjpfan merged commit 6b3f9bd into scikit-learn:main Aug 29, 2024
29 checks passed
@xuefeng-xu xuefeng-xu deleted the inv_boxcox branch August 29, 2024 12:29
MarcBresson pushed a commit to MarcBresson/scikit-learn that referenced this pull request Sep 2, 2024