ENH Use scipy.stats.yeojohnson in PowerTransformer #27818

xuefeng-xu · 2023-11-21T05:56:33Z

Reference Issues/PRs

Closes #26308

What does this implement/fix? Explain your changes.

Use scipy.stats.yeojohnson instead of our own implementation as @lorentzenchr suggested.

Any other comments?

github-actions · 2023-11-21T05:57:51Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5ed9133. Link to the linter CI: here

xuefeng-xu · 2023-11-21T07:29:44Z

This build failure was solved in scipy/scipy#15998, requires scipy>=1.9.0

glemaitre · 2023-11-22T15:22:40Z

OK, so I would probably wait until we bump the minimum version to scipy 1.9 in this case. Can you report the failure and your findings in the original issue?

lorentzenchr · 2023-11-29T09:41:32Z

@xuefeng-xu For the time being, you can make the new code path conditional on scipy>=1.9 with a todo note to remove the old code once our minimum scipy is >= 1.9.

xuefeng-xu · 2023-11-29T11:19:27Z

@lorentzenchr @glemaitre I have updated the code; it's now ready for review.

lorentzenchr · 2024-04-10T18:23:00Z

Maybe there’s a misunderstanding. Could you make the call to scipy.stats.yeojohnson conditional on the installed scipy version being >= 1.9, and call the old code path otherwise?
Also, please add a test that both versions are equivalent.
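
A minimal sketch of the version gate requested above, under stated assumptions: the helper names (`version_tuple`, `use_scipy_yeojohnson`, `yeo_johnson_fallback`) are hypothetical and are not the code in this PR. The scalar fallback only stands in for the existing in-house transform, and the scipy version is passed as a string so the check can be exercised without scipy installed.

```python
import math
import re


def version_tuple(version: str) -> tuple:
    """Extract (major, minor) from a string such as '1.11.1' or '1.9.0rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def use_scipy_yeojohnson(scipy_version: str) -> bool:
    # TODO: remove this check and the fallback once the minimum scipy is >= 1.9.
    # Compare integer tuples: as plain strings, "1.11" would sort before "1.9".
    return version_tuple(scipy_version) >= (1, 9)


def yeo_johnson_fallback(x: float, lmbda: float) -> float:
    """Scalar Yeo-Johnson transform (hypothetical stand-in for the old code path)."""
    if x >= 0:
        if abs(lmbda) < 1e-12:
            return math.log1p(x)
        return ((x + 1.0) ** lmbda - 1.0) / lmbda
    if abs(lmbda - 2.0) < 1e-12:
        return -math.log1p(-x)
    return -((1.0 - x) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda)


def yeo_johnson(x: float, lmbda: float, scipy_version: str) -> float:
    """Dispatch to scipy when it is recent enough, else use the fallback."""
    if use_scipy_yeojohnson(scipy_version):
        # Imported lazily so the old path never needs the newer scipy.
        from scipy.stats import yeojohnson

        return float(yeojohnson([x], lmbda=lmbda)[0])
    return yeo_johnson_fallback(x, lmbda)
```

In real code the version string would come from `scipy.__version__`. An equivalence test could then evaluate both branches on the same inputs and compare the results within a tolerance.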

xuefeng-xu · 2024-04-11T03:25:13Z

@lorentzenchr Do you mean I should keep the _yeo_johnson_transform function?

Meanwhile, can I use the test below, where X_1col is defined here?

def test_yeojohnson_for_different_scipy_version():
    pt = PowerTransformer(method="yeo-johnson").fit(X_1col)
    assert_almost_equal(pt.lambdas_[0], 0.99546157)

lorentzenchr · 2024-04-11T05:45:02Z

> @lorentzenchr Do you mean I should keep the _yeo_johnson_transform function?

Yes because we still support older scipy versions.

xuefeng-xu · 2024-04-11T06:55:12Z

~~Not sure why the results are different.~~
The X_1col is different when I manually create it, so the results are different. (This seems weird since I use the same RandomState.)

I tested two scipy versions (1.8.0 and 1.11.1), both of them pass the test case.

JaKasb · 2024-11-20T18:39:55Z

Here is a minimal implementation of PowerTransformer, using the scipy.stats methods.
Unlike the scikit-learn variant, yeojohnson_normmax() manages to find lambda values without overflows or errors.
This snippet assumes that X is a pandas DataFrame.

The parallelization does not scale well with the number of cores. It levels off at:
- 3 threads with backend='threading'
- circa 8 threads with backend='loky'

import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from scipy.stats import yeojohnson, yeojohnson_normmax
from sklearn.base import BaseEstimator, OneToOneFeatureMixin, TransformerMixin


# Mixins are listed before BaseEstimator, as scikit-learn recommends.
class YeoJohnsonTransformer(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
    def fit(self, X: pd.DataFrame, y=None):
        self._n_rows, self._n_cols = X.shape

        # Queue one lambda-estimation job per column.
        jobs = []
        for col_idx in range(self._n_cols):
            col = X.iloc[:, col_idx].to_numpy()
            col = col[np.isfinite(col)]  # drop inf and nan
            jobs.append(delayed(yeojohnson_normmax)(col))

        self._lambdas = Parallel(n_jobs=4, backend="threading", verbose=0)(jobs)
        return self

    def transform(self, X: pd.DataFrame):
        # Preallocate a float64 frame with the same index and columns as X.
        X_out = pd.DataFrame(
            data=np.empty(X.shape, dtype=np.float64),
            index=X.index,
            columns=X.columns,
        )

        # Apply each column's fitted lambda.
        for col_idx, lambda_i in enumerate(self._lambdas):
            col = X.iloc[:, col_idx].to_numpy()
            X_out.iloc[:, col_idx] = yeojohnson(col, lmbda=lambda_i)
        return X_out
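
For reference, a sketch of the math behind the two scipy calls used above, as the Yeo-Johnson transform and its profile log-likelihood are usually stated (constants dropped). The symbols $\psi$, $\ell$ and $\hat\sigma^2$ are notation chosen here, not names from the thread. `yeojohnson` applies $\psi$ for a given $\lambda$, and `yeojohnson_normmax` searches for the $\lambda$ that maximizes $\ell$.

```latex
\psi(x; \lambda) =
\begin{cases}
\dfrac{(x + 1)^{\lambda} - 1}{\lambda} & x \ge 0,\ \lambda \ne 0 \\[1ex]
\log(x + 1) & x \ge 0,\ \lambda = 0 \\[1ex]
-\dfrac{(1 - x)^{2 - \lambda} - 1}{2 - \lambda} & x < 0,\ \lambda \ne 2 \\[1ex]
-\log(1 - x) & x < 0,\ \lambda = 2
\end{cases}

\ell(\lambda) = -\frac{n}{2} \log \hat\sigma^2(\lambda)
  + (\lambda - 1) \sum_{i=1}^{n} \operatorname{sign}(x_i) \log\bigl(|x_i| + 1\bigr)
```

Here $\hat\sigma^2(\lambda)$ is the variance of the transformed values $\psi(x_i; \lambda)$. Each column gets its own $\lambda$, which is why the fit step above can run one job per column.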

smarie · 2025-02-04T15:27:35Z

Hello,
It seems that this PR has been pending for a while. However it seems almost complete.

Could you confirm that the only remaining tasks are

- to improve the coverage
- to update the changelog

@lesteve ?

To improve the coverage, we need to test against scipy 1.9. Which strategy do you suggest for this? Is there a specific CI runner to configure to add scipy 1.9?

If needed, we (https://github.com/Projet-open-source, a student project team that I coach) could give it a try.

lesteve · 2025-02-04T16:24:18Z

From a quick look:

- use sklearn/utils/fixes.py to hold the function that switches to the scipy implementation if scipy is recent enough (look at other examples in this file for inspiration)
- add a non-regression test and make sure the changes fix the original issue, for example "Yeo-Johnson Power Transformer gives Numpy warning (and raises scipy.optimize._optimize.BracketError in some cases)" #23319 (comment). According to "ENH Use scipy.stats.yeojohnson in PowerTransformer" #27818 (comment), it seems the test that was added was not a good non-regression test
- add a changelog entry and mention @xuefeng-xu as co-author. Instructions for adding a changelog are in https://github.com/scikit-learn/scikit-learn/tree/main/doc/whats_new/upcoming_changes#readme.

We already have a CI build that tests the minimum scipy version, currently 1.6 according to scikit-learn/sklearn/_min_dependencies.py (line 11 in 9e78dca):

SCIPY_MIN_VERSION = "1.6.0"

xuefeng-xu · 2025-02-04T22:23:22Z

Hi @lesteve , a few questions just to make sure I understand correctly.

> use sklearn/utils/fixes.py

Do you mean I need to move the code inside the _yeo_johnson_optimize function to the fixes.py file? Doing this would also require adding the _neg_log_likelihood and _yeo_johnson_transform functions.

> add a non-regression test

The overflow issue was fixed in scipy 1.12 (see scipy/scipy#18852 (comment)). Do I need to add a similar test using the overflow data?
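
To illustrate the kind of overflow in question, here is a small sketch, assuming only the standard Yeo-Johnson formula for x >= 0. The function names are hypothetical, and the log-space variant is one common way to keep such quantities finite; it is not a description of how scipy 1.12 addresses the issue. With plain Python floats the direct power raises OverflowError, whereas NumPy arrays would typically yield inf with a RuntimeWarning instead.

```python
import math


def naive_positive_branch(x: float, lmbda: float) -> float:
    """Direct evaluation of ((x + 1)**lmbda - 1) / lmbda for x >= 0, lmbda != 0."""
    return ((x + 1.0) ** lmbda - 1.0) / lmbda


def overflows(x: float, lmbda: float) -> bool:
    """Report whether the direct evaluation exceeds the float64 range."""
    try:
        naive_positive_branch(x, lmbda)
    except OverflowError:
        return True
    return False


def log_positive_branch(x: float, lmbda: float) -> float:
    """Natural log of the same quantity, valid for x > 0 and lmbda > 0.

    With t = lmbda * log1p(x):
    log((e**t - 1) / lmbda) = t + log(1 - e**(-t)) - log(lmbda).
    """
    t = lmbda * math.log1p(x)
    return t + math.log1p(-math.exp(-t)) - math.log(lmbda)
```

For example, `overflows(1e200, 2.0)` is true because (1e200 + 1)**2 exceeds the float64 range, while `log_positive_branch(1e200, 2.0)` stays finite.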

lesteve · 2025-02-05T06:57:25Z

@xuefeng-xu sorry I should have been more explicit, my comment was for @smarie. For clarity's sake, @smarie contacted me because he has a small project working with engineering school students to introduce them to open-source contribution.

I came across this PR when I was doing some triage a few weeks ago through #30281 and thought this would be a good PR to finish. My assumption was that, as with many other scikit-learn PRs, too much time had passed since it was opened for the author (in this case you) to still be interested in contributing ...

As I mentioned my plan was to mention you in the changelog and add you as coauthor when the new PR is merged. Hopefully you find this acceptable 🙏.

smarie · 2025-02-05T08:42:57Z

Thanks @lesteve! To be very clear: @xuefeng-xu, if you have enough time to complete all the items mentioned above in the upcoming weeks, please do so - there are plenty of other issues to contribute to in scikit-learn, so we will find something else :).

However, if you do not have the bandwidth in the upcoming weeks, as suggested by @lesteve we will complete the work on your behalf and of course add you as coauthor.

Let us know!

xuefeng-xu · 2025-02-05T09:41:25Z

Thank you @lesteve @smarie ! Please feel free to open a new PR — it's wonderful to hear about the opportunity for students to get involved in open-source contributions.

lorentzenchr · 2025-03-31T06:49:54Z

Who is now working on this issue in which PR? I would really like to get it in.

lesteve · 2025-03-31T07:46:25Z

@lorentzenchr it looks like there was some activity last week in Projet-open-source#3.

My understanding is that @smarie is first working with students in a fork PR and that they will open a scikit-learn PR when it is closer to completion 😉. Once they open the scikit-learn PR, I guess they should probably ping you and me, because we are both interested in helping to get it in 🤞.

smarie · 2025-04-03T09:31:41Z

Indeed @lesteve we're almost there :) but since this is a student project we did a first pass outside of this repo so as not to pollute too much. You should hear from us very shortly

ENH Use scipy.stats.yeojohnson in PowerTransformer

f27e5f9

github-actions bot added the module:preprocessing label Nov 21, 2023

This was referenced Nov 22, 2023

PowerTransformer 'divide by zero encountered in log' + proposed fix #14959

Closed

Use scipy.stats.yeojohnson PowerTransformer #26308

Closed

xuefeng-xu added 2 commits November 29, 2023 18:17

add a branch to check scipy version

9e6de60

Merge remote-tracking branch 'upstream/main' into yeo-johnson

7baa8cc

xuefeng-xu added 2 commits April 11, 2024 14:11

keep orginal _yeo_johnson_transform function

8e41e11

add test

dfe294a

correct the lambda value

5ed9133

lesteve mentioned this pull request Nov 15, 2024

scipy.optimize._optimize.BracketError in some cases of power transformer #30281

Closed

ncaptier mentioned this pull request Feb 10, 2025

Encountered problems during the replication process sysbio-curie/deep-multipit#1

Open

smarie mentioned this pull request Mar 21, 2025

Feature/26308 yeo johnson2 Projet-open-source/scikit-learn#3

Open

yaichm mentioned this pull request Apr 19, 2025

ENH Use scipy Yeo-Johnson implementation in PowerTransformer for scipy >= 1.9 #31227

Merged

lesteve closed this in #31227 May 5, 2025

github-project-automation bot moved this from Open to Closed in @xuefeng-xu's scikit-learn project May 5, 2025

xuefeng-xu deleted the yeo-johnson branch May 6, 2025 09:24
