Skip to content

ENH Use scipy.stats.yeojohnson in PowerTransformer #27818

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

xuefeng-xu
Copy link
Contributor

Reference Issues/PRs

Closes #26308

What does this implement/fix? Explain your changes.

Use scipy.stats.yeojohnson instead of our own implementation as @lorentzenchr suggested.

Any other comments?

Copy link

github-actions bot commented Nov 21, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5ed9133. Link to the linter CI: here

@xuefeng-xu
Copy link
Contributor Author

This build failure was solved in scipy/scipy#15998, requires scipy>=1.9.0

@glemaitre
Copy link
Member

OK, so I would probably wait that we bump the minimum version to scipy 1.9 in this case. Can you report the failure and your finding in the original issue?

@lorentzenchr
Copy link
Member

lorentzenchr commented Nov 29, 2023

@xuefeng-xu For the time being, you can make the new code path conditional on scipy>=1.9 with a todo note to remove the old code once our minimum scipy is >= 1.9.

@xuefeng-xu
Copy link
Contributor Author

@lorentzenchr @glemaitre I have updated the code, now it's ready for review.

@lorentzenchr
Copy link
Member

Maybe, there’s a misunderstanding. Could make the call to scipy.yeojohnson conditional on the installes scipy version >= 1.9 and call the old code path otherwise.
Also, please add a test that both versions are equivalent.

@xuefeng-xu
Copy link
Contributor Author

@lorentzenchr Do you mean I should keep the _yeo_johnson_transform function?

Meanwhile, can I use this test below, where X_1col is defined here

def test_yeojohnson_for_different_scipy_version():
    pt = PowerTransformer(method="yeo-johnson").fit(X_1col)
    assert_almost_equal(pt.lambdas_[0], 0.99546157)

@lorentzenchr
Copy link
Member

lorentzenchr commented Apr 11, 2024

@lorentzenchr Do you mean I should keep the _yeo_johnson_transform function?

Yes because we still support older scipy versions.

@xuefeng-xu
Copy link
Contributor Author

xuefeng-xu commented Apr 11, 2024

Not sure why the results are different.
The X_1col is different when I manually create it, so the results are different. (This seems weird since I use the same RandomState.)

I tested two scipy versions (1.8.0 and 1.11.1), both of them pass the test case.

@JaKasb
Copy link

JaKasb commented Nov 20, 2024

Here is a minimal implementation of PowerTransformer, using the scipy.stats methods.
Contrary to the scikit-learn variant , yeojohnson_normmax() manages to find lambda values without overflows and errors.
This snippet assumes that X is a pandas DataFrame.

The parallelization does not scale well with the number of cores.
3 Threads with backend='threading'
circa 8 Threads with backend='loky'

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, OneToOneFeatureMixin
from joblib import Parallel, delayed
from scipy.stats import yeojohnson_normmax, yeojohnson

class YeoJohnsonTransformer(BaseEstimator, TransformerMixin, OneToOneFeatureMixin):
    def fit(self, X : pd.DataFrame, y=None):
        self._n_rows, self._n_cols = X.shape
        
        jobs = []
        for col_idx in range(self._n_cols):
            col = X.iloc[:,col_idx]
            col = col.to_numpy()[np.isfinite(col)] # drop inf and nan , convert to numpy array
            jobs.append( delayed(yeojohnson_normmax)(col) )

        self._lambdas = Parallel(n_jobs=4, backend='threading', verbose=0)(jobs)
        return self

    
    def transform(self, X : pd.DataFrame):
        X_out = pd.DataFrame(data=np.empty_like(X, dtype=np.float64), index=X.index, columns=X.columns, dtype=np.float64)
        
        for col_idx, lambda_i in zip(range(self._n_cols), self._lambdas):
            col = X.iloc[:,col_idx].to_numpy()
            X_out.iloc[:,col_idx] = yeojohnson(col, lmbda=lambda_i)
        return X_out

@smarie
Copy link
Contributor

smarie commented Feb 4, 2025

Hello,
It seems that this PR has been pending for a while. However it seems almost complete.

Could you confirm that the only remaining tasks are

  • to improve the coverage
  • to update the changelog

@lesteve ?

In order to improve the coverage, we need to test against scipy 1.9. Which strategy do you suggest to perform this ? Is there a specific CI runner to configure to add scipy 1.9 ?

If needed we ( https://github.com/Projet-open-source - a student project team that I coach ) could give it a try

@lesteve
Copy link
Member

lesteve commented Feb 4, 2025

From a quick look:

We already have a CI that tests the minimal scipy version, currently 1.6 according to

SCIPY_MIN_VERSION = "1.6.0"
).

@xuefeng-xu
Copy link
Contributor Author

xuefeng-xu commented Feb 4, 2025

Hi @lesteve , a few questions just to make sure I understand correctly.

use sklearn/utils/fixes.py

Do you mean I need to move the code inside the _yeo_johnson_optimize function to the fixes.py file? Doing this also need to add the _neg_log_likelihood and _yeo_johnson_transform functions.

add a non-regression test

The overflow issue was fixed in scipy 1.12 (see scipy/scipy#18852 (comment)). Do I need to add similar test using the overflow data?

@lesteve
Copy link
Member

lesteve commented Feb 5, 2025

@xuefeng-xu sorry I should have been more explicit, my comment was for @smarie. For clarity's sake, @smarie contacted me because he has a small project working with engineering school students to introduce them to open-source contribution.

I came across this PR when I was doing some triage a few weeks ago through #30281 and thought this would be a good PR to finish. My assumption was that as many other scikit-learn PRs, too much time had passed since it was opened for the author (in this case you) to be still interested in contributing ...

As I mentioned my plan was to mention you in the changelog and add you as coauthor when the new PR is merged. Hopefully you find this acceptable 🙏.

@smarie
Copy link
Contributor

smarie commented Feb 5, 2025

Thanks @lesteve ! To be very clear : @xuefeng-xu if you have enough time to perform all items that are mentioned above in the upcoming weeks, please do so - there are plenty of other issues to contribute to in scikit-learn so we will find something else :).

However if you do not have the bandwidth in the upcoming weeks, as suggested by @lesteve we will complete the work on your behalf and of course add you as coauthor.

Let us know !

@xuefeng-xu
Copy link
Contributor Author

Thank you @lesteve @smarie ! Please feel free to open a new PR — it's wonderful to hear about the opportunity for students to get involved in open-source contributions.

@lorentzenchr
Copy link
Member

Who is now working on this issue in which PR? I would really like to get it in.

@lesteve
Copy link
Member

lesteve commented Mar 31, 2025

@lorentzenchr it looks like there was some activity last week in Projet-open-source#3.

My understanding is that @smarie is first working with students in a fork PR and that they will open a scikit-learn PR when it is closer to completion 😉. Once they open the scikit-learn PR, I guess they should probably ping you and me, because we are both interested by helping to get it in 🤞.

@smarie
Copy link
Contributor

smarie commented Apr 3, 2025

Indeed @lesteve we're almost there :) but since this is a student project we did a first pass outside of this repo so as not to pollute too much. You should hear from us very shortly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use scipy.stats.yeojohnson PowerTransformer
6 participants