-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH Use scipy.stats.yeojohnson in PowerTransformer #27818
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This build failure was solved in scipy/scipy#15998, requires scipy>=1.9.0 |
OK, so I would probably wait that we bump the minimum version to scipy 1.9 in this case. Can you report the failure and your finding in the original issue? |
@xuefeng-xu For the time being, you can make the new code path conditional on scipy>=1.9 with a todo note to remove the old code once our minimum scipy is >= 1.9. |
@lorentzenchr @glemaitre I have updated the code, now it's ready for review. |
Maybe, there’s a misunderstanding. Could make the call to scipy.yeojohnson conditional on the installes scipy version >= 1.9 and call the old code path otherwise. |
@lorentzenchr Do you mean I should keep the Meanwhile, can I use this test below, where def test_yeojohnson_for_different_scipy_version():
pt = PowerTransformer(method="yeo-johnson").fit(X_1col)
assert_almost_equal(pt.lambdas_[0], 0.99546157) |
Yes because we still support older scipy versions. |
I tested two scipy versions (1.8.0 and 1.11.1), both of them pass the test case. |
Here is a minimal implementation of PowerTransformer, using the scipy.stats methods. The parallelization does not scale well with the number of cores. import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, OneToOneFeatureMixin
from joblib import Parallel, delayed
from scipy.stats import yeojohnson_normmax, yeojohnson
class YeoJohnsonTransformer(BaseEstimator, TransformerMixin, OneToOneFeatureMixin):
def fit(self, X : pd.DataFrame, y=None):
self._n_rows, self._n_cols = X.shape
jobs = []
for col_idx in range(self._n_cols):
col = X.iloc[:,col_idx]
col = col.to_numpy()[np.isfinite(col)] # drop inf and nan , convert to numpy array
jobs.append( delayed(yeojohnson_normmax)(col) )
self._lambdas = Parallel(n_jobs=4, backend='threading', verbose=0)(jobs)
return self
def transform(self, X : pd.DataFrame):
X_out = pd.DataFrame(data=np.empty_like(X, dtype=np.float64), index=X.index, columns=X.columns, dtype=np.float64)
for col_idx, lambda_i in zip(range(self._n_cols), self._lambdas):
col = X.iloc[:,col_idx].to_numpy()
X_out.iloc[:,col_idx] = yeojohnson(col, lmbda=lambda_i)
return X_out |
Hello, Could you confirm that the only remaining tasks are
@lesteve ? In order to improve the coverage, we need to test against scipy 1.9. Which strategy do you suggest to perform this ? Is there a specific CI runner to configure to add scipy 1.9 ? If needed we ( https://github.com/Projet-open-source - a student project team that I coach ) could give it a try |
From a quick look:
We already have a CI that tests the minimal scipy version, currently 1.6 according to scikit-learn/sklearn/_min_dependencies.py Line 11 in 9e78dca
|
Hi @lesteve , a few questions just to make sure I understand correctly.
Do you mean I need to move the code inside the
The overflow issue was fixed in scipy 1.12 (see scipy/scipy#18852 (comment)). Do I need to add similar test using the overflow data? |
@xuefeng-xu sorry I should have been more explicit, my comment was for @smarie. For clarity's sake, @smarie contacted me because he has a small project working with engineering school students to introduce them to open-source contribution. I came across this PR when I was doing some triage a few weeks ago through #30281 and thought this would be a good PR to finish. My assumption was that as many other scikit-learn PRs, too much time had passed since it was opened for the author (in this case you) to be still interested in contributing ... As I mentioned my plan was to mention you in the changelog and add you as coauthor when the new PR is merged. Hopefully you find this acceptable 🙏. |
Thanks @lesteve ! To be very clear : @xuefeng-xu if you have enough time to perform all items that are mentioned above in the upcoming weeks, please do so - there are plenty of other issues to contribute to in scikit-learn so we will find something else :). However if you do not have the bandwidth in the upcoming weeks, as suggested by @lesteve we will complete the work on your behalf and of course add you as coauthor. Let us know ! |
Who is now working on this issue in which PR? I would really like to get it in. |
@lorentzenchr it looks like there was some activity last week in Projet-open-source#3. My understanding is that @smarie is first working with students in a fork PR and that they will open a scikit-learn PR when it is closer to completion 😉. Once they open the scikit-learn PR, I guess they should probably ping you and me, because we are both interested by helping to get it in 🤞. |
Indeed @lesteve we're almost there :) but since this is a student project we did a first pass outside of this repo so as not to pollute too much. You should hear from us very shortly |
Reference Issues/PRs
Closes #26308
What does this implement/fix? Explain your changes.
Use scipy.stats.yeojohnson instead of our own implementation as @lorentzenchr suggested.
Any other comments?