Skip to content

Feature/26308 yeo johnson2 #3

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

smarie
Copy link

@smarie smarie commented Feb 18, 2025

Reference Issues/PRs

This PR is finalizing scikit-learn#27818 to close scikit-learn#26308 and also fix the warning issue reported in scikit-learn#23319 (comment)

What does this implement/fix? Explain your changes.

Use scipy.stats.yeojohnson instead of our own implementation as @lorentzenchr suggested.

Any other comments?

Copy link

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


black

black detected issues. Please run black . locally and push the changes. Here you can see the detected issues. Note that running black might also fix some of the issues which might be detected by ruff. Note that the installed black version is black=24.3.0.


--- /home/runner/work/scikit-learn/scikit-learn/sklearn/preprocessing/_data.py	2025-02-18 15:24:07.940936+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/preprocessing/_data.py	2025-02-18 15:24:28.634307+00:00
@@ -3524,14 +3524,14 @@
 
             return -loglike
 
         # the computation of lambda is influenced by NaNs so we need to
         # get rid of them
-        #x = x[~np.isnan(x)]
+        # x = x[~np.isnan(x)]
         # choosing bracket -2, 2 like for boxcox
-        #return optimize.brent(_neg_log_likelihood, brack=(-2, 2))
-        _yeojohnson_lambda( _neg_log_likelihood, x)
+        # return optimize.brent(_neg_log_likelihood, brack=(-2, 2))
+        _yeojohnson_lambda(_neg_log_likelihood, x)
 
     def _check_input(self, X, in_fit, check_positive=False, check_shape=False):
         """Validate the input before fit and transform.
 
         Parameters
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/preprocessing/_data.py
--- /home/runner/work/scikit-learn/scikit-learn/sklearn/preprocessing/tests/test_data.py	2025-02-18 15:24:07.941936+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/preprocessing/tests/test_data.py	2025-02-18 15:24:30.217980+00:00
@@ -2618,10 +2618,11 @@
         if standardize:
             assert_allclose(Xt_, np.zeros_like(X))
         else:
             assert_allclose(Xt_, X)
 
+
 def test_yeojohnson_for_different_scipy_version():
-     """Check the result are the same for different scipy version"""
-     # https://github.com/scikit-learn/scikit-learn/pull/27818
-     pt = PowerTransformer(method="yeo-johnson").fit(X_1col)
-     assert_almost_equal(pt.lambdas_[0], 0.99999353)
+    """Check the result are the same for different scipy version"""
+    # https://github.com/scikit-learn/scikit-learn/pull/27818
+    pt = PowerTransformer(method="yeo-johnson").fit(X_1col)
+    assert_almost_equal(pt.lambdas_[0], 0.99999353)
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/preprocessing/tests/test_data.py
--- /home/runner/work/scikit-learn/scikit-learn/sklearn/utils/fixes.py	2025-02-18 15:24:07.954936+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/utils/fixes.py	2025-02-18 15:24:31.532372+00:00
@@ -84,12 +84,12 @@
 
 def _object_dtype_isnan(X):
     return X != X
 
 
-#TODO:  remove this when minimum version of scipy >= 1.9.0
-def _yeojohnson_lambda( _neg_log_likelihood, x):
+# TODO:  remove this when minimum version of scipy >= 1.9.0
+def _yeojohnson_lambda(_neg_log_likelihood, x):
 
     x = x[~np.isnan(x)]
     scipy_version = parse_version(scipy.__version__)
     min_scipy_version = "1.9.0"
     if scipy_version < parse_version(min_scipy_version):
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/utils/fixes.py

Oh no! 💥 💔 💥
3 files would be reformatted, 918 files would be left unchanged.

ruff

ruff detected issues. Please run ruff check --fix --output-format=full . locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.5.1.


sklearn/preprocessing/_data.py:9:19: F401 [*] `scipy.optimize` imported but unused
   |
 8 | import numpy as np
 9 | from scipy import optimize, sparse, stats
   |                   ^^^^^^^^ F401
10 | from scipy.special import boxcox, inv_boxcox
   |
   = help: Remove unused import: `scipy.optimize`

sklearn/utils/fixes.py:10:1: I001 [*] Import block is un-sorted or un-formatted
   |
 8 |   # SPDX-License-Identifier: BSD-3-Clause
 9 |   
10 | / import platform
11 | | import struct
12 | | import sys
13 | | 
14 | | import numpy as np
15 | | import scipy
16 | | from scipy import optimize
17 | | from scipy import stats
18 | | import scipy.sparse.linalg
19 | | import scipy.stats
20 | | 
21 | | try:
   | |_^ I001
22 |       import pandas as pd
23 |   except ImportError:
   |
   = help: Organize imports

Found 2 errors.
[*] 2 fixable with the `--fix` option.

Generated for commit: 3013d88. Link to the linter CI: here

@@ -84,6 +86,20 @@ def _object_dtype_isnan(X):
return X != X


#TODO: remove this when minimum version of scipy >= 1.9.0
def _yeojohnson_lambda( _neg_log_likelihood, x):

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add docstring here

#TODO: remove this when minimum version of scipy >= 1.9.0
def _yeojohnson_lambda( _neg_log_likelihood, x):

x = x[~np.isnan(x)]
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this line move or not ? (check wrt scipy implementation to get the answer)

@@ -3525,9 +3526,10 @@ def _neg_log_likelihood(lmbda):

# the computation of lambda is influenced by NaNs so we need to
# get rid of them
x = x[~np.isnan(x)]
#x = x[~np.isnan(x)]
Copy link
Author

@smarie smarie Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather leave this here and remove it from the other side

Suggested change
#x = x[~np.isnan(x)]
x = x[~np.isnan(x)]

# choosing bracket -2, 2 like for boxcox
return optimize.brent(_neg_log_likelihood, brack=(-2, 2))
#return optimize.brent(_neg_log_likelihood, brack=(-2, 2))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this comment and the above comment as they should be in the new function now

@@ -2619,3 +2619,9 @@ def test_power_transformer_constant_feature(standardize):
assert_allclose(Xt_, np.zeros_like(X))
else:
assert_allclose(Xt_, X)

Copy link
Author

@smarie smarie Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use scipy.stats.yeojohnson PowerTransformer
1 participant