Description
Describe the bug
I first encountered this error while transforming a metabolomics dataset with PowerTransformer. Before that I had imputed the dataset with the "median" strategy (using SimpleImputer), which here means setting every missing value to 1.0, because the dataset was produced so that every feature has a median of 1.0. After various troubleshooting steps I found that some inputs consistently produce a "BracketError", which, judging by the traceback below, is raised by scipy.optimize rather than numpy. It seems likely to happen when a feature contains only one distinct value. The error can go away when the number of rows or the values change. In other words, there are datasets that trigger the error every time, and a small change to them makes it disappear.
Here is some code that produces the error:
import numpy as np
from sklearn.preprocessing import PowerTransformer
data = np.array([0.9] * 400)
transformed_data = PowerTransformer().fit_transform(data.reshape(-1, 1))
If you vary the array value and length, you will find that some inputs produce the error and some do not.
E.g. an array of [1.1] * 400 does not produce the error, but [1.0] * 400 does.
E.g. data = [1] * 9 (and * 8, * 7, * 6, * 5, ...) produces the error, while data = [1] * 10 does not.
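To make it easier to check these combinations on another installation, here is a minimal sketch of a test harness. The helper name `fit_fails` is my own and not part of scikit-learn, and the outcome for each combination may well differ between scikit-learn and scipy versions, so no expected output is given.

```python
import warnings

import numpy as np
from sklearn.preprocessing import PowerTransformer


def fit_fails(value, n_rows):
    """Return True if PowerTransformer raises on a constant single column.

    Hypothetical helper for this report, not a scikit-learn function.
    """
    data = np.full((n_rows, 1), value, dtype=float)
    try:
        with warnings.catch_warnings():
            # Silence the overflow RuntimeWarnings this kind of data emits.
            warnings.simplefilter("ignore")
            PowerTransformer().fit_transform(data)
    except Exception:
        # Broad on purpose: the exception class lives in a private scipy module.
        return True
    return False


# The value/length combinations mentioned in this report.
for value, n_rows in [(0.9, 400), (1.0, 400), (1.1, 400), (1, 9), (1, 10)]:
    print(f"value={value}, n_rows={n_rows}: fails={fit_fails(value, n_rows)}")
```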
I had the impression that the error also occurred with columns containing slightly more than one unique value (2, 3, and possibly even 4 unique values, with the rest being 1.0), but I was not able to reproduce that while writing this report, and I might be mistaken. (I even wrote a function that ran thousands of random iterations with this type of data to try to reproduce it, but came up empty-handed.)
The error message does not say what is failing or why. My dataset has over 6000 rows and 900 features, and the error did not indicate which part of the data caused it. On an educated guess, I suspected it was related to using SimpleImputer with strategy="median" on features with widely varying numbers of missing values, including some with only a few non-missing values. I tested this by finding and removing the features that had 3 or fewer non-missing values before imputation. That got rid of the error, which led me to investigate further.
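Since the traceback does not name the offending feature, one way to locate it is to fit each column separately and record which ones raise. The following is only a sketch of that idea: `failing_columns` is a hypothetical helper name, and fitting column by column will be slow on a wide dataset.

```python
import warnings

import numpy as np
from sklearn.preprocessing import PowerTransformer


def failing_columns(X):
    """Map column index -> error message for columns PowerTransformer cannot fit.

    Hypothetical diagnostic helper, not part of scikit-learn.
    """
    X = np.asarray(X, dtype=float)
    failures = {}
    for j in range(X.shape[1]):
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")  # hide overflow warnings
                PowerTransformer().fit(X[:, [j]])  # keep the column 2-D
        except Exception as exc:  # broad on purpose, see the traceback below
            failures[j] = f"{type(exc).__name__}: {exc}"
    return failures


rng = np.random.default_rng(0)
X = np.column_stack([
    rng.lognormal(size=400),  # an ordinary, non-constant feature
    np.full(400, 0.9),        # the constant feature from this report
])
print(failing_columns(X))
```

On the versions listed at the end of this report I would expect column 1 to show up in the result, but that depends on the installed scikit-learn and scipy.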
If this occurs with other versions too, I suggest adding a sentence about it to the PowerTransformer documentation, e.g. "Warning: features in which all the values are the same may produce a scipy BracketError", or something of that nature. (As I said, I was not able to show that this can occur with features that have more than one unique value.)
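Until this is documented or handled, a possible workaround is to drop constant features after imputation and before the power transform. The sketch below assumes that removing such features is acceptable for the analysis. It uses VarianceThreshold with its default threshold of 0.0, and the toy data is made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Toy data (made up): one ordinary feature, and one feature that is mostly
# missing and whose few observed values are all 1.0.
ordinary = rng.lognormal(size=50)
sparse = np.full(50, np.nan)
sparse[:3] = 1.0
X = np.column_stack([ordinary, sparse])

pipeline = make_pipeline(
    SimpleImputer(strategy="median"),  # the sparse feature becomes all 1.0
    VarianceThreshold(),               # drops zero-variance features
    PowerTransformer(),
)
X_out = pipeline.fit_transform(X)
print(X_out.shape)
```

Only the ordinary feature reaches PowerTransformer here, so the constant column never gets to the optimizer.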
As a side note, using this type of data frequently produces a couple of numpy warnings:
Lib\site-packages\numpy\core\_methods.py:176: RuntimeWarning: overflow encountered in multiply
x = um.multiply(x, x, out=x)
and
Lib\site-packages\numpy\core\_methods.py:187: RuntimeWarning: overflow encountered in reduce
ret = umr_sum(x, axis, dtype, out, keepdims=keepdims, where=where)
Steps/Code to Reproduce
import numpy as np
from sklearn.preprocessing import PowerTransformer
data = np.array([0.9] * 400)
transformed_data = PowerTransformer().fit_transform(data.reshape(-1, 1))
Expected Results
No error.
Actual Results
BracketError Traceback (most recent call last)
Cell In[63], line 5
3 data = [0.9] * 400
4 df = pd.DataFrame(data)
----> 5 df['feature_transformed'] = PowerTransformer().fit_transform(df)
6 print(df)
File ~\anaconda3\envs\***\Lib\site-packages\sklearn\utils\_set_output.py:140, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
138 @wraps(f)
139 def wrapped(self, X, *args, **kwargs):
--> 140 data_to_wrap = f(self, X, *args, **kwargs)
141 if isinstance(data_to_wrap, tuple):
142 # only wrap the first output for cross decomposition
143 return (
144 _wrap_data_with_container(method, data_to_wrap[0], X, self),
145 *data_to_wrap[1:],
146 )
File ~\anaconda3\envs\***\Lib\site-packages\sklearn\preprocessing\_data.py:3103, in PowerTransformer.fit_transform(self, X, y)
3086 """Fit `PowerTransformer` to `X`, then transform `X`.
3087
3088 Parameters
(...)
3100 Transformed data.
3101 """
3102 self._validate_params()
-> 3103 return self._fit(X, y, force_transform=True)
File ~\anaconda3\envs\***\Lib\site-packages\sklearn\preprocessing\_data.py:3116, in PowerTransformer._fit(self, X, y, force_transform)
3111 optim_function = {
3112 "box-cox": self._box_cox_optimize,
3113 "yeo-johnson": self._yeo_johnson_optimize,
3114 }[self.method]
3115 with np.errstate(invalid="ignore"): # hide NaN warnings
-> 3116 self.lambdas_ = np.array([optim_function(col) for col in X.T])
3118 if self.standardize or force_transform:
3119 transform_function = {
3120 "box-cox": boxcox,
3121 "yeo-johnson": self._yeo_johnson_transform,
3122 }[self.method]
File ~\anaconda3\envs\***\Lib\site-packages\sklearn\preprocessing\_data.py:3116, in <listcomp>(.0)
3111 optim_function = {
3112 "box-cox": self._box_cox_optimize,
3113 "yeo-johnson": self._yeo_johnson_optimize,
3114 }[self.method]
3115 with np.errstate(invalid="ignore"): # hide NaN warnings
-> 3116 self.lambdas_ = np.array([optim_function(col) for col in X.T])
3118 if self.standardize or force_transform:
3119 transform_function = {
3120 "box-cox": boxcox,
3121 "yeo-johnson": self._yeo_johnson_transform,
3122 }[self.method]
File ~\anaconda3\envs\***\Lib\site-packages\sklearn\preprocessing\_data.py:3307, in PowerTransformer._yeo_johnson_optimize(self, x)
3305 x = x[~np.isnan(x)]
3306 # choosing bracket -2, 2 like for boxcox
-> 3307 return optimize.brent(_neg_log_likelihood, brack=(-2, 2))
File ~\anaconda3\envs\***\Lib\site-packages\scipy\optimize\_optimize.py:2641, in brent(func, args, brack, tol, full_output, maxiter)
2569 """
2570 Given a function of one variable and a possible bracket, return
2571 a local minimizer of the function isolated to a fractional precision
(...)
2637
2638 """
2639 options = {'xtol': tol,
2640 'maxiter': maxiter}
-> 2641 res = _minimize_scalar_brent(func, brack, args, **options)
2642 if full_output:
2643 return res['x'], res['fun'], res['nit'], res['nfev']
File ~\anaconda3\envs\***\Lib\site-packages\scipy\optimize\_optimize.py:2678, in _minimize_scalar_brent(func, brack, args, xtol, maxiter, disp, **unknown_options)
2675 brent = Brent(func=func, args=args, tol=tol,
2676 full_output=True, maxiter=maxiter, disp=disp)
2677 brent.set_bracket(brack)
-> 2678 brent.optimize()
2679 x, fval, nit, nfev = brent.get_result(full_output=True)
2681 success = nit < maxiter and not (np.isnan(x) or np.isnan(fval))
File ~\anaconda3\envs\***\Lib\site-packages\scipy\optimize\_optimize.py:2448, in Brent.optimize(self)
2445 def optimize(self):
2446 # set up for optimization
2447 func = self.func
-> 2448 xa, xb, xc, fa, fb, fc, funcalls = self.get_bracket_info()
2449 _mintol = self._mintol
2450 _cg = self._cg
File ~\anaconda3\envs\***\Lib\site-packages\scipy\optimize\_optimize.py:2417, in Brent.get_bracket_info(self)
2415 xa, xb, xc, fa, fb, fc, funcalls = bracket(func, args=args)
2416 elif len(brack) == 2:
-> 2417 xa, xb, xc, fa, fb, fc, funcalls = bracket(func, xa=brack[0],
2418 xb=brack[1], args=args)
2419 elif len(brack) == 3:
2420 xa, xb, xc = brack
File ~\anaconda3\envs\***\Lib\site-packages\scipy\optimize\_optimize.py:3047, in bracket(func, xa, xb, args, grow_limit, maxiter)
3045 e = BracketError(msg)
3046 e.data = (xa, xb, xc, fa, fb, fc, funcalls)
-> 3047 raise e
3049 return xa, xb, xc, fa, fb, fc, funcalls
BracketError: The algorithm terminated without finding a valid bracket. Consider trying different initial points.
Versions
System:
python: 3.11.4 | packaged by Anaconda, Inc. | (main, Jul 5 2023, 13:47:18) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\***\anaconda3\envs\***\python.exe
machine: Windows-10-10.0.22621-SP0
Python dependencies:
sklearn: 1.2.2
pip: 23.2.1
setuptools: 68.0.0
numpy: 1.25.2
scipy: 1.11.1
Cython: None
pandas: 2.0.3
matplotlib: 3.7.1
joblib: 1.2.0
threadpoolctl: 2.2.0
Built with OpenMP: True
threadpoolctl info:
filepath: C:\Users\***\anaconda3\envs\***\Library\bin\mkl_rt.2.dll
prefix: mkl_rt
user_api: blas
internal_api: mkl
version: 2023.1-Product
num_threads: 8
threading_layer: intel
filepath: C:\Users\***\anaconda3\envs\***\vcomp140.dll
prefix: vcomp
user_api: openmp
internal_api: openmp
version: None
num_threads: 16
filepath: C:\Users\***\anaconda3\envs\***\Library\bin\libiomp5md.dll
prefix: libiomp
user_api: openmp
internal_api: openmp
version: None
num_threads: 16