-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] ignore NaNs in PowerTransformer #11306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] ignore NaNs in PowerTransformer #11306
Conversation
The diff will be better once #11206 will be merged. |
My concern with this PR is this line: It seems that the computation of the lambdas is not robust to nan values. Therefore, we need to slice the column to find the lambdas and then compute the using boxcox (which ignore nan) |
Why is that concerning? It's got roughly the same complexity as |
@ogrisel @jorisvandenbossche @jeremiedbb @lesteve @rth @amueller @qinhanmin2014 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides the following small comment, the PowerTransformer
specific part of this PR LGTM.
sklearn/preprocessing/data.py
Outdated
# get rid of them to compute them. | ||
_, lmbda = stats.boxcox(col[~np.isnan(col)], lmbda=None) | ||
# FIXME: stats.boxcox should be changed by special.boxcox which | ||
# handles NaN and does not raise warnings. Check SciPy 0.14.X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use special.boxcox
if the installed scipy is recent enough? This will make it easier to drop the backward compat code once we do not support scipy < 0.14.
Ideally, the boxcox
backward compat function that ignores the nan warnings should be extracted in sklearn.utils.fixes
to make it easy to clean up the backward compat code in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may not be so easy to backport special.boxcox... It's something like:
from libc.math cimport log, log1p, expm1, exp, fabs
cimport numpy as np
cdef inline double _func_boxcox(double x, double lmbda) nogil:
# if lmbda << 1 and log(x) < 1.0, the lmbda*log(x) product can lose
# precision, furthermore, expm1(x) == x for x < eps.
# For doubles, the range of log is -744.44 to +709.78, with eps being
# the smallest value produced. This range means that we will have
# abs(lmbda)*log(x) < eps whenever abs(lmbda) <= eps/-log(min double)
# which is ~2.98e-19.
if fabs(lmbda) < 1e-19:
return log(x)
else:
return expm1(lmbda * log(x)) / lmbda
cdef void loop_d_dd__As_dd_d(char **args, np.npy_intp *dims, np.npy_intp *steps, void *data) nogil:
cdef np.npy_intp i, n = dims[0]
cdef void *func = (<void**>data)[0]
cdef char *func_name = <char*>(<void**>data)[1]
cdef char *ip0 = args[0]
cdef char *ip1 = args[1]
cdef char *op0 = args[2]
cdef double ov0
for i in range(n):
ov0 = (<double(*)(double, double) nogil>func)(<double>(<double*>ip0)[0], <double>(<double*>ip1)[0])
(<double *>op0)[0] = <double>ov0
ip0 += steps[0]
ip1 += steps[1]
op0 += steps[2]
sf_error.check_fpe(func_name)
cdef void loop_d_dd__As_ff_f(char **args, np.npy_intp *dims, np.npy_intp *steps, void *data) nogil:
cdef np.npy_intp i, n = dims[0]
cdef void *func = (<void**>data)[0]
cdef char *func_name = <char*>(<void**>data)[1]
cdef char *ip0 = args[0]
cdef char *ip1 = args[1]
cdef char *op0 = args[2]
cdef double ov0
for i in range(n):
ov0 = (<double(*)(double, double) nogil>func)(<double>(<float*>ip0)[0], <double>(<float*>ip1)[0])
(<float *>op0)[0] = <float>ov0
ip0 += steps[0]
ip1 += steps[1]
op0 += steps[2]
sf_error.check_fpe(func_name)
cdef np.PyUFuncGenericFunction ufunc_boxcox_loops[2]
cdef void *ufunc_boxcox_ptr[4]
cdef void *ufunc_boxcox_data[2]
cdef char ufunc_boxcox_types[6]
cdef char *ufunc_boxcox_doc = (
"boxcox(x, lmbda)\n"
"\n"
"Compute the Box-Cox transformation.\n"
"\n"
"The Box-Cox transformation is::\n"
"\n"
" y = (x**lmbda - 1) / lmbda if lmbda != 0\n"
" log(x) if lmbda == 0\n"
"\n"
"Returns `nan` if ``x < 0``.\n"
"Returns `-inf` if ``x == 0`` and ``lmbda < 0``.\n"
"\n"
"Parameters\n"
"----------\n"
"x : array_like\n"
" Data to be transformed.\n"
"lmbda : array_like\n"
" Power parameter of the Box-Cox transform.\n"
"\n"
"Returns\n"
"-------\n"
"y : array\n"
" Transformed data.\n"
"\n"
"Notes\n"
"-----\n"
"\n"
".. versionadded:: 0.14.0\n"
"\n"
"Examples\n"
"--------\n"
">>> from scipy.special import boxcox\n"
">>> boxcox([1, 4, 10], 2.5)\n"
"array([ 0. , 12.4 , 126.09110641])\n"
">>> boxcox(2, [0, 1, 2])\n"
"array([ 0.69314718, 1. , 1.5 ])")
ufunc_boxcox_loops[0] = <np.PyUFuncGenericFunction>loop_d_dd__As_ff_f
ufunc_boxcox_loops[1] = <np.PyUFuncGenericFunction>loop_d_dd__As_dd_d
ufunc_boxcox_types[0] = <char>NPY_FLOAT
ufunc_boxcox_types[1] = <char>NPY_FLOAT
ufunc_boxcox_types[2] = <char>NPY_FLOAT
ufunc_boxcox_types[3] = <char>NPY_DOUBLE
ufunc_boxcox_types[4] = <char>NPY_DOUBLE
ufunc_boxcox_types[5] = <char>NPY_DOUBLE
ufunc_boxcox_ptr[2*0] = <void*>_func_boxcox
ufunc_boxcox_ptr[2*0+1] = <void*>(<char*>"boxcox")
ufunc_boxcox_ptr[2*1] = <void*>_func_boxcox
ufunc_boxcox_ptr[2*1+1] = <void*>(<char*>"boxcox")
ufunc_boxcox_data[0] = &ufunc_boxcox_ptr[2*0]
ufunc_boxcox_data[1] = &ufunc_boxcox_ptr[2*1]
boxcox = np.PyUFunc_FromFuncAndData(ufunc_boxcox_loops, ufunc_boxcox_data, ufunc_boxcox_types, 2, 2, 1, 0, "boxcox", ufunc_boxcox_doc, 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I've now realised you probably aren't suggesting to backport special.boxcox, merely to move these few lines to fixes
. Yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes :)
This pull request introduces 2 alerts when merging 08960ec into e555893 - view on LGTM.com new alerts:
Comment posted by LGTM.com |
This pull request introduces 2 alerts when merging e626a39 into e555893 - view on LGTM.com new alerts:
Comment posted by LGTM.com |
Reference Issues/PRs
Towards #10404.
Should be merged after:
What does this implement/fix? Explain your changes.
Any other comments?