Skip to content

[MRG] ignore NaNs in PowerTransformer #11306

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 34 commits into from
Jun 21, 2018

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented Jun 17, 2018

Reference Issues/PRs

Towards #10404.

Should be merged after:

What does this implement/fix? Explain your changes.

Any other comments?

@glemaitre
Copy link
Member Author

The diff will be better once #11206 will be merged.

@glemaitre
Copy link
Member Author

My concern with this PR is this line:

https://github.com/scikit-learn/scikit-learn/pull/11306/files#diff-5ebddebc20987b6125fffc893f5abc4cR2785

It seems that the computation of the lambdas is not robust to nan values. Therefore, we need to slice the column to find the lambdas and then compute the using boxcox (which ignore nan)

@jnothman jnothman mentioned this pull request Jun 17, 2018
7 tasks
@jnothman
Copy link
Member

It seems that the computation of the lambdas is not robust to nan values. Therefore, we need to slice the column to find the lambdas and then compute the using boxcox (which ignore nan)

Why is that concerning? It's got roughly the same complexity as np.nan*... the only improvement I can see is to not make a data copy if there are no NaNs.

@glemaitre glemaitre changed the title [WIP] ignore NaNs in PowerTransformer [MRG] ignore NaNs in PowerTransformer Jun 18, 2018
@glemaitre glemaitre added this to the 0.20 milestone Jun 18, 2018
@glemaitre
Copy link
Member Author

@ogrisel @jorisvandenbossche @jeremiedbb @lesteve @rth @amueller @qinhanmin2014
I would appreciate reviews.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the following small comment, the PowerTransformer specific part of this PR LGTM.

# get rid of them to compute them.
_, lmbda = stats.boxcox(col[~np.isnan(col)], lmbda=None)
# FIXME: stats.boxcox should be changed by special.boxcox which
# handles NaN and does not raise warnings. Check SciPy 0.14.X
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use special.boxcox if the installed scipy is recent enough? This will make it easier to drop the backward compat code once we do not support scipy < 0.14.

Ideally, the boxcox backward compat function that ignores the nan warnings should be extracted in sklearn.utils.fixes to make it easy to clean up the backward compat code in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point

Copy link
Member

@jnothman jnothman Jun 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not be so easy to backport special.boxcox... It's something like:

from libc.math cimport log, log1p, expm1, exp, fabs
cimport numpy as np



cdef inline double _func_boxcox(double x, double lmbda) nogil:
    # if lmbda << 1 and log(x) < 1.0, the lmbda*log(x) product can lose
    # precision, furthermore, expm1(x) == x for x < eps.
    # For doubles, the range of log is -744.44 to +709.78, with eps being
    # the smallest value produced.  This range means that we will have
    # abs(lmbda)*log(x) < eps whenever abs(lmbda) <= eps/-log(min double)
    # which is ~2.98e-19.  
    if fabs(lmbda) < 1e-19:
        return log(x)
    else:
        return expm1(lmbda * log(x)) / lmbda


cdef void loop_d_dd__As_dd_d(char **args, np.npy_intp *dims, np.npy_intp *steps, void *data) nogil:
    cdef np.npy_intp i, n = dims[0]
    cdef void *func = (<void**>data)[0]
    cdef char *func_name = <char*>(<void**>data)[1]
    cdef char *ip0 = args[0]
    cdef char *ip1 = args[1]
    cdef char *op0 = args[2]
    cdef double ov0
    for i in range(n):
        ov0 = (<double(*)(double, double) nogil>func)(<double>(<double*>ip0)[0], <double>(<double*>ip1)[0])
        (<double *>op0)[0] = <double>ov0
        ip0 += steps[0]
        ip1 += steps[1]
        op0 += steps[2]
    sf_error.check_fpe(func_name)

cdef void loop_d_dd__As_ff_f(char **args, np.npy_intp *dims, np.npy_intp *steps, void *data) nogil:
    cdef np.npy_intp i, n = dims[0]
    cdef void *func = (<void**>data)[0]
    cdef char *func_name = <char*>(<void**>data)[1]
    cdef char *ip0 = args[0]
    cdef char *ip1 = args[1]
    cdef char *op0 = args[2]
    cdef double ov0
    for i in range(n):
        ov0 = (<double(*)(double, double) nogil>func)(<double>(<float*>ip0)[0], <double>(<float*>ip1)[0])
        (<float *>op0)[0] = <float>ov0
        ip0 += steps[0]
        ip1 += steps[1]
        op0 += steps[2]
    sf_error.check_fpe(func_name)


cdef np.PyUFuncGenericFunction ufunc_boxcox_loops[2]
cdef void *ufunc_boxcox_ptr[4]
cdef void *ufunc_boxcox_data[2]
cdef char ufunc_boxcox_types[6]
cdef char *ufunc_boxcox_doc = (
    "boxcox(x, lmbda)\n"
    "\n"
    "Compute the Box-Cox transformation.\n"
    "\n"
    "The Box-Cox transformation is::\n"
    "\n"
    "    y = (x**lmbda - 1) / lmbda  if lmbda != 0\n"
    "        log(x)                  if lmbda == 0\n"
    "\n"
    "Returns `nan` if ``x < 0``.\n"
    "Returns `-inf` if ``x == 0`` and ``lmbda < 0``.\n"
    "\n"
    "Parameters\n"
    "----------\n"
    "x : array_like\n"
    "    Data to be transformed.\n"
    "lmbda : array_like\n"
    "    Power parameter of the Box-Cox transform.\n"
    "\n"
    "Returns\n"
    "-------\n"
    "y : array\n"
    "    Transformed data.\n"
    "\n"
    "Notes\n"
    "-----\n"
    "\n"
    ".. versionadded:: 0.14.0\n"
    "\n"
    "Examples\n"
    "--------\n"
    ">>> from scipy.special import boxcox\n"
    ">>> boxcox([1, 4, 10], 2.5)\n"
    "array([   0.        ,   12.4       ,  126.09110641])\n"
    ">>> boxcox(2, [0, 1, 2])\n"
    "array([ 0.69314718,  1.        ,  1.5       ])")
ufunc_boxcox_loops[0] = <np.PyUFuncGenericFunction>loop_d_dd__As_ff_f
ufunc_boxcox_loops[1] = <np.PyUFuncGenericFunction>loop_d_dd__As_dd_d
ufunc_boxcox_types[0] = <char>NPY_FLOAT
ufunc_boxcox_types[1] = <char>NPY_FLOAT
ufunc_boxcox_types[2] = <char>NPY_FLOAT
ufunc_boxcox_types[3] = <char>NPY_DOUBLE
ufunc_boxcox_types[4] = <char>NPY_DOUBLE
ufunc_boxcox_types[5] = <char>NPY_DOUBLE
ufunc_boxcox_ptr[2*0] = <void*>_func_boxcox
ufunc_boxcox_ptr[2*0+1] = <void*>(<char*>"boxcox")
ufunc_boxcox_ptr[2*1] = <void*>_func_boxcox
ufunc_boxcox_ptr[2*1+1] = <void*>(<char*>"boxcox")
ufunc_boxcox_data[0] = &ufunc_boxcox_ptr[2*0]
ufunc_boxcox_data[1] = &ufunc_boxcox_ptr[2*1]
boxcox = np.PyUFunc_FromFuncAndData(ufunc_boxcox_loops, ufunc_boxcox_data, ufunc_boxcox_types, 2, 2, 1, 0, "boxcox", ufunc_boxcox_doc, 0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I've now realised you probably aren't suggesting to backport special.boxcox, merely to move these few lines to fixes. Yes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes :)

@sklearn-lgtm
Copy link

This pull request introduces 2 alerts when merging 08960ec into e555893 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Module is imported with 'import' and 'import from'

Comment posted by LGTM.com

@sklearn-lgtm
Copy link

This pull request introduces 2 alerts when merging e626a39 into e555893 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Module is imported with 'import' and 'import from'

Comment posted by LGTM.com

@jnothman jnothman merged commit c1bc665 into scikit-learn:master Jun 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants