[MRG] ignore NaNs in PowerTransformer #11306

glemaitre · 2018-06-17T21:56:46Z

Reference Issues/PRs

Towards #10404.

Should be merged after:

[MRG] ENH: Ignore NaNs in StandardScaler and scale #11206 (depends on the StandardScaler)
[MRG] Ignore and pass-through NaN values in MaxAbsScaler and maxabs_scale #11011 (better test)

What does this implement/fix? Explain your changes.

Any other comments?

glemaitre · 2018-06-17T22:04:01Z

The diff will be better once #11206 will be merged.

glemaitre · 2018-06-17T22:05:19Z

My concern with this PR is this line:

https://github.com/scikit-learn/scikit-learn/pull/11306/files#diff-5ebddebc20987b6125fffc893f5abc4cR2785

It seems that the computation of the lambdas is not robust to nan values. Therefore, we need to slice the column to find the lambdas and then compute the using boxcox (which ignore nan)

jnothman · 2018-06-17T23:28:09Z

It seems that the computation of the lambdas is not robust to nan values. Therefore, we need to slice the column to find the lambdas and then compute the using boxcox (which ignore nan)

Why is that concerning? It's got roughly the same complexity as np.nan*... the only improvement I can see is to not make a data copy if there are no NaNs.

glemaitre · 2018-06-18T12:05:10Z

@ogrisel @jorisvandenbossche @jeremiedbb @lesteve @rth @amueller @qinhanmin2014
I would appreciate reviews.

ogrisel

Besides the following small comment, the PowerTransformer specific part of this PR LGTM.

ogrisel · 2018-06-21T12:29:20Z

sklearn/preprocessing/data.py

+            # get rid of them to compute them.
+            _, lmbda = stats.boxcox(col[~np.isnan(col)], lmbda=None)
+            # FIXME: stats.boxcox should be changed by special.boxcox which
+            # handles NaN and does not raise warnings. Check SciPy 0.14.X


Why not use special.boxcox if the installed scipy is recent enough? This will make it easier to drop the backward compat code once we do not support scipy < 0.14.

Ideally, the boxcox backward compat function that ignores the nan warnings should be extracted in sklearn.utils.fixes to make it easy to clean up the backward compat code in the future.

It may not be so easy to backport special.boxcox... It's something like:

from libc.math cimport log, log1p, expm1, exp, fabs cimport numpy as np cdef inline double _func_boxcox(double x, double lmbda) nogil: # if lmbda << 1 and log(x) < 1.0, the lmbda*log(x) product can lose # precision, furthermore, expm1(x) == x for x < eps. # For doubles, the range of log is -744.44 to +709.78, with eps being # the smallest value produced. This range means that we will have # abs(lmbda)*log(x) < eps whenever abs(lmbda) <= eps/-log(min double) # which is ~2.98e-19. if fabs(lmbda) < 1e-19: return log(x) else: return expm1(lmbda * log(x)) / lmbda cdef void loop_d_dd__As_dd_d(char **args, np.npy_intp *dims, np.npy_intp *steps, void *data) nogil: cdef np.npy_intp i, n = dims[0] cdef void *func = (<void**>data)[0] cdef char *func_name = <char*>(<void**>data)[1] cdef char *ip0 = args[0] cdef char *ip1 = args[1] cdef char *op0 = args[2] cdef double ov0 for i in range(n): ov0 = (<double(*)(double, double) nogil>func)(<double>(<double*>ip0)[0], <double>(<double*>ip1)[0]) (<double *>op0)[0] = <double>ov0 ip0 += steps[0] ip1 += steps[1] op0 += steps[2] sf_error.check_fpe(func_name) cdef void loop_d_dd__As_ff_f(char **args, np.npy_intp *dims, np.npy_intp *steps, void *data) nogil: cdef np.npy_intp i, n = dims[0] cdef void *func = (<void**>data)[0] cdef char *func_name = <char*>(<void**>data)[1] cdef char *ip0 = args[0] cdef char *ip1 = args[1] cdef char *op0 = args[2] cdef double ov0 for i in range(n): ov0 = (<double(*)(double, double) nogil>func)(<double>(<float*>ip0)[0], <double>(<float*>ip1)[0]) (<float *>op0)[0] = <float>ov0 ip0 += steps[0] ip1 += steps[1] op0 += steps[2] sf_error.check_fpe(func_name) cdef np.PyUFuncGenericFunction ufunc_boxcox_loops[2] cdef void *ufunc_boxcox_ptr[4] cdef void *ufunc_boxcox_data[2] cdef char ufunc_boxcox_types[6] cdef char *ufunc_boxcox_doc = ( "boxcox(x, lmbda)\n" "\n" "Compute the Box-Cox transformation.\n" "\n" "The Box-Cox transformation is::\n" "\n" " y = (x**lmbda - 1) / lmbda if lmbda != 0\n" " log(x) if lmbda == 0\n" "\n" "Returns `nan` if ``x < 0``.\n" "Returns `-inf` if ``x == 0`` and ``lmbda < 0``.\n" "\n" "Parameters\n" "----------\n" "x : array_like\n" " Data to be transformed.\n" "lmbda : array_like\n" " Power parameter of the Box-Cox transform.\n" "\n" "Returns\n" "-------\n" "y : array\n" " Transformed data.\n" "\n" "Notes\n" "-----\n" "\n" ".. versionadded:: 0.14.0\n" "\n" "Examples\n" "--------\n" ">>> from scipy.special import boxcox\n" ">>> boxcox([1, 4, 10], 2.5)\n" "array([ 0. , 12.4 , 126.09110641])\n" ">>> boxcox(2, [0, 1, 2])\n" "array([ 0.69314718, 1. , 1.5 ])") ufunc_boxcox_loops[0] = <np.PyUFuncGenericFunction>loop_d_dd__As_ff_f ufunc_boxcox_loops[1] = <np.PyUFuncGenericFunction>loop_d_dd__As_dd_d ufunc_boxcox_types[0] = <char>NPY_FLOAT ufunc_boxcox_types[1] = <char>NPY_FLOAT ufunc_boxcox_types[2] = <char>NPY_FLOAT ufunc_boxcox_types[3] = <char>NPY_DOUBLE ufunc_boxcox_types[4] = <char>NPY_DOUBLE ufunc_boxcox_types[5] = <char>NPY_DOUBLE ufunc_boxcox_ptr[2*0] = <void*>_func_boxcox ufunc_boxcox_ptr[2*0+1] = <void*>(<char*>"boxcox") ufunc_boxcox_ptr[2*1] = <void*>_func_boxcox ufunc_boxcox_ptr[2*1+1] = <void*>(<char*>"boxcox") ufunc_boxcox_data[0] = &ufunc_boxcox_ptr[2*0] ufunc_boxcox_data[1] = &ufunc_boxcox_ptr[2*1] boxcox = np.PyUFunc_FromFuncAndData(ufunc_boxcox_loops, ufunc_boxcox_data, ufunc_boxcox_types, 2, 2, 1, 0, "boxcox", ufunc_boxcox_doc, 0)

But I've now realised you probably aren't suggesting to backport special.boxcox, merely to move these few lines to fixes. Yes.

sklearn-lgtm · 2018-06-21T16:12:58Z

This pull request introduces 2 alerts when merging 08960ec into e555893 - view on LGTM.com

new alerts:

1 for Unused import
1 for Module is imported with 'import' and 'import from'

Comment posted by LGTM.com

sklearn-lgtm · 2018-06-21T21:17:23Z

This pull request introduces 2 alerts when merging e626a39 into e555893 - view on LGTM.com

new alerts:

1 for Unused import
1 for Module is imported with 'import' and 'import from'

Comment posted by LGTM.com

glemaitre added 28 commits June 5, 2018 18:07

EHN ignore NaN when incrementing mean and var

5dba57a

EHN ignore NaNs in StandardScaler for dense case

591c18f

EHN should handle sparse case

43fa54b

TST launch common tests

faab12a

FIX use loop

fa12fe7

FIX number of samples first iteration

4d60bfe

FIX use proper sparse constructor

eba3087

BUG wrong index and remove nan counter

1782e69

revert

0b074ce

cleanup

a6cb76d

FIX use sum on the boolean array

ab2c465

TST equivalance function and class

76691a9

Merge remote-tracking branch 'origin/master' into nan_standardscaler

d5ece66

Merge remote-tracking branch 'origin/master' into nan_standardscaler

504323c

backward compatibility for n_samples_seen_

8d13da1

spelling

124742b

TST revert some test for back compatibility

2ffe497

TST check NaN are ignored in incremental_mean_and_variance

424fdba

TST check NaNs ignore in incr_mean_variance

0fe8a3e

OPTIM cython variable typing

f267a35

DOC corrections

4785fb2

DOC mentioned that NaNs are ignored in Notes

082633d

TST shape of n_samples_seen with missing values

d79b867

DOC fix spelling

cb077ea

DOC whats new entry

449d24a

tmp

48d70b0

ignore nan by slicing

43cbe9a

iter

4019c3b

TST mark as allowing nan

9cbb7e8

jnothman mentioned this pull request Jun 17, 2018

Disregard NaNs in preprocessing #10404

Closed

7 tasks

glemaitre added 2 commits June 18, 2018 11:26

COMPAT scipy lower than 0.14

45f4dfd

Update data.py

0072288

glemaitre changed the title ~~[WIP] ignore NaNs in PowerTransformer~~ [MRG] ignore NaNs in PowerTransformer Jun 18, 2018

glemaitre added this to the 0.20 milestone Jun 18, 2018

ogrisel approved these changes Jun 21, 2018

View reviewed changes

glemaitre added 2 commits June 21, 2018 17:29

Merge remote-tracking branch 'origin/master' into nan_powertransformer

3b29485

EHN: add boxcox in fixes

08960ec

PEP8

e626a39

jnothman approved these changes Jun 21, 2018

View reviewed changes

jnothman merged commit c1bc665 into scikit-learn:master Jun 21, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] ignore NaNs in PowerTransformer #11306

[MRG] ignore NaNs in PowerTransformer #11306

glemaitre commented Jun 17, 2018 •

edited

Loading

glemaitre commented Jun 17, 2018

glemaitre commented Jun 17, 2018

jnothman commented Jun 17, 2018

glemaitre commented Jun 18, 2018

ogrisel left a comment

ogrisel Jun 21, 2018

glemaitre Jun 21, 2018

jnothman Jun 21, 2018 •

edited

Loading

jnothman Jun 21, 2018

ogrisel Jun 21, 2018

sklearn-lgtm commented Jun 21, 2018

sklearn-lgtm commented Jun 21, 2018

[MRG] ignore NaNs in PowerTransformer #11306

[MRG] ignore NaNs in PowerTransformer #11306

Conversation

glemaitre commented Jun 17, 2018 • edited Loading

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

glemaitre commented Jun 17, 2018

glemaitre commented Jun 17, 2018

jnothman commented Jun 17, 2018

glemaitre commented Jun 18, 2018

ogrisel left a comment

Choose a reason for hiding this comment

ogrisel Jun 21, 2018

Choose a reason for hiding this comment

glemaitre Jun 21, 2018

Choose a reason for hiding this comment

jnothman Jun 21, 2018 • edited Loading

Choose a reason for hiding this comment

jnothman Jun 21, 2018

Choose a reason for hiding this comment

ogrisel Jun 21, 2018

Choose a reason for hiding this comment

sklearn-lgtm commented Jun 21, 2018

sklearn-lgtm commented Jun 21, 2018

glemaitre commented Jun 17, 2018 •

edited

Loading

jnothman Jun 21, 2018 •

edited

Loading