[MRG] Ignore and pass-through NaNs in RobustScaler and robust_scale #11308
Conversation
Force-pushed from 37f7cb5 to b1c5a21
@ogrisel @jorisvandenbossche @jeremiedbb @lesteve @rth @amueller @qinhanmin2014
Your sparse adaptation doesn't work. The qth quantile of non-zero values is not the qth quantile of values including (majority) zeros.
Oh, so I should materialize the zeros.
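The point above can be seen with a small sketch (a hypothetical example, not code from the PR): the quantiles of the stored values alone differ from the quantiles of the full column once the implicit zeros are counted.

```python
import numpy as np
from scipy import sparse

# Hypothetical illustration: one sparse column, 10% density, positive values.
rng = np.random.RandomState(0)
X = sparse.random(100, 1, density=0.1, random_state=rng, format='csc')

col = X.toarray().ravel()  # zeros materialized: all 100 entries
nnz = X.data               # only the 10 stored non-zero values

# Over all 100 entries the 25th percentile is 0 (zeros dominate),
# while over the stored values alone it is strictly positive.
q_dense = np.percentile(col, 25)
q_nnz = np.percentile(nnz, 25)
```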
I think just leave it as not handling the sparse case. The IQR is only likely to be > 0 once density > 20%. So for many sparse matrices, quantiles are not relevant.
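The density threshold can be checked with a quick sketch (hypothetical numbers; the exact cutoff depends on the sign pattern of the data). For an all-positive column with the default (25, 75) quantile range, the IQR only becomes non-zero once density exceeds roughly 25%:

```python
import numpy as np

rng = np.random.RandomState(0)
iqr = {}
for density in [0.1, 0.2, 0.3, 0.6]:
    col = np.zeros(1000)
    k = int(density * 1000)
    col[:k] = rng.uniform(1, 2, size=k)  # positive stored values, rest zeros
    q25, q75 = np.percentile(col, [25, 75])
    iqr[density] = q75 - q25

# At 10% and 20% density both quantiles are still 0, so the IQR is 0;
# above ~25% the 75th percentile lands in the stored values.
```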
In this case, it should be mentioned that
Actually I have a second thought. More specifically, we should deprecate
@@ -905,6 +951,20 @@ def test_robust_scaler_2d_arrays():
    assert_array_almost_equal(X_scaled.std(axis=0)[0], 0)


def test_robust_scaler_equivalence_dense_sparse():
    # Check the equivalence of the fitting with dense and sparse matrices
    X_sparse = sparse.rand(1000, 5, density=0.5).tocsc()
To avoid regressions, please make sure this works with an all-positive, an all-negative, an all-zero, a mixed (positive, negative and zero), and an all-nonzero column. Check with a couple of densities, and also with integer dtypes.
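A sketch of the requested coverage could look like the following (a hypothetical test, not the PR's final code; the names, densities, and sign handling are made up), comparing the sparse and dense fits over several densities and sign patterns:

```python
import numpy as np
import pytest
from scipy import sparse
from sklearn.preprocessing import RobustScaler

@pytest.mark.parametrize("density", [0.05, 0.5, 1.0])
@pytest.mark.parametrize("sign", ["positive", "negative", "mixed"])
def test_dense_sparse_equivalence(density, sign):
    # Sparse input requires with_centering=False.
    X = sparse.random(1000, 5, density=density, random_state=0).tocsc()
    if sign == "positive":
        X.data = np.abs(X.data)
    elif sign == "negative":
        X.data = -np.abs(X.data)
    else:  # mixed signs
        X.data = X.data * 2 - 1
    scaler_sparse = RobustScaler(with_centering=False).fit(X)
    scaler_dense = RobustScaler(with_centering=False).fit(X.toarray())
    # The sparse fit must account for the implicit zeros, so the scales
    # should match the dense fit of the materialized matrix.
    np.testing.assert_allclose(scaler_sparse.scale_, scaler_dense.scale_)
```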
NaNs with integer dtype should not work.
Reminder: update the transform docstring.
This pull request introduces 1 alert when merging de147a4 into e555893 - view on LGTM.com new alerts:
Comment posted by LGTM.com
@jnothman I added a test for different cases of density and signs of the matrix.
This pull request introduces 1 alert when merging f884532 into e555893 - view on LGTM.com new alerts:
Comment posted by LGTM.com
for feature_idx in range(X.shape[1]):
    if sparse.issparse(X):
        column_nnz_data = X.data[X.indptr[feature_idx]:
                                 X.indptr[feature_idx + 1]]
I assume you realise that there should be a more asymptotically efficient way to handle the sparse case, as it should be easy to work out whether a percentile is zero, positive or negative, then adjust the quantile parameter...
But this is fine in the first instance.
To be honest, I didn't know about that.
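The suggestion above can be sketched roughly as follows (a hypothetical helper, not part of the PR): count the implicit zeros, locate the rank of the requested quantile, and decide whether it falls among the negative stored values, the zeros, or the positive stored values. The sketch uses 'lower' interpolation to keep the rank arithmetic exact.

```python
import numpy as np

def sparse_percentile_lower(nnz_data, n_total, q):
    """Percentile of a sparse column without materializing its zeros.

    nnz_data: the stored values of the column; n_total: the full column
    length, zeros included. Uses 'lower' interpolation for simplicity.
    """
    neg = np.sort(nnz_data[nnz_data < 0])
    pos = np.sort(nnz_data[nnz_data > 0])
    # Implicit zeros plus any explicitly stored zeros.
    n_zeros = n_total - len(neg) - len(pos)
    rank = int(np.floor(q / 100.0 * (n_total - 1)))
    if rank < len(neg):
        return neg[rank]
    if rank < len(neg) + n_zeros:
        return 0.0
    return pos[rank - len(neg) - n_zeros]
```

Only the stored values are sorted, so the cost scales with nnz rather than with the full column length.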
@@ -919,6 +965,29 @@ def test_robust_scaler_2d_arrays():
    assert_array_almost_equal(X_scaled.std(axis=0)[0], 0)


@pytest.mark.parametrize("density", [0, 0.01, 0.05, 0.1, 0.2, 0.5, 1])
This is now a bit excessive.
    # Check the equivalence of the fitting with dense and sparse matrices
    X_sparse = sparse.rand(1000, 5, density=density).tocsc()
    if strictly_signed == 'positif':
        X_sparse.data += X_sparse.min()
This won't make it positive; subtracting the min will, and so will abs.
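A small sketch of the point above (hypothetical data, not the PR's test): adding the minimum of mixed-sign data shifts the wrong way, while subtracting it, or taking the absolute value, yields non-negative entries.

```python
import numpy as np
from scipy import sparse

X = sparse.random(100, 5, density=0.5, random_state=0).tocsc()
X.data = X.data * 2 - 1          # mixed-sign stored values in (-1, 1)

wrong = X.data + X.data.min()    # still contains negative entries
shifted = X.data - X.data.min()  # all entries >= 0
absolute = np.abs(X.data)        # all entries >= 0
```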
@@ -919,6 +965,29 @@ def test_robust_scaler_2d_arrays():
    assert_array_almost_equal(X_scaled.std(axis=0)[0], 0)


@pytest.mark.parametrize("density", [0, 0.01, 0.05, 0.1, 0.2, 0.5, 1])
@pytest.mark.parametrize("strictly_signed",
                         ['positif', 'negatif', 'zeros', None])
typo: positive, negative
This pull request introduces 1 alert when merging 7f22b3f into 93382cc - view on LGTM.com new alerts:
Comment posted by LGTM.com
This pull request introduces 1 alert when merging 05a23f6 into eec7649 - view on LGTM.com new alerts:
Comment posted by LGTM.com
Otherwise LGTM
doc/whats_new/v0.20.rst
Outdated
@@ -274,6 +274,10 @@ Preprocessing
- :class:`preprocessing.PowerTransformer` and
  :func:`preprocessing.power_transform` ignore and pass-through NaN values.
  :issue:`11306` by :user:`Guillaume Lemaitre <glemaitre>`.
- :class:`preprocessing.RobustScaler` and :func:`preprocessing.robust_scale`
  ignore and pass-through NaN values. In addition, the scaler can now be fitted
I think the second sentence describes a separate enhancement and should be reported separately.
I would also like to see the "ignore and pass-through NaN values" what's new entries merged into one (not necessarily in this PR).
sklearn/preprocessing/data.py
Outdated
self.scale_ = (q[1] - q[0])
quantiles.append(
    nanpercentile(column_data, self.quantile_range)
    if column_data.size else [0, 0])
I think you no longer need this "if column_data.size" check?
Nope, because it returns nan otherwise (which is consistent with the numpy function):
In [1]: import numpy as np
In [2]: from sklearn.utils.fixes import nanpercentile
In [3]: nanpercentile([], [25, 50, 75])
/home/lemaitre/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/lib/nanfunctions.py:1148: RuntimeWarning: Mean of empty slice
return np.nanmean(a, axis, out=out, keepdims=keepdims)
Out[3]: nan
In [4]: np.nanpercentile([], [25, 50, 75])
/home/lemaitre/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/lib/nanfunctions.py:1148: RuntimeWarning: Mean of empty slice
return np.nanmean(a, axis, out=out, keepdims=keepdims)
Out[4]: nan
But when would you encounter 0 samples after check_array?
With sparse matrices with all zero column:
In [1]: from scipy import sparse
In [3]: from sklearn.utils import check_array
In [4]: X = sparse.rand(10, 2, density=0).tocsr()
In [5]: check_array(X, accept_sparse=True)
Out[5]:
<10x2 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>
In [6]: X.data
Out[6]: array([], dtype=float64)
Yes, but you're now only ever passing in something with shape X.shape[0].
Oh sorry, I see it now:
column_data = np.zeros(shape=X.shape[0], dtype=X.dtype)
I missed this and was thinking that we were using the nnz data, which can be an empty array. Making the changes then.
sklearn/preprocessing/data.py
Outdated
quantiles = np.transpose(quantiles)

self.scale_ = (quantiles[1] - quantiles[0])
rm parentheses
Force-pushed from 05a23f6 to eecb39a
This pull request introduces 1 alert when merging eecb39a into 3b5abf7 - view on LGTM.com new alerts:
Comment posted by LGTM.com
This pull request introduces 1 alert when merging 86cf707 into 3b5abf7 - view on LGTM.com new alerts:
Comment posted by LGTM.com
Have you checked that unused import?
I don't agree with LGTM. This is an import in fixes. Sent from my phone - sorry to be brief and potentially misspelled.
@ogrisel, good to merge?
Merged! Thanks, @glemaitre!
check_is_fitted(self, 'center_', 'scale_')
X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
                estimator=self, dtype=FLOAT_DTYPES,
                force_all_finite='allow-nan')
FYI this affected dask-ml. Previously the logic in this transformer was equally applicable to numpy and dask arrays. Now it auto-converts dask arrays to numpy arrays.
Reference Issues/PRs
Toward #10404
Merge #11011 before this one, since it provides better tests in the common tests.
What does this implement/fix? Explain your changes.
TODO:
Any other comments?