Skip to content

[MRG] Ignore and pass-through NaNs in RobustScaler and robust_scale #11308

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Jul 5, 2018

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented Jun 18, 2018

Reference Issues/PRs

Toward #10404

Merge #11011 before due to better test in the common tests.

What does this implement/fix? Explain your changes.

TODO:

  • RobustScaler should handle better sparse matrix.

Any other comments?

@glemaitre glemaitre force-pushed the nan_robust_scaler branch from 37f7cb5 to b1c5a21 Compare June 18, 2018 09:58
@glemaitre glemaitre changed the title [WIP] Ignore and pass-through NaNs in RobustScaler and robust_scale [MRG] Ignore and pass-through NaNs in RobustScaler and robust_scale Jun 18, 2018
@jnothman jnothman mentioned this pull request Jun 18, 2018
7 tasks
@glemaitre glemaitre added this to the 0.20 milestone Jun 18, 2018
@glemaitre
Copy link
Member Author

@ogrisel @jorisvandenbossche @jeremiedbb @lesteve @rth @amueller @qinhanmin2014
I would appreciate reviews.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your sparse adaptation doesn't work. The qth quantile of non-zero values is not the qth quantile of values including (majority) zeros.

@glemaitre
Copy link
Member Author

Your sparse adaptation doesn't work. The qth quantile of non-zero values is not the qth quantile of values including (majority) zeros.

Oh so I should materialize the zero.

@jnothman
Copy link
Member

jnothman commented Jun 18, 2018 via email

@glemaitre
Copy link
Member Author

I think just leave it as not handling the sparse case. The IQR is only likely to be > 0 once density > 20%. So for many sparse matrices, quantiles are not relevant.​

In this case, it should be mentioned that robust_scale does not work on sparse matrices.
It was actually my original issue :)

@glemaitre
Copy link
Member Author

I think just leave it as not handling the sparse case. The IQR is only likely to be > 0 once density > 20%. So for many sparse matrices, quantiles are not relevant.​

Actually I have a second thought. More specifically, we should deprecate transform support for sparse matrix. I don't know if it worth or instead compute properly the IQR (in case some people use sparse matrices with density large enough).

@@ -905,6 +951,20 @@ def test_robust_scaler_2d_arrays():
assert_array_almost_equal(X_scaled.std(axis=0)[0], 0)


def test_robust_scaler_equivalence_dense_sparse():
# Check the equivalence of the fitting with dense and sparse matrices
X_sparse = sparse.rand(1000, 5, density=0.5).tocsc()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid regressions, please make sure this works with an all-positive, all-negative, all-zero, positive, negative and zero, and all-nonzero columns. Check with a couple of densities and also integer dtypes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NaNs with integer dtype should not work.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder: update transform docstring

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging de147a4 into e555893 - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

@glemaitre
Copy link
Member Author

@jnothman I added a test for different cases of density and signs of the matrix.

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging f884532 into e555893 - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

for feature_idx in range(X.shape[1]):
if sparse.issparse(X):
column_nnz_data = X.data[X.indptr[feature_idx]:
X.indptr[feature_idx + 1]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you realise that there should be a more asymptotically efficient way to handle the sparse case, as it should be easy to work out whether a percentile is zero, positive or negative, then adjust the quantile parameter...

But this is fine in the first instance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I don't know about it.

@@ -919,6 +965,29 @@ def test_robust_scaler_2d_arrays():
assert_array_almost_equal(X_scaled.std(axis=0)[0], 0)


@pytest.mark.parametrize("density", [0, 0.01, 0.05, 0.1, 0.2, 0.5, 1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now a bit excessive.

# Check the equivalence of the fitting with dense and sparse matrices
X_sparse = sparse.rand(1000, 5, density=density).tocsc()
if strictly_signed == 'positif':
X_sparse.data += X_sparse.min()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't make it positive. Subtracting the min will. So will abs

@@ -919,6 +965,29 @@ def test_robust_scaler_2d_arrays():
assert_array_almost_equal(X_scaled.std(axis=0)[0], 0)


@pytest.mark.parametrize("density", [0, 0.01, 0.05, 0.1, 0.2, 0.5, 1])
@pytest.mark.parametrize("strictly_signed",
['positif', 'negatif', 'zeros', None])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: positive, negative

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging 7f22b3f into 93382cc - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging 05a23f6 into eec7649 - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM

@@ -274,6 +274,10 @@ Preprocessing
- :class:`preprocessing.PowerTransformer` and
:func:`preprocessing.power_transform` ignore and pass-through NaN values.
:issue:`11306` by :user:`Guillaume Lemaitre <glemaitre>`.
- :class:`preprocessing.RobustScaler` and :func:`preprocessing.robust_scale`
ignore and pass-through NaN values. In addition, the scaler can now be fitted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the second comment is a separate enhancement and should be reported separately.

I would like to see the "ignore and pass through NaN values" what's news merged into one (not necessarily in this PR).

self.scale_ = (q[1] - q[0])
quantiles.append(
nanpercentile(column_data, self.quantile_range)
if column_data.size else [0, 0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you no longer need this if column_data.size check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nop because it returns nan otherwise (which is consistent with the numpy function).

In [1]: import numpy as np

In [2]: from sklearn.utils.fixes import nanpercentile

In [3]: nanpercentile([], [25, 50, 75])
/home/lemaitre/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/lib/nanfunctions.py:1148: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
Out[3]: nan

In [4]: np.nanpercentile([], [25, 50, 75])
/home/lemaitre/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/lib/nanfunctions.py:1148: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
Out[4]: nan

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But when would you encounter 0 samples after check_array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With sparse matrices with all zero column:

In [1]: from scipy import sparse

In [3]: from sklearn.utils import check_array

In [4]: X = sparse.rand(10, 2, density=0).tocsr()

In [5]: check_array(X, accept_sparse=True)
Out[5]: 
<10x2 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>

In [6]: X.data
Out[6]: array([], dtype=float64)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you're now only ever passing in something with shape X.shape[0]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, I see now

column_data = np.zeros(shape=X.shape[0], dtype=X.dtype)

I missed this and I was thinking that we were using the data.nnz which can be an empty array.

Making the changes then.


quantiles = np.transpose(quantiles)

self.scale_ = (quantiles[1] - quantiles[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rm parentheses

@glemaitre glemaitre force-pushed the nan_robust_scaler branch from 05a23f6 to eecb39a Compare June 26, 2018 16:20
@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging eecb39a into 3b5abf7 - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging 86cf707 into 3b5abf7 - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

@jnothman
Copy link
Member

Have you cheers that unused import?

@glemaitre
Copy link
Member Author

glemaitre commented Jun 28, 2018 via email

@jnothman
Copy link
Member

@ogrisel, good to merge?

@ogrisel ogrisel merged commit f8adfa2 into scikit-learn:master Jul 5, 2018
@ogrisel
Copy link
Member

ogrisel commented Jul 5, 2018

Merged! Thanks, @glemaitre!

check_is_fitted(self, 'center_', 'scale_')
X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
estimator=self, dtype=FLOAT_DTYPES,
force_all_finite='allow-nan')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this affected dask-ml. Previously the logic in this transformer was equally applicable to numpy and dask arrays. Now it auto-converts dask arrays to numpy arrays.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants