[MRG] Ignore and pass-through NaNs in RobustScaler and robust_scale #11308
Conversation
Force-pushed from 37f7cb5 to b1c5a21
@ogrisel @jorisvandenbossche @jeremiedbb @lesteve @rth @amueller @qinhanmin2014
Your sparse adaptation doesn't work. The qth quantile of non-zero values is not the qth quantile of values including (majority) zeros.
Oh, so I should materialize the zeros.
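The point above can be seen with a small sketch (a hypothetical example, not code from the PR): the quantiles of the stored values alone differ from the quantiles of the full column once the implicit zeros are counted.

```python
import numpy as np
from scipy import sparse

# Hypothetical illustration: one sparse column, 10% density, positive values.
rng = np.random.RandomState(0)
X = sparse.random(100, 1, density=0.1, random_state=rng, format='csc')

col = X.toarray().ravel()  # zeros materialized: all 100 entries
nnz = X.data               # only the 10 stored non-zero values

# Over all 100 entries the 25th percentile is 0 (zeros dominate),
# while over the stored values alone it is strictly positive.
q_dense = np.percentile(col, 25)
q_nnz = np.percentile(nnz, 25)
```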
I think just leave it as not handling the sparse case. The IQR is only likely to be > 0 once density > 20%. So for many sparse matrices, quantiles are not relevant.
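The density threshold can be checked with a quick sketch (hypothetical numbers; the exact cutoff depends on the sign pattern of the data). For an all-positive column with the default (25, 75) quantile range, the IQR only becomes non-zero once density exceeds roughly 25%:

```python
import numpy as np

rng = np.random.RandomState(0)
iqr = {}
for density in [0.1, 0.2, 0.3, 0.6]:
    col = np.zeros(1000)
    k = int(density * 1000)
    col[:k] = rng.uniform(1, 2, size=k)  # positive stored values, rest zeros
    q25, q75 = np.percentile(col, [25, 75])
    iqr[density] = q75 - q25

# At 10% and 20% density both quantiles are still 0, so the IQR is 0;
# above ~25% the 75th percentile lands in the stored values.
```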
In this case, it should be mentioned that
Actually I have a second thought. More specifically, we should deprecate
@@ -905,6 +951,20 @@ def test_robust_scaler_2d_arrays():
    assert_array_almost_equal(X_scaled.std(axis=0)[0], 0)


def test_robust_scaler_equivalence_dense_sparse():
    # Check the equivalence of the fitting with dense and sparse matrices
    X_sparse = sparse.rand(1000, 5, density=0.5).tocsc()
To avoid regressions, please make sure this works with an all-positive, an all-negative, an all-zero, a mixed (positive, negative and zero), and an all-nonzero column. Check with a couple of densities, and also with integer dtypes.
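A sketch of the requested coverage could look like the following (a hypothetical test, not the PR's final code; the names, densities, and sign handling are made up), comparing the sparse and dense fits over several densities and sign patterns:

```python
import numpy as np
import pytest
from scipy import sparse
from sklearn.preprocessing import RobustScaler

@pytest.mark.parametrize("density", [0.05, 0.5, 1.0])
@pytest.mark.parametrize("sign", ["positive", "negative", "mixed"])
def test_dense_sparse_equivalence(density, sign):
    # Sparse input requires with_centering=False.
    X = sparse.random(1000, 5, density=density, random_state=0).tocsc()
    if sign == "positive":
        X.data = np.abs(X.data)
    elif sign == "negative":
        X.data = -np.abs(X.data)
    else:  # mixed signs
        X.data = X.data * 2 - 1
    scaler_sparse = RobustScaler(with_centering=False).fit(X)
    scaler_dense = RobustScaler(with_centering=False).fit(X.toarray())
    # The sparse fit must account for the implicit zeros, so the scales
    # should match the dense fit of the materialized matrix.
    np.testing.assert_allclose(scaler_sparse.scale_, scaler_dense.scale_)
```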
NaNs with integer dtype should not work.
Reminder: update the transform docstring.
This pull request introduces 1 alert when merging de147a4 into e555893 - view on LGTM.com new alerts:
Comment posted by LGTM.com
@jnothman I added a test for different cases of density and signs of the matrix.
This pull request introduces 1 alert when merging f884532 into e555893 - view on LGTM.com new alerts:
Comment posted by LGTM.com
for feature_idx in range(X.shape[1]):
    if sparse.issparse(X):
        column_nnz_data = X.data[X.indptr[feature_idx]:
                                 X.indptr[feature_idx + 1]]
I assume you realise that there should be a more asymptotically efficient way to handle the sparse case, as it should be easy to work out whether a percentile is zero, positive or negative, then adjust the quantile parameter...
But this is fine in the first instance.
To be honest, I didn't know about that.
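The suggestion above can be sketched roughly as follows (a hypothetical helper, not part of the PR): count the implicit zeros, locate the rank of the requested quantile, and decide whether it falls among the negative stored values, the zeros, or the positive stored values. The sketch uses 'lower' interpolation to keep the rank arithmetic exact.

```python
import numpy as np

def sparse_percentile_lower(nnz_data, n_total, q):
    """Percentile of a sparse column without materializing its zeros.

    nnz_data: the stored values of the column; n_total: the full column
    length, zeros included. Uses 'lower' interpolation for simplicity.
    """
    neg = np.sort(nnz_data[nnz_data < 0])
    pos = np.sort(nnz_data[nnz_data > 0])
    # Implicit zeros plus any explicitly stored zeros.
    n_zeros = n_total - len(neg) - len(pos)
    rank = int(np.floor(q / 100.0 * (n_total - 1)))
    if rank < len(neg):
        return neg[rank]
    if rank < len(neg) + n_zeros:
        return 0.0
    return pos[rank - len(neg) - n_zeros]
```

Only the stored values are sorted, so the cost scales with nnz rather than with the full column length.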
@@ -919,6 +965,29 @@ def test_robust_scaler_2d_arrays():
    assert_array_almost_equal(X_scaled.std(axis=0)[0], 0)


@pytest.mark.parametrize("density", [0, 0.01, 0.05, 0.1, 0.2, 0.5, 1])
This is now a bit excessive.
    # Check the equivalence of the fitting with dense and sparse matrices
    X_sparse = sparse.rand(1000, 5, density=density).tocsc()
    if strictly_signed == 'positif':
        X_sparse.data += X_sparse.min()
This won't make it positive; subtracting the min will, and so will abs.
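A small sketch of the point above (hypothetical data, not the PR's test): adding the minimum of mixed-sign data shifts the wrong way, while subtracting it, or taking the absolute value, yields non-negative entries.

```python
import numpy as np
from scipy import sparse

X = sparse.random(100, 5, density=0.5, random_state=0).tocsc()
X.data = X.data * 2 - 1          # mixed-sign stored values in (-1, 1)

wrong = X.data + X.data.min()    # still contains negative entries
shifted = X.data - X.data.min()  # all entries >= 0
absolute = np.abs(X.data)        # all entries >= 0
```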
@@ -919,6 +965,29 @@ def test_robust_scaler_2d_arrays():
    assert_array_almost_equal(X_scaled.std(axis=0)[0], 0)


@pytest.mark.parametrize("density", [0, 0.01, 0.05, 0.1, 0.2, 0.5, 1])
@pytest.mark.parametrize("strictly_signed",
                         ['positif', 'negatif', 'zeros', None])
typo: positive, negative
This pull request introduces 1 alert when merging 7f22b3f into 93382cc - view on LGTM.com new alerts:
Comment posted by LGTM.com
This pull request introduces 1 alert when merging 05a23f6 into eec7649 - view on LGTM.com new alerts:
Comment posted by LGTM.com
Otherwise LGTM
doc/whats_new/v0.20.rst
Outdated
@@ -274,6 +274,10 @@ Preprocessing
- :class:`preprocessing.PowerTransformer` and
  :func:`preprocessing.power_transform` ignore and pass-through NaN values.
  :issue:`11306` by :user:`Guillaume Lemaitre <glemaitre>`.
- :class:`preprocessing.RobustScaler` and :func:`preprocessing.robust_scale`
  ignore and pass-through NaN values. In addition, the scaler can now be fitted
I think the second sentence describes a separate enhancement and should be reported separately.
I would also like to see the "ignore and pass-through NaN values" what's new entries merged into one (not necessarily in this PR).
sklearn/preprocessing/data.py
Outdated
self.scale_ = (q[1] - q[0])
quantiles.append(
    nanpercentile(column_data, self.quantile_range)
    if column_data.size else [0, 0])
I think you no longer need this "if column_data.size" check?
Nope, because it returns nan otherwise (which is consistent with the numpy function):
In [1]: import numpy as np
In [2]: from sklearn.utils.fixes import nanpercentile
In [3]: nanpercentile([], [25, 50, 75])
/home/lemaitre/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/lib/nanfunctions.py:1148: RuntimeWarning: Mean of empty slice
return np.nanmean(a, axis, out=out, keepdims=keepdims)
Out[3]: nan
In [4]: np.nanpercentile([], [25, 50, 75])
/home/lemaitre/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/lib/nanfunctions.py:1148: RuntimeWarning: Mean of empty slice
return np.nanmean(a, axis, out=out, keepdims=keepdims)
Out[4]: nan
But when would you encounter 0 samples after check_array?
With sparse matrices with all zero column:
In [1]: from scipy import sparse
In [3]: from sklearn.utils import check_array
In [4]: X = sparse.rand(10, 2, density=0).tocsr()
In [5]: check_array(X, accept_sparse=True)
Out[5]:
<10x2 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>
In [6]: X.data
Out[6]: array([], dtype=float64)
Yes, but you're now only ever passing in something with shape X.shape[0].
Oh sorry, I see it now:
column_data = np.zeros(shape=X.shape[0], dtype=X.dtype)
I missed this and was thinking that we were using the nnz data, which can be an empty array. Making the changes then.
sklearn/preprocessing/data.py
Outdated
quantiles = np.transpose(quantiles)

self.scale_ = (quantiles[1] - quantiles[0])
rm parentheses
Force-pushed from 05a23f6 to eecb39a
This pull request introduces 1 alert when merging eecb39a into 3b5abf7 - view on LGTM.com new alerts:
Comment posted by LGTM.com
This pull request introduces 1 alert when merging 86cf707 into 3b5abf7 - view on LGTM.com new alerts:
Comment posted by LGTM.com
Have you checked that unused import?
I don't agree with LGTM. This is an import in fixes. Sent from my phone - sorry to be brief and potentially misspelled.
@ogrisel, good to merge?
Merged! Thanks, @glemaitre!
check_is_fitted(self, 'center_', 'scale_')
X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
                estimator=self, dtype=FLOAT_DTYPES,
                force_all_finite='allow-nan')
FYI this affected dask-ml. Previously the logic in this transformer was equally applicable to numpy and dask arrays. Now it auto-converts dask arrays to numpy arrays.
Reference Issues/PRs
Toward #10404
Merge #11011 before this one, since it provides better tests in the common tests.
What does this implement/fix? Explain your changes.
TODO:
Any other comments?