[MRG+1] FIX: enforce consistency between dense and sparse cases in StandardScaler #11235
Conversation
@jnothman I would just correct the inconsistencies for the moment. It allows me to go further in the ignoring-NaNs PR.
@@ -701,6 +703,62 @@ def test_scaler_without_centering():
    assert_array_almost_equal(X_csc_scaled_back.toarray(), X)


def _check_attributes_scalers(scaler_1, scaler_2):
    assert scaler_1.mean_ == scaler_2.mean_
don't we need array equality?
Since this is for the case where ``mean_`` will be ``None``, I don't think so.
I would rather make the check explicit in this case:
assert scaler_1.mean_ is scaler_2.mean_ is None
(I did not know that such ternary identity assertions would be valid Python, but it seems to be the case :)
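For reference, a chained identity assertion is just an ordinary chained comparison, so ``a is b is None`` evaluates as ``(a is b) and (b is None)``. A quick illustration:

    a = None
    b = None
    assert a is b is None      # passes: a is b, and b is None

    b = []
    # assert a is b is None    # would now fail, since a is not b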
doc/whats_new/v0.20.rst
@@ -504,6 +504,10 @@ Preprocessing
  when returning a sparse matrix output. :issue:`11042` by :user:`Daniel
  Morales <DanielMorales9>`.

- Fix inconsistency between sparse and dense case in
Is this merely an API inconsistency, or will it affect users?
When running ``fit`` twice with a sparse matrix and ``with_mean=False`` and ``with_std=False``, it was crashing. That's why I thought it would be a bug fix.
Do you prefer to move it to the API change summary (or not document it in what's new)?
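For illustration, a minimal sketch of the scenario described above (hypothetical data; before this fix, the second ``fit`` call on sparse input raised an error):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.preprocessing import StandardScaler

    X_sparse = sp.csr_matrix(np.random.RandomState(0).rand(10, 3))

    scaler = StandardScaler(with_mean=False, with_std=False)
    scaler.fit(X_sparse)
    scaler.fit(X_sparse)  # used to crash, since the first fit never set n_samples_seen_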
And in the dense case we change ``mean_`` from returning an array to returning ``None``, which seems more logical and is what already happens for ``var_``.
I'm just trying to work out how to make clear to users how this change will affect them. I don't think your message makes that clear at all.
Oh OK, is this formulation better?
Fix inconsistencies in :class:`preprocessing.StandardScaler` with `with_mean=False`
and `with_std=False`. ``mean_`` will be set to ``None`` with both sparse and dense
inputs. ``n_samples_seen_`` will also be reported for both input types.
I suppose, but you also said there was a crash in the sparse case with multiple calls...?
Because ``n_samples_seen_`` was not computed.
So saying that you fixed ``partial_fit`` in StandardScaler would be much more practically informative.
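For context, a sketch of the ``partial_fit`` behaviour this refers to (assuming the fixed code; hypothetical data):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.preprocessing import StandardScaler

    X = sp.csr_matrix(np.random.RandomState(0).rand(6, 3))

    scaler = StandardScaler(with_mean=False, with_std=False)
    scaler.partial_fit(X[:3])
    scaler.partial_fit(X[3:])

    assert scaler.mean_ is None                 # consistent with the dense case
    assert np.all(scaler.n_samples_seen_ == 6)  # now reported for sparse input too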
Agreed. I'll change the entry.
Thanks :)
@ogrisel if you are around, I'd like some feedback :)
Here are some comments on the tests. The change itself looks fine to me.
    X_trans_csc = transformer_csc.fit_transform(X_csc)

    assert_array_almost_equal(X_trans_csr.A, X_csr.A)
    assert_array_almost_equal(X_trans_csc.A, X_csc.A)
What is this ``.A`` attribute on sparse matrices? It's not documented in the docstring. I would rather use the more explicit ``X_trans_csr.toarray(), X_csr.toarray()`` instead.
Also, didn't we decide to favor ``assert_allclose`` instead of ``assert_array_almost_equal``?
Also, ``X_dense`` here has an integer dtype. I think we should rather run this test on ``np.float32`` or ``np.float64``.
If we test on integers, we should in addition check that we get the expected dtype. For standard scaling I would expect to always get floating-point values on the output, even with ``with_mean=False, with_std=False``, to keep consistency. But if you disagree, I can probably be convinced otherwise.
This is a good point to keep in mind. However, I would delegate this part to the issues/PRs which try to preserve the dtype: #11000
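For reference, a sketch of what the suggested assertions might look like (hypothetical data and variable names, not the actual test diff), using a floating-point ``X_dense``, ``toarray()`` and ``assert_allclose``:

    import numpy as np
    import scipy.sparse as sp
    from numpy.testing import assert_allclose
    from sklearn.preprocessing import StandardScaler

    X_dense = np.random.RandomState(42).rand(10, 4).astype(np.float64)
    X_csr = sp.csr_matrix(X_dense)
    X_csc = sp.csc_matrix(X_dense)

    transformer_csr = StandardScaler(with_mean=False, with_std=False)
    transformer_csc = StandardScaler(with_mean=False, with_std=False)
    X_trans_csr = transformer_csr.fit_transform(X_csr)
    X_trans_csc = transformer_csc.fit_transform(X_csc)

    # with_mean=False and with_std=False make the transform an identity,
    # so the data should come back unchanged
    assert_allclose(X_trans_csr.toarray(), X_csr.toarray())
    assert_allclose(X_trans_csc.toarray(), X_csc.toarray())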
@@ -701,6 +703,62 @@ def test_scaler_without_centering():
    assert_array_almost_equal(X_csc_scaled_back.toarray(), X)


def _check_attributes_scalers(scaler_1, scaler_2):
Nitpick: this sounds French to me. ``_check_scalers_attributes`` is more natural, I think.
Also, if the function is only valid for the identity case, we should probably reflect that in the function name: ``_check_identity_scalers_attributes``.
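In other words, something along these lines (a sketch of the renamed helper; the exact set of attributes checked is up to the PR):

    def _check_identity_scalers_attributes(scaler_1, scaler_2):
        # with_mean=False and with_std=False: no statistics are computed
        assert scaler_1.mean_ is scaler_2.mean_ is None
        assert scaler_1.var_ is scaler_2.var_ is None
        # n_samples_seen_ is a plain integer when there are no missing values
        assert scaler_1.n_samples_seen_ == scaler_2.n_samples_seen_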
LGTM.
doc/whats_new/v0.20.rst
@@ -504,6 +504,14 @@ Preprocessing
  when returning a sparse matrix output. :issue:`11042` by :user:`Daniel
  Morales <DanielMorales9>`.

- Fix ``fit`` and ``partial_fit`` in :class:`preprocessing.StandardScaler` with
  `with_mean=False` and `with_std=False` which was crashing by calling ``fit``
Maybe note that this is a rare case so readers don't worry about the change.
Reference Issues/PRs
closes #11234
What does this implement/fix? Explain your changes.
Ensure that the ``mean_`` and ``n_samples_seen_`` attributes are the same in the sparse and dense cases with ``StandardScaler(with_mean=False, with_std=False)``.
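A minimal sketch of the intended behaviour (hypothetical data):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.preprocessing import StandardScaler

    X_dense = np.random.RandomState(0).rand(8, 3)
    X_sparse = sp.csr_matrix(X_dense)

    dense_scaler = StandardScaler(with_mean=False, with_std=False).fit(X_dense)
    sparse_scaler = StandardScaler(with_mean=False, with_std=False).fit(X_sparse)

    # identical attributes whether the input is dense or sparse
    assert dense_scaler.mean_ is sparse_scaler.mean_ is None
    assert dense_scaler.n_samples_seen_ == sparse_scaler.n_samples_seen_ == 8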
Any other comments?