
[MRG+1] FIX: enforce consistency between dense and sparse cases in StandardScaler #11235


Merged: 9 commits into scikit-learn:master on Jun 14, 2018

Conversation

glemaitre (Member)

Reference Issues/PRs

closes #11234

What does this implement/fix? Explain your changes.

Ensure that the ``mean_`` and ``n_samples_seen_`` attributes are the same in the sparse and dense cases with StandardScaler(with_mean=False, with_std=False).
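A minimal sketch of the behaviour this PR enforces (data is illustrative; assumes scikit-learn 0.20 or later):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

X_dense = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_sparse = sp.csr_matrix(X_dense)

# Identity scaler: neither centering nor scaling is requested.
scaler_dense = StandardScaler(with_mean=False, with_std=False).fit(X_dense)
scaler_sparse = StandardScaler(with_mean=False, with_std=False).fit(X_sparse)

# After the fix, both input types expose the same trivial statistics.
assert scaler_dense.mean_ is None and scaler_sparse.mean_ is None
assert scaler_dense.n_samples_seen_ == scaler_sparse.n_samples_seen_ == 3
```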

Any other comments?

@glemaitre (Member Author)

@jnothman I would just correct the inconsistencies for the moment. It lets me make progress on the NaN-ignoring PR.

@glemaitre glemaitre changed the title FIX enforce consistency between dense and sparse cases in StandardScaler [MRG] FIX enforce consistency between dense and sparse cases in StandardScaler Jun 11, 2018
@@ -701,6 +703,62 @@ def test_scaler_without_centering():
assert_array_almost_equal(X_csc_scaled_back.toarray(), X)


def _check_attributes_scalers(scaler_1, scaler_2):
assert scaler_1.mean_ == scaler_2.mean_
Member

don't we need array equality?

Member Author

Since this is for the case where ``mean_`` will be None, I don't think so.

Member

I would rather make the check explicit in this case:

assert scaler_1.mean_ is scaler_2.mean_ is None

(I did not know that such ternary identity assertions were valid Python, but it seems to be the case :)
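For reference, Python chains comparison operators (including `is`), so the suggested assertion expands to two pairwise checks. A small sketch:

```python
a = None
b = None
# `a is b is None` is evaluated as `(a is b) and (b is None)`,
# exactly like chained `<` or `==` comparisons.
assert a is b is None

# Identity is stricter than equality: two empty lists compare
# equal but are distinct objects.
x, y = [], []
assert x == y
assert x is not y
```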

@@ -504,6 +504,10 @@ Preprocessing
when returning a sparse matrix output. :issue:`11042` by :user:`Daniel
Morales <DanielMorales9>`.

- Fix inconsistency between sparse and dense case in
Member

Is this merely an API inconsistency, or will it affect users?

Member Author

When calling fit twice with a sparse matrix and with_mean=False and with_std=False, it was crashing. That's why I considered it a bug fix.

Do you prefer to move it to the API change summary (or not document it in what's new)?
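A sketch of the scenario described above, which runs cleanly after the fix (data is illustrative; assumes scikit-learn 0.20 or later):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

X = sp.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
scaler = StandardScaler(with_mean=False, with_std=False)

# Before the fix, the second `fit` on sparse input crashed because
# `n_samples_seen_` was never computed on the first pass.
scaler.fit(X)
scaler.fit(X)
assert scaler.n_samples_seen_ == 2
```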

Member Author

And in the dense case we change ``mean_`` from returning an array to returning None, which seems more logical and matches what already happens for ``var_``.

Member

I'm just trying to work out how to make clear to users how this change will affect them. I don't think your message makes that clear at all.

Member Author

Oh OK, is this formulation better?

Fix inconsistencies in :class:`preprocessing.StandardScaler` with `with_mean=False`
and `with_std=False`. ``mean_`` will be set to ``None`` with both sparse and dense
inputs. ``n_samples_seen_`` will also be reported for both input types.

@jnothman (Member)

jnothman commented Jun 12, 2018 via email

@glemaitre (Member Author)

Because ``n_samples_seen_`` was not computed.

@jnothman (Member)

jnothman commented Jun 12, 2018 via email

@glemaitre (Member Author)

> So saying that you fixed partial_fit in StandardScaler would be much more practically informative.

Agreed. I changed the entry.

@jnothman (Member)

jnothman commented Jun 12, 2018 via email

@glemaitre (Member Author)

@ogrisel if you are around, I'd like some feedback :)

@glemaitre glemaitre changed the title [MRG] FIX enforce consistency between dense and sparse cases in StandardScaler [MRG] FIX: count n_samples_seen in fit and partial_fit in StandardScaler Jun 12, 2018
@ogrisel (Member) left a comment

Here are some comments on the tests. The change itself looks fine to me.


X_trans_csc = transformer_csc.fit_transform(X_csc)

assert_array_almost_equal(X_trans_csr.A, X_csr.A)
assert_array_almost_equal(X_trans_csc.A, X_csc.A)
Member

What is this .A attribute on sparse matrices? It's not documented in the docstring. I would rather use the more explicit X_trans_csr.toarray() and X_csr.toarray() instead.

Also, didn't we decide to favor assert_allclose instead of assert_array_almost_equal?

Also, here X_dense has an integer dtype. I think we should rather run this test on np.float32 or np.float64.

If we test on integers, we should in addition check that we get the expected dtype. For standard scaling I would expect to always get floating point values on the output, even with with_mean=False, with_std=False to keep consistency. But if you disagree, I can probably be convinced otherwise.
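To illustrate the two review points (a small sketch; data is illustrative): `.A` is scipy.sparse shorthand for `.toarray()`, and `assert_allclose` compares with explicit tolerances rather than the decimal counting of `assert_array_almost_equal`:

```python
import numpy as np
import scipy.sparse as sp
from numpy.testing import assert_allclose

X = sp.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))

# `.A` is equivalent to `.toarray()`; the explicit call reads better
# in test code.
assert np.array_equal(X.A, X.toarray())

# Preferred comparison style in newer tests: relative tolerance.
assert_allclose(X.toarray(), [[1.0, 0.0], [0.0, 2.0]], rtol=1e-7)
```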

Member Author

> If we test on integers, we should in addition check that we get the expected dtype. For standard scaling I would expect to always get floating point values on the output, even with with_mean=False, with_std=False to keep consistency. But if you disagree, I can probably be convinced otherwise.

This is a good point to keep in mind. However, I would delegate this part to the issue/PR that tries to preserve the dtype: #11000

@@ -701,6 +703,62 @@ def test_scaler_without_centering():
assert_array_almost_equal(X_csc_scaled_back.toarray(), X)


def _check_attributes_scalers(scaler_1, scaler_2):
Member

Nitpick: this sounds French to me. _check_scalers_attributes is more natural, I think.

Member

Also if the function is only valid for the identity case, we should probably reflect that in the function name:

_check_identity_scalers_attributes

@glemaitre glemaitre changed the title [MRG] FIX: count n_samples_seen in fit and partial_fit in StandardScaler [MRG] FIX: enforce consistency between dense and sparse cases in StandardScaler Jun 12, 2018
@ogrisel (Member) left a comment

LGTM.

@ogrisel ogrisel changed the title [MRG] FIX: enforce consistency between dense and sparse cases in StandardScaler [MRG+1] FIX: enforce consistency between dense and sparse cases in StandardScaler Jun 13, 2018
@ogrisel ogrisel added this to the 0.20 milestone Jun 13, 2018
@@ -504,6 +504,14 @@ Preprocessing
when returning a sparse matrix output. :issue:`11042` by :user:`Daniel
Morales <DanielMorales9>`.

- Fix ``fit`` and ``partial_fit`` in :class:`preprocessing.StandardScaler` with
`with_mean=False` and `with_std=False` which was crashing by calling ``fit``
Member

Maybe note that this is a rare case so readers don't worry about the change

@glemaitre (Member Author)

@jnothman @ogrisel Last comments addressed. Good to be merged :)

@jnothman jnothman merged commit a4f8e3d into scikit-learn:master Jun 14, 2018
Successfully merging this pull request may close these issues.

Consistency issue in StandardScaler
3 participants