[MRG] DOC Mention StandardScaler ddof #12950

Merged

MarcoGorelli
Contributor

Reference Issues/PRs

Fixes #7757

What does this implement/fix? Explain your changes.

Expands the documentation so it's clear that the estimate of the standard deviation in StandardScaler is the biased one (equivalent to numpy.sqrt(numpy.var(x, ddof=0))).
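The equivalence described above can be checked with a short sketch. The toy array `X` is hypothetical, and the check assumes the fitted `scale_` attribute of `StandardScaler` holds the per-feature standard deviation used for scaling:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 5 samples, 2 features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0],
              [5.0, 60.0]])

scaler = StandardScaler().fit(X)

# ddof=0 divides the sum of squared deviations by n (the biased variance).
ddof0_std = np.sqrt(np.var(X, axis=0, ddof=0))
# ddof=1 divides by n - 1 instead.
ddof1_std = np.sqrt(np.var(X, axis=0, ddof=1))

matches_ddof0 = np.allclose(scaler.scale_, ddof0_std)
matches_ddof1 = np.allclose(scaler.scale_, ddof1_std)
print(matches_ddof0, matches_ddof1)
```

With only five samples the two estimates differ by a factor of `sqrt(5 / 4)`, so only the `ddof=0` comparison should hold.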

Any other comments?

@MarcoGorelli MarcoGorelli changed the title Fix doc standardscaler Fix "StandardScaler that supports option to use unbiased estimator of variance" Jan 10, 2019
@MarcoGorelli MarcoGorelli changed the title Fix "StandardScaler that supports option to use unbiased estimator of variance" [MRG] Fix "StandardScaler that supports option to use unbiased estimator of variance" Jan 10, 2019
Member

@jnothman jnothman left a comment

Put it in a Notes section and explain that the choice of ddof is unlikely to affect ML performance
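The review comment does not spell out why the choice of `ddof` should matter little. One possible illustration, sketched here with NumPy only and hypothetical random data, is that standardizing with `ddof=0` and with `ddof=1` differs only by a single constant factor, `sqrt(n / (n - 1))`, applied to every feature:

```python
import numpy as np

# Hypothetical data: 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
n = X.shape[0]

centered = X - X.mean(axis=0)
Z0 = centered / X.std(axis=0, ddof=0)  # biased estimate (divide by n)
Z1 = centered / X.std(axis=0, ddof=1)  # divide by n - 1

# The two standardized arrays differ by one global constant.
factor = np.sqrt(n / (n - 1))
same_up_to_constant = np.allclose(Z0, factor * Z1)
print(factor, same_up_to_constant)
```

The factor approaches 1 as `n` grows, so for typical dataset sizes the two conventions produce nearly identical features.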

@MarcoGorelli
Contributor Author

What do you mean by 'Put it in a Notes section'? I've searched 'notes' in the contributing guidelines, but all I can find is a section about 'working notes'

Member

@qinhanmin2014 qinhanmin2014 left a comment

See https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.StratifiedKFold.html for an example of a Notes section.
I agree that we should emphasize the difference in a Notes section, since we're not going to modify our implementation.

@@ -478,7 +478,10 @@ class StandardScaler(BaseEstimator, TransformerMixin):

where `u` is the mean of the training samples or zero if `with_mean=False`,
and `s` is the standard deviation of the training samples or one if
-`with_std=False`.
+`with_std=False`. Note that `s` is a biased estimator of the standard
+deviation, equivalent to numpy.sqrt(numpy.var(x, ddof=0)), and that it is
Member

use numpy.std instead?

Member

@qinhanmin2014 qinhanmin2014 left a comment

thanks @MarcoGorelli

@@ -574,6 +574,10 @@ class StandardScaler(BaseEstimator, TransformerMixin):
-----
NaNs are treated as missing values: disregarded in fit, and maintained in
transform.
+
+We use a biased estimator for the standard deviation, equivalent to
+`numpy.std(x, ddof=0)`. Note, however, that the choice of `ddof` is
Member

remove however?

Member

@jnothman jnothman left a comment

Actually, before merge, can we get this note replicated in the scale function?
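As a rough check that the same convention applies to the `scale` function, the sketch below standardizes a hypothetical three-sample array with `sklearn.preprocessing.scale` and measures the resulting column spread under both `ddof` settings:

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical toy data: 3 samples, 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 13.0]])

X_scaled = scale(X)

# If scale() divides by the ddof=0 standard deviation, the scaled columns
# have unit standard deviation only when measured with ddof=0.
col_std_ddof0 = X_scaled.std(axis=0, ddof=0)
col_std_ddof1 = X_scaled.std(axis=0, ddof=1)
print(col_std_ddof0, col_std_ddof1)
```

With three samples, the `ddof=1` measurement should come out as `sqrt(3 / 2)` rather than 1, which makes the convention easy to spot on small inputs.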

@jnothman jnothman changed the title [MRG] Fix "StandardScaler that supports option to use unbiased estimator of variance" [MRG] DOC Mention StandardScaler ddof Jan 13, 2019
@qinhanmin2014 qinhanmin2014 merged commit eab7e8b into scikit-learn:master Jan 14, 2019
@MarcoGorelli MarcoGorelli deleted the fix-doc-standardscaler branch January 14, 2019 15:15
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 19, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Development

Successfully merging this pull request may close these issues.

StandardScaler that supports option to use unbiased estimator of variance
3 participants