[MRG] DOC Mention StandardScaler ddof #12950
Conversation
…he estimator of the standard deviation is the biased one
Put it in a Notes section and explain that the choice of ddof is unlikely to affect ML performance
What do you mean by 'Put it in a Notes section'? I've searched 'notes' in the contributing guidelines, but all I can find is a section about 'working notes'.
…d affect model performance.
See https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.StratifiedKFold.html for notes section.
I agree that we should emphasize the difference in the Notes section, since we're not going to modify our implementation.
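To make the difference concrete, here is a minimal sketch (illustrative only, not part of the PR; assumes NumPy and scikit-learn are installed) checking that the `scale_` attribute of `StandardScaler` matches the `ddof=0` standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(20, 3))

scaler = StandardScaler().fit(X)

# scale_ matches the biased (ddof=0) estimate, not the unbiased (ddof=1) one.
print(np.allclose(scaler.scale_, X.std(axis=0, ddof=0)))  # True
print(np.allclose(scaler.scale_, X.std(axis=0, ddof=1)))  # False
```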
sklearn/preprocessing/data.py
Outdated
@@ -478,7 +478,10 @@ class StandardScaler(BaseEstimator, TransformerMixin):

 where `u` is the mean of the training samples or zero if `with_mean=False`,
 and `s` is the standard deviation of the training samples or one if
-`with_std=False`.
+`with_std=False`. Note that `s` is a biased estimator of the standard
+deviation, equivalent to numpy.sqrt(numpy.var(x, ddof=0)), and that it is
use `numpy.std` instead?
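For reference, a quick check (an illustrative sketch, not from the PR) that the two spellings are identical, since `numpy.std` defaults to `ddof=0`:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# numpy.std defaults to ddof=0, so both expressions give the biased estimate.
print(np.sqrt(np.var(x, ddof=0)) == np.std(x, ddof=0))  # True
```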
….std instead of np.var
thanks @MarcoGorelli
sklearn/preprocessing/data.py
Outdated
@@ -574,6 +574,10 @@ class StandardScaler(BaseEstimator, TransformerMixin):
 -----
 NaNs are treated as missing values: disregarded in fit, and maintained in
 transform.
+
+We use a biased estimator for the standard deviation, equivalent to
+`numpy.std(x, ddof=0)`. Note, however, that the choice of `ddof` is
remove however?
Actually, before merge, can we get this note replicated in the `scale` function?
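For context, the `scale` function applies the same standardization as `StandardScaler`, so the same `ddof=0` note applies there; a small illustrative check (not part of the PR):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

rng = np.random.RandomState(0)
X = rng.normal(size=(10, 2))

# scale() and StandardScaler use the same biased (ddof=0) standard deviation,
# so the standardized outputs agree.
print(np.allclose(scale(X), StandardScaler().fit_transform(X)))  # True
```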
This reverts commit 6a85a17.
Reference Issues/PRs
Fixes #7757
What does this implement/fix? Explain your changes.
Expands the documentation so it's clear that the estimate of the standard deviation in StandardScaler is the biased one (equivalent to `numpy.sqrt(numpy.var(x, ddof=0))`).
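As an aside (an illustrative sketch, not part of the PR), the biased and unbiased estimates differ only by the constant factor sqrt(n / (n - 1)), which is why the choice of `ddof` is unlikely to affect model performance for reasonably sized training sets:

```python
import numpy as np

# std(ddof=1) = std(ddof=0) * sqrt(n / (n - 1)); the factor tends to 1 as n grows.
for n in (10, 100, 10_000):
    print(n, np.sqrt(n / (n - 1)))
```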
Any other comments?