DOC Add details to StandardScaler calculation #12446
Conversation
sklearn/preprocessing/data.py
Outdated
@@ -525,8 +531,8 @@ class StandardScaler(BaseEstimator, TransformerMixin):
     Attributes
     ----------
     scale_ : ndarray or None, shape (n_features,)
-        Per feature relative scaling of the data. Equal to ``None`` when
-        ``with_std=False``.
+        Per feature relative scaling of the data. Computed using
"Computed using" is ambiguous. Better to just say "Equal to".
Or maybe you can use `This is calculated using np.sqrt(...`
Hi @robert-dodier and @eamanu, thank you for your suggestions. I ended up choosing `This is calculated using` so as not to repeat `Equal to` in the next sentence. Let me know what you guys think.
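(For context, a minimal sketch of the relationship being documented here, not part of the patch: with default settings and features of non-zero variance, `scale_` is the square root of `var_`. The data values are made up for illustration; zero-variance features are handled specially and would not follow this exactly.)

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: 3 samples, 2 features, both with non-zero variance.
X = np.array([[0.0, 1.0],
              [2.0, 3.0],
              [4.0, 7.0]])

scaler = StandardScaler().fit(X)

# scale_ is the per-feature standard deviation of the training data,
# obtained as the square root of the stored per-feature variance.
print(scaler.scale_)                                     # [1.633... 2.494...]
print(np.allclose(scaler.scale_, np.sqrt(scaler.var_)))  # True
```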
sklearn/preprocessing/data.py
Outdated
    z = (x - mean_) / scale_

But the formula can be different if `with_mean=False` or `with_std=False`.
I see two problems here. (1) There is actually one `mean_` and one `scale_` per column. The documentation should make this explicit. (2) The documentation should say what the different formula is when `with_mean=False`, when `with_std=False`, and when both are false.
I agree with @robert-dodier. But is it necessary to put a lot of implementation detail in the documentation?
I changed `mean_` to `u` and `scale_` to `std` to indicate this is more a high-level explanation and not an actual implementation. I have also added their respective values when `with_mean=False` or `with_std=False`.

I agree that we have one `mean_` and one `scale_` per column, but this is covered at least partially in the next paragraph: *Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set.*
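(To make the flag behaviour discussed in this thread concrete, a small sketch, not taken from the patch; the array values are invented and it relies on `StandardScaler` using the population standard deviation, i.e. `ddof=0`, per feature.)

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 10.0]])

# Default: z = (x - u) / s, with one u and one s per column.
full = StandardScaler().fit(X)
print(np.allclose(full.transform(X), (X - X.mean(axis=0)) / X.std(axis=0)))  # True

# with_mean=False: u is effectively zero, so z = x / s.
no_mean = StandardScaler(with_mean=False).fit(X)
print(np.allclose(no_mean.transform(X), X / X.std(axis=0)))                  # True

# with_std=False: s is effectively one, so z = x - u (scale_ is None here).
no_std = StandardScaler(with_std=False).fit(X)
print(no_std.scale_)                                                         # None
print(np.allclose(no_std.transform(X), X - X.mean(axis=0)))                  # True
```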
@tuliocasagrande Tell me if you need help.
It's ok for me
@tuliocasagrande if @robert-dodier gives the approval, you should edit the PR title to [MRG].
Only nitpicks, thanks @tuliocasagrande
sklearn/preprocessing/data.py
Outdated
    z = (x - u) / std

where `u` is the mean of the population or zero if `with_mean=False`, and
`std` is the standard deviation of the population or one if
I would either use explicit names, `mean` and `std`, or single letters as in math expressions, `u` and `s`, but not a mix.
Good call, @TomDLT. I was initially considering using μ and σ, but I'm not sure how they'd be rendered. I'm sticking to 'u' and 's'.
sklearn/preprocessing/data.py
Outdated
    z = (x - u) / std

where `u` is the mean of the population or zero if `with_mean=False`, and
The term `population` is rather vague. What about `mean of the training samples`?
Sounds better, thanks @TomDLT
Force-pushed from b34cab9 to 5c1e357.
How can I see what is the latest version of the proposed patch? I see 5c1e357 which doesn't seem to contain suggestions that have been made during this discussion. Does 5c1e357 contain everything that is proposed? Is there something else I should look at? Thanks for any info.
@robert-dodier 5c1e357 is indeed the last commit. If you're using GitHub, you can check the previous discussions. Some of them are still unresolved.
I think preprocessing.scale accomplishes a similar task (but doesn't use the Transformer API); maybe it could also be updated with a similar equation?
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html#sklearn.preprocessing.scale
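(A quick sketch of the suggested equivalence, not from the PR itself: with dense input and default parameters, `preprocessing.scale` gives the same result as `StandardScaler.fit_transform`, just without the Transformer API. The array values are made up.)

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 40.0]])

# scale() standardizes in one call; StandardScaler does the same via fit/transform.
print(np.allclose(scale(X), StandardScaler().fit_transform(X)))  # True
```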
LGTM, thanks @tuliocasagrande
For everyone here: feel free to submit a PR to improve the docstring of
Hello!

This addresses some documentation issues raised on #12438:

1- Define the standard scaler formula
2- Make explicit how `scale_` is calculated

Thanks for reviewing this!

Closes #12438