
DOC Add details to StandardScaler calculation #12446


Merged
1 commit merged into scikit-learn:master from tuliocasagrande:scaler on Nov 4, 2018

Conversation

tuliocasagrande
Contributor

Hello!

This addresses some documentation issues raised in #12438:

1- Define the standard scaler formula

2- Make explicit how `scale_` is calculated

Thanks for reviewing this!
Closes #12438

@@ -525,8 +531,8 @@ class StandardScaler(BaseEstimator, TransformerMixin):
Attributes
----------
scale_ : ndarray or None, shape (n_features,)
Per feature relative scaling of the data. Equal to ``None`` when
``with_std=False``.
Per feature relative scaling of the data. Computed using


"Computed using" is ambiguous. Better to just say "Equal to".

Contributor

Or maybe you could use "This is calculated using np.sqrt(...)".

Contributor Author

Hi @robert-dodier and @eamanu, thank you for your suggestions. I ended up choosing "This is calculated using" so as not to repeat "Equal to" in the next sentence. Let me know what you think.
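(Illustrative note, not part of the patch: a minimal sketch on a toy two-column array, showing that the fitted `scale_` matches `np.sqrt(var_)`.)

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features with different spreads
X = np.array([[0.0, 10.0],
              [1.0, 20.0],
              [2.0, 30.0]])

scaler = StandardScaler().fit(X)

# scale_ is the per-feature standard deviation used for scaling,
# i.e. the square root of the fitted var_
print(scaler.var_)    # [ 0.66666667 66.66666667]
print(scaler.scale_)  # [0.81649658  8.16496581]
print(np.allclose(scaler.scale_, np.sqrt(scaler.var_)))  # True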


z = (x - mean_) / scale_

But the formula can be different if `with_mean=False` or `with_std=False`.


I see two problems here. (1) There is actually one mean_ and one scale_ per column; the documentation should make this explicit.
(2) The documentation should say what different formula is used when with_mean=False, when with_std=False, and when both are false.

Contributor

I agree with @robert-dodier. But is it necessary to put so much implementation detail in the documentation?

Contributor Author

I changed mean_ to u and scale_ to std to indicate that this is more of a high-level explanation than the actual implementation. I have also added their respective values when `with_mean=False` or `with_std=False`.

I agree that we have one mean_ and one scale_ per column, but this is covered at least partially in the next paragraph: "Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set."
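(Illustrative note, not part of the patch: a small sketch on a toy two-column array of how the transform behaves with the default settings and with each flag disabled.)

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 10.0],
              [1.0, 20.0],
              [2.0, 30.0]])

# Default: each column is centered by its own mean_ and divided by its own scale_
scaler = StandardScaler().fit(X)
print(np.allclose(scaler.transform(X), (X - scaler.mean_) / scaler.scale_))  # True

# with_mean=False: u is effectively zero, only scaling is applied
no_center = StandardScaler(with_mean=False).fit(X)
print(np.allclose(no_center.transform(X), X / no_center.scale_))  # True

# with_std=False: s is effectively one, only centering is applied
no_scale = StandardScaler(with_std=False).fit(X)
print(np.allclose(no_scale.transform(X), X - no_scale.mean_))  # True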


@eamanu
Contributor

eamanu commented Oct 24, 2018

@tuliocasagrande Tell me if you need help

Contributor

@eamanu eamanu left a comment

It's ok for me

@eamanu
Contributor

eamanu commented Oct 25, 2018

@tuliocasagrande if @robert-dodier approves, you should edit the PR title to [MRG]

Member

@TomDLT TomDLT left a comment

Only nitpicks, thanks @tuliocasagrande

z = (x - u) / std

where `u` is the mean of the population or zero if `with_mean=False`, and
`std` is the standard deviation of the population or one if `with_std=False`.
Member

I would either use explicit names, mean and std, or single letters as in math expressions, u and s, but not a mix.

Contributor Author

Good call, @TomDLT. I was initially considering using μ and σ, but I'm not sure how they'd be rendered. I'm sticking to 'u' and 's'.


z = (x - u) / std

where `u` is the mean of the population or zero if `with_mean=False`, and
Member

The term "population" is rather vague. What about "mean of the training samples"?

Contributor Author

Sounds better, thanks @TomDLT

@robert-dodier

How can I see the latest version of the proposed patch? I see 5c1e357, which doesn't seem to contain suggestions that have been made during this discussion. Does 5c1e357 contain everything that is proposed? Is there something else I should look at? Thanks for any info.

@tuliocasagrande
Contributor Author

@robert-dodier 5c1e357 is indeed the latest commit.
I have rebased, but now that you ask, I realize that made the changes harder to track.

If you're using GitHub, you can check the previous discussions. Some of them are still unresolved.

Contributor

@xhluca xhluca left a comment

I think preprocessing.scale accomplishes a similar task (but doesn't use the Transformer API); maybe it could also be updated with a similar equation?
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html#sklearn.preprocessing.scale
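(Illustrative note, not part of the PR: assuming current scikit-learn, preprocessing.scale produces the same result as StandardScaler().fit_transform, just without keeping a fitted transformer whose mean_/scale_ can be reused on new data.)

import numpy as np
from sklearn.preprocessing import StandardScaler, scale

X = np.array([[0.0, 10.0],
              [1.0, 20.0],
              [2.0, 30.0]])

# scale(X) standardizes in one shot; StandardScaler additionally stores
# mean_ and scale_ so the same transformation can be applied to new data
print(np.allclose(scale(X), StandardScaler().fit_transform(X)))  # True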

Member

@qinhanmin2014 qinhanmin2014 left a comment

LGTM, thanks @tuliocasagrande

@qinhanmin2014 qinhanmin2014 merged commit a028416 into scikit-learn:master Nov 4, 2018
@qinhanmin2014
Member

For everyone here: feel free to submit a PR to improve the docstring of preprocessing.scale.

@tuliocasagrande tuliocasagrande deleted the scaler branch November 7, 2018 15:00