
DOC Add details to StandardScaler calculation #12446


Merged
1 commit merged into scikit-learn:master from tuliocasagrande:scaler on Nov 4, 2018

Conversation

tuliocasagrande
Contributor

Hello!

This addresses some documentation issues raised in #12438:

1- Define the standard scaler formula

2- Make explicit how `scale_` is calculated

Thanks for reviewing this!
Closes #12438

@@ -525,8 +531,8 @@ class StandardScaler(BaseEstimator, TransformerMixin):
Attributes
----------
scale_ : ndarray or None, shape (n_features,)
Per feature relative scaling of the data. Equal to ``None`` when
``with_std=False``.
Per feature relative scaling of the data. Computed using


"Computed using" is ambiguous. Better to just say "Equal to".

Contributor

Or maybe you could use "This is calculated using np.sqrt(...)".

Contributor Author

Hi @robert-dodier and @eamanu, thank you for your suggestions. I ended up choosing "This is calculated using" so as not to repeat "Equal to" in the next sentence. Let me know what you think.
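(Illustrative note, not part of the patch: a minimal sketch on a toy two-column array, showing that the fitted `scale_` matches `np.sqrt(var_)`.)

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features with different spreads
X = np.array([[0.0, 10.0],
              [1.0, 20.0],
              [2.0, 30.0]])

scaler = StandardScaler().fit(X)

# scale_ is the per-feature standard deviation used for scaling,
# i.e. the square root of the fitted var_
print(scaler.var_)    # [ 0.66666667 66.66666667]
print(scaler.scale_)  # [0.81649658  8.16496581]
print(np.allclose(scaler.scale_, np.sqrt(scaler.var_)))  # True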


z = (x - mean_) / scale_

But the formula can be different if `with_mean=False` or `with_std=False`.


I see two problems here. (1) There is actually one mean_ and one scale_ per column; the documentation should make this explicit.
(2) The documentation should say what different formula is used when with_mean=False, when with_std=False, and when both are false.

Contributor

I agree with @robert-dodier. But is it necessary to put so much implementation detail in the documentation?

Contributor Author

I changed mean_ to u and scale_ to std to indicate that this is more of a high-level explanation than the actual implementation. I have also added their respective values when `with_mean=False` or `with_std=False`.

I agree that we have one mean_ and one scale_ per column, but this is covered at least partially in the next paragraph: "Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set."
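(Illustrative note, not part of the patch: a small sketch on a toy two-column array of how the transform behaves with the default settings and with each flag disabled.)

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 10.0],
              [1.0, 20.0],
              [2.0, 30.0]])

# Default: each column is centered by its own mean_ and divided by its own scale_
scaler = StandardScaler().fit(X)
print(np.allclose(scaler.transform(X), (X - scaler.mean_) / scaler.scale_))  # True

# with_mean=False: u is effectively zero, only scaling is applied
no_center = StandardScaler(with_mean=False).fit(X)
print(np.allclose(no_center.transform(X), X / no_center.scale_))  # True

# with_std=False: s is effectively one, only centering is applied
no_scale = StandardScaler(with_std=False).fit(X)
print(np.allclose(no_scale.transform(X), X - no_scale.mean_))  # True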


@eamanu
Contributor

eamanu commented Oct 24, 2018

@tuliocasagrande Tell me if you need help

Contributor

@eamanu eamanu left a comment

It's ok for me

@eamanu
Contributor

eamanu commented Oct 25, 2018

@tuliocasagrande if @robert-dodier approves, you should edit the PR title to [MRG]

Member

@TomDLT TomDLT left a comment

Only nitpicks, thanks @tuliocasagrande

z = (x - u) / std

where `u` is the mean of the population or zero if `with_mean=False`, and
`std` is the standard deviation of the population or one if `with_std=False`.
Member

I would either use explicit names, mean and std, or single letters as in math expressions, u and s, but not a mix.

Contributor Author

Good call, @TomDLT. I was initially considering using μ and σ, but I'm not sure how they'd be rendered. I'm sticking to 'u' and 's'.


z = (x - u) / std

where `u` is the mean of the population or zero if `with_mean=False`, and
Member

The term "population" is rather vague. What about "mean of the training samples"?

Contributor Author

Sounds better, thanks @TomDLT

@robert-dodier

How can I see the latest version of the proposed patch? I see 5c1e357, which doesn't seem to contain suggestions that have been made during this discussion. Does 5c1e357 contain everything that is proposed? Is there something else I should look at? Thanks for any info.

@tuliocasagrande
Contributor Author

@robert-dodier 5c1e357 is indeed the latest commit.
I have rebased, but now that you ask, I realize that made the changes harder to track.

If you're using GitHub, you can check the previous discussions. Some of them are still unresolved.

Contributor

@xhluca xhluca left a comment

I think preprocessing.scale accomplishes a similar task (but doesn't use the Transformer API); maybe it could also be updated with a similar equation?
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html#sklearn.preprocessing.scale
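(Illustrative note, not part of the PR: assuming current scikit-learn, preprocessing.scale produces the same result as StandardScaler().fit_transform, just without keeping a fitted transformer whose mean_/scale_ can be reused on new data.)

import numpy as np
from sklearn.preprocessing import StandardScaler, scale

X = np.array([[0.0, 10.0],
              [1.0, 20.0],
              [2.0, 30.0]])

# scale(X) standardizes in one shot; StandardScaler additionally stores
# mean_ and scale_ so the same transformation can be applied to new data
print(np.allclose(scale(X), StandardScaler().fit_transform(X)))  # True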

Member

@qinhanmin2014 qinhanmin2014 left a comment

LGTM, thanks @tuliocasagrande

@qinhanmin2014 qinhanmin2014 merged commit a028416 into scikit-learn:master Nov 4, 2018
@qinhanmin2014
Member

For everyone here: feel free to submit a PR to improve the docstring of preprocessing.scale.

@tuliocasagrande tuliocasagrande deleted the scaler branch November 7, 2018 15:00