StandardScaler that supports option to use unbiased estimator of variance #7757

Closed
jirispilka opened this issue Oct 26, 2016 · 29 comments · Fixed by #12950
Labels
Easy (well-defined and straightforward way to resolve) · good first issue (easy, with clear instructions to resolve) · help wanted

Comments

@jirispilka

Description

It would be nice to provide StandardScaler with an option to choose an unbiased estimator of variance.

It is sometimes convenient to use the unbiased estimator of variance, numpy.var(X, ddof=1),
instead of the default, numpy.var(X, ddof=0).

So far, StandardScaler does not allow setting ddof.
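
For concreteness, here is a minimal sketch of the difference (the numbers in the comments assume the toy data shown):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Biased (maximum-likelihood) estimate: divides by n.
print(np.var(X, ddof=0))   # 1.25

# Unbiased estimate: divides by n - 1 (the default in MATLAB's var and R's var).
print(np.var(X, ddof=1))   # 1.666...

# StandardScaler's scale_ corresponds to the ddof=0 standard deviation.
scaler = StandardScaler().fit(X)
print(scaler.scale_)       # [1.118...] == sqrt(1.25)
```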

Versions

Linux-4.4.0-45-generic-x86_64-with-Ubuntu-16.04-xenial
Python: 2.7.12 (default, Jul 1 2016, 15:12:24) [GCC 5.4.0 20160609]
NumPy: 1.11.2
SciPy: 0.18.1
Scikit-Learn: 0.19.dev0

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 26, 2016 via email

@jirispilka
Author

jirispilka commented Oct 26, 2016

No problem.
Just to tell the whole story: I moved my 'pipeline' code from Matlab to scikit-learn and spent a couple of hours trying to figure out why Matlab and scikit-learn produce different results (selected features and classification performance).

It was due to the normalization in StandardScaler.
By suggesting this option, I was just trying to save someone else those couple of hours.

It seems to me that it's one of those things that people (academics in particular) love to argue about, but that has no impact on something like predictive accuracy in realistic settings.

Yes and no :).
I'm sorry, I'm new here and do not want to start an academic discussion. This was only a suggestion.

You can leave it as it is. I have created my own dirty StandardScaler :)
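
(For anyone landing here: a "dirty" workaround along these lines is easy to write. The sketch below is hypothetical, not the author's actual code; the class name and the ddof default are my own.)

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class DDofStandardScaler(BaseEstimator, TransformerMixin):
    """Standardize features with a configurable ddof (hypothetical sketch)."""

    def __init__(self, ddof=1):
        self.ddof = ddof

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # ddof=1 gives the unbiased variance estimate, as in MATLAB/R.
        self.scale_ = X.std(axis=0, ddof=self.ddof)
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_
```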

@amueller
Member

@jirispilka just out of curiosity: how many samples do you have, that this made a difference?

@jirispilka
Author

This dataset: https://www.kaggle.com/c/melbourne-university-seizure-prediction
Seizure prediction from EEG.

Feature matrix: ~2350 × 500 (instances × features),
with 2200 normal and 150 pathological instances.

Logistic regression (LR) with l1 penalty and 3-fold GroupKFold CV; C was kept small to promote sparsity in the feature vector (a sketch of this kind of setup follows below).

My conclusion was that, as the data are high-dimensional, a small change in normalization causes different features to be selected, with different performance.
The bad thing is that I don't know which solution (standardization with ddof=0 or ddof=1) was more right or wrong. From theory alone, ddof should be 1, as the population mean is not known.

From a user's point of view, I would like an easy option to choose ddof in StandardScaler.
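
A sketch of this kind of setup, for illustration only (the random data, group assignment, and C value are assumptions, not the original experiment):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(2350, 500)                 # ~2350 instances x 500 features
y = rng.rand(2350) < 150 / 2350.0        # ~150 positive (pathological) labels
groups = rng.randint(0, 30, size=2350)   # e.g. one group per recording/subject

# Small C promotes sparsity under the l1 penalty.
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(penalty='l1', C=0.05, solver='liblinear'))
scores = cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(n_splits=3))
print(scores)
```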

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 27, 2016 via email

@amueller
Member

Yeah, if a change in scale of 0.0004 changes your outcome drastically, your method is probably not very robust.
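
(For context, my arithmetic rather than anything stated in the thread: with n ≈ 2350 samples, the two variance estimates differ by a factor of n/(n−1), which is where a number of this size comes from.)

```python
n = 2350
print(n / (n - 1) - 1)            # ~0.000426: relative difference in variance
print((n / (n - 1)) ** 0.5 - 1)   # ~0.000213: relative difference in scale (std)
```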

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 27, 2016 via email

@jirispilka
Author

OK, thanks for the info.
I'm sorry for the spam.

@atolpin

atolpin commented Jan 2, 2019

It's true that using a slightly biased stdev estimator does not by itself make much of a difference, and if it does, one should look into making the model more robust. The bigger problem is that I was trying to reproduce some results I had obtained earlier in R and could not. Yes, the differences are small and do not really impact the predictive power. What they do impact is the time spent trying to understand where that difference comes from in situations where one expects none, just in case there is a bigger issue lurking underneath. Thanks.

@jnothman
Member

jnothman commented Jan 2, 2019 via email

@lesshaste

FWIW, this seems worth fixing to me. Although it makes little difference in practice, it makes sense to be correct if we can be, and it will carry on causing confusion for anyone trying to double-check their results.

@albertcthomas
Contributor

Why not make this clear in the docstring of StandardScaler, e.g. in the Notes section?
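
For instance, the Notes section could state something like this (one possible wording, not necessarily the text that was eventually merged):

```python
# Possible addition to the Notes section of StandardScaler's docstring:
#
# Notes
# -----
# We use a biased estimator for the standard deviation, equivalent to
# ``numpy.std(x, ddof=0)``. Note that the choice of ``ddof`` is unlikely
# to affect model performance.
```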

@qinhanmin2014
Member

So there's consensus to improve the doc.
I'll vote +1 to at least provide an option to use unbiased estimation; it will make it easier for users to reproduce results from R.
Reopening this one and closing the new issue.

@qinhanmin2014 reopened this on Jan 3, 2019
@atolpin

atolpin commented Jan 3, 2019 via email

@albertcthomas
Contributor

albertcthomas commented Jan 3, 2019

Looking at the previous comments on this issue (basically, that this makes little difference in practice), I don't know if providing the unbiased option is useful. But clarifying the documentation would let users know which estimator is used and avoid confusion.

Eventually, you can default to unbiased option.

Defaulting to the unbiased option would require a deprecation cycle, and I don't think that is worth it, considering again that this makes little difference in practice.

@jnothman
Member

jnothman commented Jan 3, 2019

Having a ddof parameter would certainly make our difference clearer, even if it's not very useful in practice in an ML context.

@GaelVaroquaux
Member

GaelVaroquaux commented Jan 3, 2019 via email

@qinhanmin2014
Member

Having a ddof parameter would certainly make our difference clearer, even if it's not very useful in practice in an ML context

+1, so there's +2 and no -1. Tagging as help wanted.

@qinhanmin2014 added the Easy, good first issue, and help wanted labels on Jan 4, 2019
@qinhanmin2014
Member

Eventually, you can default to unbiased option.

Adding a parameter seems enough from my side.

@GaelVaroquaux
Member

GaelVaroquaux commented Jan 4, 2019 via email

@qinhanmin2014
Member

That was a -1 from my side. Sorry if I was not clear. As far as I am concerned, this is feature creep.

Sorry, let's first update the doc.

@jnothman
Member

jnothman commented Jan 6, 2019 via email

@MarcoGorelli
Contributor

Hi all,

I'm new to contributing, though I have been using sklearn for over a year.

This issue is still open, but from the comments it's not clear to me what you want to happen. Should I work on adding a ddof parameter? Would this be a good starting point?

@albertcthomas
Contributor

albertcthomas commented Jan 9, 2019

Should I work on adding a ddof parameter? Would this be a good starting point?

@MarcoGorelli Only the doc needs to be clarified.

@MarcoGorelli
Contributor

OK, thanks. I wasn't sure how much consensus there was over what needs to be done.

Is it OK if I try clarifying the doc, then?

@jnothman
Member

jnothman commented Jan 10, 2019 via email

@MarcoGorelli
Contributor

OK, done.

@atolpin

atolpin commented Feb 7, 2019 via email

@qinhanmin2014
Member

@atolpin You can always use sorted(sklearn.metrics.SCORERS.keys()) to get the available scorers. Please do not post unrelated comments in an issue.
