StandardScaler that supports option to use unbiased estimator of variance #7757
Comments
Does it really matter? It seems to me that it's one of those things that people (academics in particular) love to argue about, but that has no impact on something like predictive accuracy in realistic settings.
PS: don't get me wrong, I am an academic. I love to argue. But I try to do that in papers, not software.
|
No problem. It was due to the normalization in StandardScaler.
Yes and no :). You can leave it as it is. I have created my own dirty StandardScaler :) |
@spilkjir just out of curiosity: how many samples do you have, that this made a difference? |
This dataset: https://www.kaggle.com/c/melbourne-university-seizure-prediction. Feature matrix of ~[2350x500] instances x features; logistic regression (LR) with l1 penalty, 3-GroupKFold CV (C was kept small to promote sparsity in the feature vector). My conclusion was that, as the data are high dimensional, a small change in normalization causes different features to be selected, with different performance. From a user's point of view, I would like an easy option to choose ddof in StandardScaler. |
My conclusion was that, as the data are high dimensional, a small change in normalization causes different features to be selected, with different performance.
The bad thing is that I don't know which solution (standardization with ddof=0 or ddof=1) was more right or wrong. From theory alone, ddof should be 1, as the population mean is not known. |
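For scale, here is a minimal sketch (assuming NumPy; the sample count is borrowed from the dataset above) of how large the ddof=0 vs ddof=1 discrepancy actually is:

```python
import numpy as np

n = 2350  # sample count from the Kaggle dataset mentioned above
X = np.random.RandomState(0).randn(n, 3)

std_biased = X.std(axis=0, ddof=0)    # what StandardScaler uses internally
std_unbiased = X.std(axis=0, ddof=1)  # the unbiased estimator (e.g. R's sd())

# The two always differ by the constant factor sqrt(n / (n - 1)):
print(std_unbiased / std_biased)   # ~1.000213 for every column when n = 2350
print(np.sqrt(n / (n - 1.0)))
```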
Neither is right or wrong: l1 penalties for variable selection are unstable and do not guarantee any form of control on recovery. If you're interested in that, I would rather advise you to bootstrap the data and look at what's stable.
|
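A rough sketch of the bootstrap-stability check suggested above; the data, model settings, and the 80% threshold are all illustrative, not from the thread:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(200, 50)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.randn(200) > 0).astype(int)

n_bootstraps = 100
selection_counts = np.zeros(X.shape[1])
for _ in range(n_bootstraps):
    idx = rng.randint(0, len(y), len(y))  # resample rows with replacement
    Xb = StandardScaler().fit_transform(X[idx])
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(Xb, y[idx])
    selection_counts += (clf.coef_.ravel() != 0)

# Features that survive most resamples are the ones worth trusting.
print(np.flatnonzero(selection_counts / n_bootstraps > 0.8))
```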
Yeah, if a change in scale of 0.0004 changes your outcome drastically, your method is probably not very robust. |
IMHO, any difference that it creates in your results is within the error bars (and they may be very large in high-dimensional settings), and you shouldn't trust results within these error bars. They will probably not be reproduced on new data.
|
ok, thanks for the info. |
It's true that using a slightly biased stdev estimator does not by itself make much of a difference, and if it does, one should look into how to make the model more robust. The bigger problem is that I was trying to reproduce some results that I had obtained earlier in R and could not. Yes, the differences are small and do not really impact the predictive power. What it does impact is the time spent trying to understand where that difference comes from in situations where one expects none, just in case there is a bigger issue lurking underneath. Thanks. |
@atolpin, could our documentation be improved? Please offer a PR to clarify
the docs.
|
Fwiw this seems worth fixing to me. Although it makes little difference in practice, it makes sense to be correct if we can be, and otherwise it will carry on causing confusion for anyone trying to double-check their results. |
Why not make this clear in the docstring of StandardScaler, e.g. in the Notes section? |
So there's consensus to improve the doc. |
I think that makes a lot of sense. Providing an unbiased estimator option would also keep it backward compatible. Eventually, you could default to the unbiased option.
|
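For concreteness, the backward-compatible proposal would look roughly like the sketch below; note that ddof here is a hypothetical parameter, not part of StandardScaler's real API:

```python
from sklearn.preprocessing import StandardScaler

# Hypothetical, backward-compatible API (ddof is NOT a real parameter):
# a default of ddof=0 would keep today's behavior, so existing code is
# unchanged, while ddof=1 would opt in to the unbiased variance estimator.
scaler = StandardScaler()          # current behavior, equivalent to ddof=0
# scaler = StandardScaler(ddof=1)  # the proposed opt-in, commented out
#                                  # because it does not exist in scikit-learn
```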
I don't know if providing the unbiased option is useful, given the previous comments in this issue (basically, that it makes little difference in practice). But clarifying the documentation would let users know which estimator is used and avoid confusion.
Defaulting to the unbiased option would require a deprecation cycle, and I don't think this is worth it, considering again that it makes little difference in practice. |
Having a ddof parameter would certainly make our difference clearer, even if it's not very useful in practice in an ML context |
Having a ddof parameter would certainly make our difference clearer, even if it's not very useful in practice in an ML context
I'm not enthusiastic about it: it's feature creep. Aiming for 100% compatibility with other statistical systems can take us far in terms of multiplication of options, for a user benefit that I don't really understand, given that some of these choices are arbitrary.
|
+1, so there's +2 and no -1. Tagging as help wanted. |
Adding a parameter seems enough from my side. |
Having a ddof parameter would certainly make our difference clearer, even if it's not very useful in practice in an ML context
+1, so there's +2 and no -1. Tagging as help wanted.
That was a -1 from my side. Sorry if I was not clear. As far as I am concerned, this is feature creep.
|
Sorry, let's first update the doc. |
TBH, I'm not keen on this in order to be comparable to R. I suggested adding a parameter as a form of blatant documentation, really. Which is, of course, not what functionality should be for, but I do think it makes for literacy. Let's stick to a documentation change.
|
Hi all, I'm new to contributing, though have been using sklearn for over a year. This issue is still open, though from the comments it's not clear to me what you want to happen. Should I work on adding a ddof parameter? Would this be a good starting point? |
@MarcoGorelli Only the doc needs to be clarified. |
Ok, thanks, I wasn't sure how much consensus there was over what needs to be done. Is it ok if I try clarifying the doc then? |
PR welcome
|
Ok, done |
One thing would be great: if the documentation for scoring in cross_val_score included a link to https://scikit-learn.org/stable/modules/model_evaluation.html, I would not have to search for the list of available strings every time.
So far, that's my only complaint. In general I find the documentation well written.
Thanks,
Anatoly
|
@atolpin You can always use |
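As a hedged aside for readers on a recent scikit-learn: the valid scoring strings can also be listed programmatically (get_scorer_names was added in scikit-learn 1.0, long after this thread):

```python
# Assumes scikit-learn >= 1.0, where get_scorer_names was introduced.
from sklearn.metrics import get_scorer_names

print(sorted(get_scorer_names()))  # every string accepted by scoring=...
```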
Description
It would be nice to provide StandardScaler with an option to choose an unbiased estimator of variance.
It is sometimes convenient to use the unbiased estimator of variance, numpy.var(X, ddof=1), instead of the default, numpy.var(X, ddof=0).
So far, StandardScaler does not allow setting ddof.
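A minimal workaround sketch, assuming one wants unbiased scaling today; the standardize helper below is illustrative and not part of scikit-learn:

```python
import numpy as np

def standardize(X, ddof=1):
    """Center and scale columns like StandardScaler, but with a chosen ddof."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=ddof)

X = np.arange(12.0).reshape(4, 3)
print(standardize(X))          # unbiased scaling, matching R's scale()
print(standardize(X, ddof=0))  # matches StandardScaler().fit_transform(X)
```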
Versions
Linux-4.4.0-45-generic-x86_64-with-Ubuntu-16.04-xenial
Python 2.7.12 (default, Jul 1 2016, 15:12:24) [GCC 5.4.0 20160609]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.19.dev0