add standard deviation calculation for linear regression #8872

Closed
wants to merge 4 commits into from
Conversation

jl2922
@jl2922 jl2922 commented May 13, 2017

The standard deviation gives the half-width of the 68% confidence interval of the estimated coefficients. This PR adds standard deviation calculation for the coefficients in the linear regression model.

The computational method is described here.
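For context, the textbook closed-form standard errors that such a feature would compute (an illustrative sketch with simulated data, not the PR's actual code) follow from Var(β̂) = σ²(XᵀX)⁻¹ with σ² estimated as RSS/(n − p):

```python
import numpy as np

# Simulated design with an intercept column.
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Fit by ordinary least squares.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Unbiased residual variance estimate: RSS / (n - p).
resid = y - X @ beta
sigma2 = resid @ resid / (n - p)

# Standard errors: sqrt of the diagonal of sigma2 * (X^T X)^{-1}.
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
```

In practice one would avoid forming the explicit inverse and reuse the SVD factors from the fit, but the formula is the same.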

#8870
@glemaitre

@jl2922
Author

jl2922 commented May 13, 2017

Several tests in sklearn/linear_model/tests/test_logistic.py appear flanky. Shall we disable these?

@glemaitre
Member

Not really sure what you mean by "flanky". The tests are here to ensure that the API does not get broken. In this regard, it seems that they are doing their jobs right now ;)
Double-check that you did not make changes that break backward compatibility.

@agramfort
Member

this feature has been discussed in the past, and it was decided that people should use statsmodels for this, not sklearn, which targets prediction.

if we still agree on this I am -1 on this PR.

cc @GaelVaroquaux

@jl2922
Author

jl2922 commented May 14, 2017

@glemaitre Sorry, I meant 'flaky'. If you run these tests with the latest merged commit, they will also fail sometimes. All the failed ones involve comparing the results of two different methods, neither of which gives the exact answer. And I don't think it is good practice to compare the results of two functions against each other in tests; rather, each of them should be compared with a theoretically correct result.

@jl2922
Author

jl2922 commented May 14, 2017

@agramfort Regarding whether this feature is suitable or necessary, I just want to point out that statsmodels does not even give the correct answer for the coefficients, and their standard deviations are therefore almost meaningless. That is why I ended up writing my own version for my work, which requires serious statistical results. At least I couldn't find any package online that offers all of the following features:

  1. weighted OLS
  2. with standard deviation (t test)
  3. give correct results
  4. multivariate
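The combination asked for above (weighted OLS with standard errors and t statistics) has a closed form as well. A minimal NumPy sketch, assuming the classical WLS model where Var(y_i) is proportional to 1/w_i (illustrative only, not the PR's implementation):

```python
import numpy as np

def wls_with_se(X, y, w):
    """Weighted OLS: coefficients and their standard errors.

    Assumes Var(y_i) proportional to 1/w_i (the classical WLS model):
    beta = (X^T W X)^{-1} X^T W y, Var(beta) = s^2 (X^T W X)^{-1}.
    """
    n, p = X.shape
    XtW = X.T * w                        # X^T W without a diagonal matrix
    XtWX_inv = np.linalg.inv(XtW @ X)
    beta = XtWX_inv @ (XtW @ y)
    resid = y - X @ beta
    sigma2 = (w * resid**2).sum() / (n - p)  # weighted residual variance
    se = np.sqrt(sigma2 * np.diag(XtWX_inv))
    return beta, se

# Simulated heteroscedastic data consistent with the weights.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w = rng.uniform(0.5, 2.0, size=n)
y = X @ np.array([3.0, -2.0]) + rng.normal(size=n) / np.sqrt(w)

beta, se = wls_with_se(X, y, w)
t = beta / se  # t statistics for coefficient tests
```

statsmodels exposes the same quantities as `WLS(y, X, weights=w).fit().params` / `.bse` / `.tvalues` (see the discussion below on its accuracy).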

@glemaitre
Member

If you run these tests with the latest merged commit, they will also fail sometimes.

If you could raise an issue with those tests so that we can reproduce the failures and change the tests accordingly, that would be great.

All the failed ones involve comparing the results of two different methods, neither of which gives the exact answer. And I don't think it is good practice to compare the results of two functions against each other in tests; rather, each of them should be compared with a theoretically correct result.

I partially agree, since it can depend on the context (if the two functions are supposed to return the same thing, you can actually check the outputs against each other and additionally check the theoretical result).

@agramfort
Member

I still think that for community coordination it's better to fix in statsmodels than doing it here ...

cc @josef-pkt thoughts?

@josef-pkt

I just want to point out that statsmodels does not even give the correct answer for the coefficients, and their standard deviations are therefore almost meaningless,

a bug report to statsmodels would be helpful.
However, our test coverage is pretty good, so I don't think this is the case. There might be different defaults or options.

The scope of scikit-learn is not my issue.
However, standard errors for parameter estimates and inference are only relatively straightforward in the traditional statistics cases. Inference after regularization, and for machine learning methods with variable selection, is a difficult issue. For example, for LASSO there have been only recent theoretical developments to derive appropriate standard errors and inference. The usual standard errors and confidence intervals don't take regularization and variable selection into account, but often those "simplified" standard errors are all we have.
The plan is to improve and include better or more inference for those cases in statsmodels in the future, but it's slow going.

The only generic approach would be to compute bootstrap standard errors or inference over the entire algorithm, but that still relies on choosing the right bootstrap if either heteroscedasticity or correlation between observations is present. (It also might not be valid for all algorithms, but I don't have an overview there.)
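The pairs (case) bootstrap mentioned above can be sketched generically: resample rows of (X, y) with replacement, re-fit, and take the spread of the re-fitted coefficients. This is an illustrative sketch with a hypothetical `bootstrap_se` helper, under the i.i.d.-observations assumption (it would need a different resampling scheme for correlated data):

```python
import numpy as np

def bootstrap_se(X, y, fit, n_boot=500, seed=0):
    """Pairs bootstrap: resample (X, y) rows, re-fit, take coefficient std."""
    rng = np.random.default_rng(seed)
    n = len(y)
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        coefs[b] = fit(X[idx], y[idx])
    return coefs.std(axis=0, ddof=1)

# Any fitting routine works; here plain OLS.
ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

se_boot = bootstrap_se(X, y, ols)
```

For OLS with i.i.d. noise this should land near the analytic standard errors; its appeal is that `fit` can be an entire pipeline, including variable selection, at the cost of re-fitting `n_boot` times.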

@josef-pkt

I checked the test case from this PR with statsmodels OLS and WLS

for OLS absolute and relative difference

>>> res.bse - [2.10596119117, 3.97159995702e-05, 1.48683754360e-06]
array([  4.13202805e-11,   8.17535872e-16,   2.63994759e-17])
>>> res.bse / [2.10596119117, 3.97159995702e-05, 1.48683754360e-06] - 1
array([  1.96205274e-11,   2.05846451e-11,   1.77553527e-11])

for WLS absolute and relative difference

>>> resw.bse - [6171.11372672, 45032.21987029]
array([ -1.07138476e-09,   1.31694833e-09])
>>> resw.bse / [6171.11372672, 45032.21987029] - 1
array([ -1.73638881e-13,   2.93098879e-14])

There are differences at float64 precision between packages because different linear algebra routines are used, but SVD, which both scikit-learn and statsmodels use, is supposed to be the most numerically stable, most likely more stable than the pivoting QR that base R uses.
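To illustrate the SVD route being referred to (a sketch of the standard pseudoinverse construction, not either library's internal code): the least-squares solution is built from X = U S Vᵀ with tiny singular values zeroed out, which is exactly what `np.linalg.pinv` does.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

# Least squares via SVD: X = U S V^T, beta = V S^+ U^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
tol = s.max() * max(X.shape) * np.finfo(s.dtype).eps  # rank cutoff
s_inv = np.where(s > tol, 1.0 / s, 0.0)               # zero tiny singular values
beta_svd = Vt.T @ (s_inv * (U.T @ y))

# The pinv shortcut gives the same answer.
beta_pinv = np.linalg.pinv(X) @ y
```

Zeroing the small singular values is what keeps the solution stable for near-singular design matrices, where naive normal-equation solvers lose the most precision.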

@josef-pkt

And to illustrate:
We haven't had any complaints for linear regression except for the singular design matrix case. The use of pinv/SVD hasn't changed since the beginning of statsmodels.

Here is one of my investigations into edge cases for linear regression, from 2012:
https://jpktd.blogspot.ca/2012/03/numerical-accuracy-in-linear-least.html
using the NIST test case that caused the most problems for statistical packages.

@jl2922
Author

jl2922 commented May 15, 2017

It turns out that I forgot to add a constant to x when I was trying statsmodels, since in the other stats packages I have used, the default is to include the constant. Sorry for the wrong message, and I appreciate the use of SVD @josef-pkt
