add standard deviation calculation for linear regression #8872

Closed
wants to merge 4 commits into from
Conversation

jl2922
@jl2922 jl2922 commented May 13, 2017

The standard deviation gives the half-width of the 68% confidence interval of the estimated coefficients. This PR adds standard deviation calculation for the coefficients in the linear regression model.

The computational method is described here.
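For context, the textbook closed-form standard errors that such a feature would compute (an illustrative sketch with simulated data, not the PR's actual code) follow from Var(β̂) = σ²(XᵀX)⁻¹ with σ² estimated as RSS/(n − p):

```python
import numpy as np

# Simulated design with an intercept column.
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Fit by ordinary least squares.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Unbiased residual variance estimate: RSS / (n - p).
resid = y - X @ beta
sigma2 = resid @ resid / (n - p)

# Standard errors: sqrt of the diagonal of sigma2 * (X^T X)^{-1}.
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
```

In practice one would avoid forming the explicit inverse and reuse the SVD factors from the fit, but the formula is the same.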

#8870
@glemaitre

@jl2922
Author

jl2922 commented May 13, 2017

Several tests in sklearn/linear_model/tests/test_logistic.py appear flanky. Shall we disable these?

@glemaitre
Member

Not really sure what you mean by "flanky". The tests are here to ensure that the API does not get broken. In this regard, it seems that they are doing their jobs right now ;)
Double-check that you did not make changes that break backward compatibility.

@agramfort
Member

this feature has been discussed in the past, and it was decided that people should use statsmodels for this, not sklearn, which targets prediction.

if we still agree on this I am -1 on this PR.

cc @GaelVaroquaux

@jl2922
Author

jl2922 commented May 14, 2017

@glemaitre Sorry, I meant 'flaky'. If you run these tests with the latest merged commit, they will also fail sometimes. All the failed ones involve comparing the results of two different methods, neither of which gives the exact answer. And I don't think it is good practice to compare the results of two functions against each other in tests; rather, each of them should be compared with a theoretically correct result.

@jl2922
Author

jl2922 commented May 14, 2017

@agramfort Regarding whether this feature is suitable or necessary, I just want to point out that statsmodels does not even give the correct answer for the coefficients, and their standard deviations are therefore almost meaningless. That is why I ended up writing my own version for my work, which requires serious statistical results. At least I couldn't find any package online that offers all of the following features:

  1. weighted OLS
  2. with standard deviation (t test)
  3. give correct results
  4. multivariate
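The combination asked for above (weighted OLS with standard errors and t statistics) has a closed form as well. A minimal NumPy sketch, assuming the classical WLS model where Var(y_i) is proportional to 1/w_i (illustrative only, not the PR's implementation):

```python
import numpy as np

def wls_with_se(X, y, w):
    """Weighted OLS: coefficients and their standard errors.

    Assumes Var(y_i) proportional to 1/w_i (the classical WLS model):
    beta = (X^T W X)^{-1} X^T W y, Var(beta) = s^2 (X^T W X)^{-1}.
    """
    n, p = X.shape
    XtW = X.T * w                        # X^T W without a diagonal matrix
    XtWX_inv = np.linalg.inv(XtW @ X)
    beta = XtWX_inv @ (XtW @ y)
    resid = y - X @ beta
    sigma2 = (w * resid**2).sum() / (n - p)  # weighted residual variance
    se = np.sqrt(sigma2 * np.diag(XtWX_inv))
    return beta, se

# Simulated heteroscedastic data consistent with the weights.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w = rng.uniform(0.5, 2.0, size=n)
y = X @ np.array([3.0, -2.0]) + rng.normal(size=n) / np.sqrt(w)

beta, se = wls_with_se(X, y, w)
t = beta / se  # t statistics for coefficient tests
```

statsmodels exposes the same quantities as `WLS(y, X, weights=w).fit().params` / `.bse` / `.tvalues` (see the discussion below on its accuracy).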

@glemaitre
Member

If you run these tests with the latest merged commit, they will also fail sometimes.

If you could raise an issue with those tests so that we can reproduce the failures and change the tests accordingly, that would be great.

All the failed ones involve comparing the results of two different methods, neither of which gives the exact answer. And I don't think it is good practice to compare the results of two functions against each other in tests; rather, each of them should be compared with a theoretically correct result.

I partially agree, since it can depend on the context (if the two functions are supposed to return the same thing, you can actually check the outputs against each other and additionally check the theoretical result).

@agramfort
Member

I still think that for community coordination it's better to fix in statsmodels than doing it here ...

cc @josef-pkt thoughts?

@josef-pkt

I just want to point out that statsmodels does not even give the correct answer for the coefficients, and their standard deviations are therefore almost meaningless,

a bug report to statsmodels would be helpful.
However, our test coverage is pretty good, so I don't think this is the case. There might be different defaults or options.

The scope of scikit-learn is not my issue.
However, standard errors for parameter estimates and inference are only relatively straightforward in the traditional statistics cases. Inference after regularization, and for machine learning methods with variable selection, is a difficult issue. For example, for LASSO there have been only recent theoretical developments to derive appropriate standard errors and inference. The usual standard errors and confidence intervals don't take regularization and variable selection into account, but often those "simplified" standard errors are all we have.
The plan is to improve and include better or more inference for those cases in statsmodels in the future, but it's slow going.

The only generic approach would be to compute bootstrap standard errors or inference over the entire algorithm, but that still relies on choosing the right bootstrap if either heteroscedasticity or correlation between observations is present. (It also might not be valid for all algorithms, but I don't have an overview there.)
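The pairs (case) bootstrap mentioned above can be sketched generically: resample rows of (X, y) with replacement, re-fit, and take the spread of the re-fitted coefficients. This is an illustrative sketch with a hypothetical `bootstrap_se` helper, under the i.i.d.-observations assumption (it would need a different resampling scheme for correlated data):

```python
import numpy as np

def bootstrap_se(X, y, fit, n_boot=500, seed=0):
    """Pairs bootstrap: resample (X, y) rows, re-fit, take coefficient std."""
    rng = np.random.default_rng(seed)
    n = len(y)
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        coefs[b] = fit(X[idx], y[idx])
    return coefs.std(axis=0, ddof=1)

# Any fitting routine works; here plain OLS.
ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

se_boot = bootstrap_se(X, y, ols)
```

For OLS with i.i.d. noise this should land near the analytic standard errors; its appeal is that `fit` can be an entire pipeline, including variable selection, at the cost of re-fitting `n_boot` times.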

@josef-pkt

I checked the test case from this PR with statsmodels OLS and WLS

for OLS absolute and relative difference

>>> res.bse - [2.10596119117, 3.97159995702e-05, 1.48683754360e-06]
array([  4.13202805e-11,   8.17535872e-16,   2.63994759e-17])
>>> res.bse / [2.10596119117, 3.97159995702e-05, 1.48683754360e-06] - 1
array([  1.96205274e-11,   2.05846451e-11,   1.77553527e-11])

for WLS absolute and relative difference

>>> resw.bse - [6171.11372672, 45032.21987029]
array([ -1.07138476e-09,   1.31694833e-09])
>>> resw.bse / [6171.11372672, 45032.21987029] - 1
array([ -1.73638881e-13,   2.93098879e-14])

There are differences at float64 precision between packages because different linear algebra routines are used, but SVD, which both scikit-learn and statsmodels use, is supposed to be the most numerically stable, most likely more stable than the pivoting QR that base R uses.
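To illustrate the SVD route being referred to (a sketch of the standard pseudoinverse construction, not either library's internal code): the least-squares solution is built from X = U S Vᵀ with tiny singular values zeroed out, which is exactly what `np.linalg.pinv` does.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

# Least squares via SVD: X = U S V^T, beta = V S^+ U^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
tol = s.max() * max(X.shape) * np.finfo(s.dtype).eps  # rank cutoff
s_inv = np.where(s > tol, 1.0 / s, 0.0)               # zero tiny singular values
beta_svd = Vt.T @ (s_inv * (U.T @ y))

# The pinv shortcut gives the same answer.
beta_pinv = np.linalg.pinv(X) @ y
```

Zeroing the small singular values is what keeps the solution stable for near-singular design matrices, where naive normal-equation solvers lose the most precision.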

@josef-pkt

And to illustrate:
We haven't had any complaints for linear regression except for the singular design matrix case. The use of pinv/SVD hasn't changed since the beginning of statsmodels.

Here is one of my investigations into edge cases for linear regression, from 2012:
https://jpktd.blogspot.ca/2012/03/numerical-accuracy-in-linear-least.html
using the NIST test case that caused the most problems for statistical packages.

@jl2922
Author

jl2922 commented May 15, 2017

It turns out that I forgot to add a constant to x when I was trying statsmodels, since in the other stats packages I have used, the default is to include the constant. Sorry for the wrong message, and I appreciate the use of SVD @josef-pkt
