add standard deviation calculation for linear regression #8872
Conversation
Several tests in sklearn/linear_model/tests/test_logistic.py appear flanky. Shall we disable these? |
Not really sure what you mean by "flanky". The tests are here to ensure that the API does not get broken. In this regard, it seems that they are doing their job right now ;) |
this feature has been discussed in the past, and it was decided that people should use statsmodels for this, not sklearn, which targets prediction. if we still agree on this, I am -1 on this PR. |
@glemaitre Sorry, I meant 'flaky'. If you run these tests with the latest merged commit, they will also fail sometimes. All the failed ones relate to comparing the results from two different methods, neither of which gives the accurate answer. And I don't think it is good practice to compare results from two functions in tests; rather, each of them should be compared with a theoretically correct result. |
@agramfort Regarding whether this feature is suitable or necessary, I just want to point out that statsmodels does not even give the correct answer for coefficients, and their standard deviations are therefore almost meaningless, which is why I ended up writing my own version for my work, which requires serious statistical results. At least I couldn't find any packages online that offer all of the following features:
|
If you can raise an issue with those so that we can reproduce them and change the tests accordingly, that would be great.
I partially agree, since it can depend on the context (if the two functions are supposed to return the same thing, you can actually check the outputs and additionally check the theoretical result). |
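To make the testing point concrete, here is a minimal sketch (not code from this PR, with invented toy data and an arbitrary tolerance) of a test that checks LinearRegression against the closed-form least-squares solution rather than against a second estimator:

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.linear_model import LinearRegression

def test_coef_matches_normal_equations():
    # Invented toy data; the point is comparing against a closed-form
    # reference rather than against another estimator.
    rng = np.random.RandomState(0)
    X = rng.randn(50, 3)
    y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + 0.1 * rng.randn(50)

    reg = LinearRegression().fit(X, y)

    # Closed-form least-squares solution with an explicit intercept column.
    X1 = np.hstack([X, np.ones((50, 1))])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

    assert_allclose(reg.coef_, beta[:-1], rtol=1e-8)
    assert_allclose(reg.intercept_, beta[-1], rtol=1e-8)
```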
I still think that for community coordination it's better to fix this in statsmodels than to do it here ... cc @josef-pkt thoughts? |
A bug report to statsmodels would be helpful. What the scope of scikit-learn should be is not my issue. The only generic approach would be to compute bootstrap standard errors or inference over the entire algorithm, but that still relies on choosing the right bootstrap if either heteroscedasticity or correlation between observations is present. (It also might not be valid for all algorithms, but I don't have an overview there.) |
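For illustration only, a pairs-bootstrap sketch of per-coefficient standard errors around LinearRegression; the resample count is arbitrary, and, as noted above, this particular bootstrap is not the right choice under heteroscedasticity or correlated observations:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def bootstrap_coef_se(X, y, n_boot=1000, random_state=0):
    """Pairs-bootstrap standard errors of LinearRegression coefficients."""
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        # Resample rows (pairs of X and y) with replacement and refit.
        idx = rng.randint(0, n_samples, size=n_samples)
        coefs[b] = LinearRegression().fit(X[idx], y[idx]).coef_
    # Standard deviation across bootstrap replicates, per coefficient.
    return coefs.std(axis=0, ddof=1)
```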
I checked the test case from this PR with statsmodels OLS and WLS. For OLS, absolute and relative difference:
For WLS, absolute and relative difference:
There are differences at float64 precision between packages because different linear algebra functions are used, but SVD, which both scikit-learn and statsmodels use, is supposed to be the most numerically stable, most likely more stable than the pivoting QR that base R uses. |
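The numeric output was not preserved above; a comparison of this kind can be reproduced roughly as follows (the data here are placeholders, not the PR's test case):

```python
# Rough reproduction sketch: compare LinearRegression coefficients against
# statsmodels OLS on the same data and report absolute/relative differences.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.randn(100, 4)
y = X @ rng.randn(4) + 1.0 + 0.5 * rng.randn(100)

sk_coef = LinearRegression().fit(X, y).coef_
sm_coef = sm.OLS(y, sm.add_constant(X)).fit().params[1:]  # drop the intercept term

abs_diff = np.abs(sk_coef - sm_coef)
rel_diff = abs_diff / np.abs(sm_coef)
print(abs_diff.max(), rel_diff.max())  # typically near float64 precision
```

A weighted comparison could be done the same way, pairing sm.WLS(y, sm.add_constant(X), weights=w) with LinearRegression().fit(X, y, sample_weight=w).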
And to illustrate: here is one of my investigations into edge cases for linear regression from 2012 |
It turns out that I forgot to add a constant to x when I was trying statsmodels, since in other stats packages that I have used the default is to include the constant. Sorry for the wrong message, and I appreciate the use of SVD @josef-pkt
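For reference, statsmodels does not add an intercept by default, so the design matrix has to be augmented explicitly, for example:

```python
# statsmodels fits through the origin unless a constant column is added explicitly.
import numpy as np
import statsmodels.api as sm

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + 3.0

X_with_const = sm.add_constant(X)        # prepends a column of ones
results = sm.OLS(y, X_with_const).fit()  # now comparable to LinearRegression(fit_intercept=True)
print(results.params)                    # [intercept, slope], approximately [3.0, 2.0]
```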
Standard deviation gives the half-width of the 68% confidence interval of the estimated coefficients. This PR adds standard deviation calculation for the coefficients in the linear regression model.
The computational method is described here.
#8870
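The linked method itself is not reproduced in this thread. Under the standard OLS assumptions, the coefficient standard errors follow the textbook formula se(beta_j) = sqrt(sigma^2 [(X'X)^{-1}]_jj) with sigma^2 = RSS / (n - p - 1); a sketch of that computation on top of a fitted LinearRegression (not the PR's actual code) could look like:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def ols_coef_se(X, y):
    """Textbook OLS standard errors for the coefficients of LinearRegression."""
    n, p = X.shape
    reg = LinearRegression().fit(X, y)
    # Augment with an intercept column so the covariance matches the fitted model.
    X1 = np.hstack([np.ones((n, 1)), X])
    resid = y - reg.predict(X)
    sigma2 = resid @ resid / (n - p - 1)      # residual variance; -1 for the intercept
    cov = sigma2 * np.linalg.inv(X1.T @ X1)   # covariance of [intercept, coefficients]
    return np.sqrt(np.diag(cov))[1:]          # standard errors of coef_ only
```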
@glemaitre