bpo-38490: statistics: Add covariance, Pearson's correlation, and simple linear regression #16813
Conversation
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA). We couldn't find a bugs.python.org (b.p.o) account corresponding to the following GitHub usernames. This might simply be due to a missing "GitHub Name" entry in one's b.p.o account settings. This is necessary for legal reasons before we can look at this contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. You can check yourself to see if the CLA has been received. Thanks again for the contribution; we look forward to reviewing it!
I'd recommend opening the PR after the discussion on https://bugs.python.org/issue38490 is finalized. :)
```diff
@@ -910,10 +910,13 @@ def correlation(x, y, /):
         raise StatisticsError('at least one of the inputs is constant')


+def linear_regression(regressor, dependent_variable):
+    LinearRegression = namedtuple('LinearRegression', ['intercept', 'slope'])
```
The parameter order seems backwards to me, here and throughout the PR. Traditionally, we say an equation for a line is in slope-intercept form, `y = mx + b`, with the *m* preceding the *b*.
It depends on your background: in statistics the model is often written as `y = β_0 + β_1 * x + ε`, so the intercept comes first. Also, now that it is a named tuple, the order is of less importance. Of course, if you insist, I can change it.
Thank you Tymoteusz for your work.
I agree with Raymond that using a namedtuple is a good idea here.
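A minimal sketch of the namedtuple idea being discussed (the field names and their order are illustrative only; both were still under debate in this thread):

```python
from collections import namedtuple

# Illustrative only: the field order was still being discussed in this thread.
LinearRegression = namedtuple('LinearRegression', ['intercept', 'slope'])

result = LinearRegression(intercept=3.0, slope=2.0)

# Fields are accessible by name, so the positional order matters less:
print(result.intercept)  # 3.0
print(result.slope)      # 2.0

# ...but tuple unpacking still depends on the declared order:
intercept, slope = result
```

This is the point made above: named access makes the choice of order mostly a documentation question, though unpacking keeps it from being entirely moot.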
On Fri, Oct 09, 2020 at 12:52:16AM -0700, Raymond Hettinger wrote:

> The parameter order seems backwards to me, here and throughout the
> PR. Traditionally, we say an equation for a line is in
> slope-intercept form as `y = mx + b`, with the *m* preceding the *b*.

This is true for most areas of mathematics except for statistics and finance. In those, the usual form of the linear regression equation is:

    y = a + bx

which, as Raymond points out, is the reverse of the usual form:

    y = mx + c

(or sometimes `mx + b` for the constant term). See, for example:

- https://www.statisticssolutions.com/conduct-interpret-linear-regression/
- https://www.investopedia.com/terms/r/regression.asp

--
Steve
On Fri, Oct 09, 2020 at 01:15:52AM -0700, Tymoteusz Wołodźko wrote:

> Also, now that it is a named tuple, the order is of
> less importance. Of course, if you insist, I may change it.

I would prefer you keep the (intercept, gradient) order, which is more conventional in statistics.
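For reference, the fit itself is independent of which order the results are returned in. A self-contained sketch of simple linear regression by ordinary least squares, fitting `y = intercept + gradient * x` (this illustrates the math under discussion, not the PR's actual code):

```python
from statistics import fmean  # Python 3.8+

def simple_linear_regression(x, y):
    """Illustrative ordinary-least-squares fit of y = intercept + slope * x.

    A sketch of the underlying math, not the implementation in the PR.
    """
    x_bar = fmean(x)
    y_bar = fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx                    # raises ZeroDivisionError if x is constant
    intercept = y_bar - slope * x_bar    # line passes through (x_bar, y_bar)
    return intercept, slope

# Perfectly linear data: y = 1 + 2x
print(simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9]))  # (1.0, 2.0)
```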
@stevendaprano @rhettinger should I apply any more changes, or is it ok as-is?
The only current outstanding issue is the order of the returned values. ISTM, given the short discussion here, that the order is acceptable. I'd like to get @rhettinger's approval, or otherwise, for this before moving forward.
Given that this is the statistics module, not linear algebra, I think we ought to stick to the common convention taught in stats classes, which is

    y = a + bx

i.e. (intercept, gradient) in that order. Unless Raymond strongly objects, I say go with this order.
Do we have time to add a new feature to this before feature-freeze? I would like to add two methods to the tuple returned:

```python
def predict_y(self, x):
    """Return the predicted y value for the given x.

    Returns the predicted response or dependent variable.
    """
    return self.intercept + self.gradient*x

def predict_x(self, y):
    """Return the predicted x value for the given y.

    Returns the predicted explanatory or independent variable.
    """
    return (y - self.intercept)/self.gradient
```

Predicting the x or y values from the linear regression line is a very common operation; I think it would be very useful to offer these without expecting users to create the functions themselves.
--
Steve
@stevendaprano I'd be against having such predict methods on the returned tuple.
It's been two weeks and the discussion seems to have stalled. Is there anything more I should change about the PR?
Please add an entry to the What's New document.
I'm also -1 on adding the predict methods.
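Without dedicated methods, prediction from the fitted line is a one-liner anyway. A sketch, assuming a result tuple with the field names used in this thread (`intercept` and `slope` are illustrative, not necessarily the final API):

```python
from collections import namedtuple

# Hypothetical result type, matching the field names discussed in this thread.
LinearRegression = namedtuple('LinearRegression', ['intercept', 'slope'])
fit = LinearRegression(intercept=1.0, slope=2.0)

x = 10
y_hat = fit.intercept + fit.slope * x        # predict y from x
x_hat = (y_hat - fit.intercept) / fit.slope  # invert the line to recover x

print(y_hat)  # 21.0
print(x_hat)  # 10.0
```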
Apart from my minor suggestion for the What's New entry, this looks good to me. @rhettinger, would you like to take another look? Any other comments?
Co-authored-by: Tal Einat <taleinat+github@gmail.com>
@twolodzko I am very impressed by this, thank you, and I look forward to using it. I just commented on what looks like a broken piece of ReST and a slightly unusual choice of wording, but apart from that I am very happy with your patch and I think it should be merged.
If I'm not missing something, all issues seem to be resolved right now.
Indeed.
Unfortunately, it seems like there are some minor grammatical errors in the new docstrings and documentation.
@taleinat: Please replace …
Merged! This had indeed reached a very good state a while ago and was reviewed favorably by several core devs. Definitely good to have this for 3.10.
This PR adds functions for calculating bivariate covariance, Pearson's correlation, and simple linear regression.
https://bugs.python.org/issue38490
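For context, the two bivariate statistics the PR adds can be sketched in plain Python as follows (an illustrative sketch of the standard formulas, not the PR's implementation; it omits the input validation the real functions would need):

```python
from math import sqrt
from statistics import fmean

def covariance(x, y):
    """Sample covariance of two sequences (divides by n - 1)."""
    n = len(x)
    x_bar = fmean(x)
    y_bar = fmean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson's correlation coefficient r, always in [-1, 1]."""
    x_bar = fmean(x)
    y_bar = fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)  # zero variance in x or y is an error

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]      # y is perfectly correlated with x
print(covariance(x, y))   # 5.0
print(correlation(x, y))  # 1.0
```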