bpo-38490: statistics: Add covariance, Pearson's correlation, and simple linear regression #16813
Conversation
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA). We couldn't find a bugs.python.org (b.p.o) account corresponding to the following GitHub usernames. This might simply be due to a missing "GitHub Name" entry in one's b.p.o account settings. This is necessary for legal reasons before we can look at this contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. You can check yourself to see if the CLA has been received. Thanks again for the contribution; we look forward to reviewing it!
I'd recommend opening the PR after the discussion on https://bugs.python.org/issue38490 is finalized. :)
```diff
@@ -910,10 +910,13 @@ def correlation(x, y, /):
         raise StatisticsError('at least one of the inputs is constant')


+def linear_regression(regressor, dependent_variable):
+    LinearRegression = namedtuple('LinearRegression', ['intercept', 'slope'])
```
The parameter order seems backwards to me, here and throughout the PR. Traditionally, we say an equation for a line is in slope-intercept form, `y = mx + b`, with the *m* preceding the *b*.
It depends on your background: in statistics the model is often written as `y = β_0 + β_1 * x + ε`, so the intercept comes first. Also, now that it is a named tuple, the order is of less importance. Of course, if you insist, I can change it.
Thank you Tymoteusz for your work.
I agree with Raymond that using a namedtuple is a good idea here.
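A minimal sketch of the namedtuple idea being discussed (the field names and their order are illustrative only; both were still under debate in this thread):

```python
from collections import namedtuple

# Illustrative only: the field order was still being discussed in this thread.
LinearRegression = namedtuple('LinearRegression', ['intercept', 'slope'])

result = LinearRegression(intercept=3.0, slope=2.0)

# Fields are accessible by name, so the positional order matters less:
print(result.intercept)  # 3.0
print(result.slope)      # 2.0

# ...but tuple unpacking still depends on the declared order:
intercept, slope = result
```

This is the point made above: named access makes the choice of order mostly a documentation question, though unpacking keeps it from being entirely moot.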
On Fri, Oct 09, 2020 at 12:52:16AM -0700, Raymond Hettinger wrote:

> The parameter order seems backwards to me, here and throughout the
> PR. Traditionally, we say an equation for a line is in
> slope-intercept form as `y = mx + b`, with the *m* preceding the *b*.

This is true for most areas of mathematics except for statistics and finance. In those, the usual form of the linear regression equation is:

    y = a + bx

which, as Raymond points out, is the reverse of the usual form:

    y = mx + c

(or sometimes `mx + b` for the constant term). See, for example:

- https://www.statisticssolutions.com/conduct-interpret-linear-regression/
- https://www.investopedia.com/terms/r/regression.asp

--
Steve
On Fri, Oct 09, 2020 at 01:15:52AM -0700, Tymoteusz Wołodźko wrote:

> Also, now that it is a named tuple, the order is of
> less importance. Of course, if you insist, I may change it.

I would prefer you keep the (intercept, gradient) order, which is more conventional in statistics.
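For reference, the fit itself is independent of which order the results are returned in. A self-contained sketch of simple linear regression by ordinary least squares, fitting `y = intercept + gradient * x` (this illustrates the math under discussion, not the PR's actual code):

```python
from statistics import fmean  # Python 3.8+

def simple_linear_regression(x, y):
    """Illustrative ordinary-least-squares fit of y = intercept + slope * x.

    A sketch of the underlying math, not the implementation in the PR.
    """
    x_bar = fmean(x)
    y_bar = fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx                    # raises ZeroDivisionError if x is constant
    intercept = y_bar - slope * x_bar    # line passes through (x_bar, y_bar)
    return intercept, slope

# Perfectly linear data: y = 1 + 2x
print(simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9]))  # (1.0, 2.0)
```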
@stevendaprano @rhettinger should I apply any more changes, or is it ok as-is?
The only current outstanding issue is the order of the returned values. ISTM, given the short discussion here, that the order is acceptable. I'd like to get @rhettinger's approval, or otherwise, for this before moving forward.
Given that this is the statistics module, not linear algebra, I think we ought to stick to the common convention taught in stats classes, which is

    y = a + bx

i.e. (intercept, gradient) in that order. Unless Raymond strongly objects, I say go with this order.
Do we have time to add a new feature to this before feature-freeze? I would like to add two methods to the tuple returned:

```python
def predict_y(self, x):
    """Return the predicted y value for the given x.

    Returns the predicted response or dependent variable.
    """
    return self.intercept + self.gradient*x

def predict_x(self, y):
    """Return the predicted x value for the given y.

    Returns the predicted explanatory or independent variable.
    """
    return (y - self.intercept)/self.gradient
```

Predicting the x or y values from the linear regression line is a very common operation; I think it would be very useful to offer these without expecting users to create the functions themselves.
--
Steve
@stevendaprano I'd be against having such predict methods on the returned tuple.
It's been two weeks and the discussion seems to have stalled. Is there anything more I should change about the PR?
Please add an entry to the What's New document.
I'm also -1 on adding the predict methods.
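Without dedicated methods, prediction from the fitted line is a one-liner anyway. A sketch, assuming a result tuple with the field names used in this thread (`intercept` and `slope` are illustrative, not necessarily the final API):

```python
from collections import namedtuple

# Hypothetical result type, matching the field names discussed in this thread.
LinearRegression = namedtuple('LinearRegression', ['intercept', 'slope'])
fit = LinearRegression(intercept=1.0, slope=2.0)

x = 10
y_hat = fit.intercept + fit.slope * x        # predict y from x
x_hat = (y_hat - fit.intercept) / fit.slope  # invert the line to recover x

print(y_hat)  # 21.0
print(x_hat)  # 10.0
```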
Apart from my minor suggestion for the What's New entry, this looks good to me. @rhettinger, would you like to take another look? Any other comments?
Co-authored-by: Tal Einat <taleinat+github@gmail.com>
@twolodzko I am very impressed by this, thank you, and I look forward to using it. I just commented on what looks like a broken piece of ReST and a slightly unusual choice of wording, but apart from that I am very happy with your patch and I think it should be merged.
If I'm not missing something, all issues seem to be resolved right now.
Indeed.
Unfortunately, it seems like there are some minor grammatical errors in the new docstrings and documentation.
@taleinat: Please replace …
Merged! This had indeed reached a very good state a while ago and was reviewed favorably by several core devs. Definitely good to have this for 3.10.
This PR adds functions for calculating bivariate covariance, Pearson's correlation, and simple linear regression.
https://bugs.python.org/issue38490
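For context, the two bivariate statistics the PR adds can be sketched in plain Python as follows (an illustrative sketch of the standard formulas, not the PR's implementation; it omits the input validation the real functions would need):

```python
from math import sqrt
from statistics import fmean

def covariance(x, y):
    """Sample covariance of two sequences (divides by n - 1)."""
    n = len(x)
    x_bar = fmean(x)
    y_bar = fmean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson's correlation coefficient r, always in [-1, 1]."""
    x_bar = fmean(x)
    y_bar = fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)  # zero variance in x or y is an error

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]      # y is perfectly correlated with x
print(covariance(x, y))   # 5.0
print(correlation(x, y))  # 1.0
```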