
Description
Bug report
From the statistics documentation:

> Unless explicitly noted, these functions support int, float, Decimal and Fraction.

However, this does not hold for the new (3.10) "Statistics for relations between two inputs": `covariance`, `correlation`, and `linear_regression`:
```python
from decimal import Decimal
import statistics

x = [Decimal(number) for number in [1, 2, 3, 4, 5, 6, 7, 8, 9]]
y = [Decimal(number) for number in [1, 2, 3, 1, 2, 3, 1, 2, 3]]

statistics.covariance(x, y)
statistics.correlation(x, y)
statistics.linear_regression(x, y)
```
All three raise `TypeError: unsupported operand type(s) for -: 'decimal.Decimal' and 'float'` because some intermediate values are computed as float, which cannot be combined with Decimal in arithmetic operations. (int and Fraction *can* be combined with float, so those inputs do not raise a TypeError, but the result is then always float, which may or may not be desirable.)
Expected fix
I see three ways to address this:

- Change the documentation to explicitly make `correlation` et al. float-only.
- Convert all inputs to float and always return float, like `geometric_mean`.
- Rewrite to return the same data type as the inputs, like `mean`.
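For comparison, the second option could look roughly like this (a hypothetical sketch of `covariance` only; `geometric_mean` similarly coerces its input with `map(float, ...)`):

```python
from math import fsum
from decimal import Decimal
from statistics import StatisticsError

def covariance_float(x, y, /):
    # Option 2 sketch: coerce everything to float up front, so Decimal and
    # Fraction inputs are accepted but the result is always float.
    x = list(map(float, x))
    y = list(map(float, y))
    n = len(x)
    if len(y) != n:
        raise StatisticsError('covariance requires that both inputs have same number of data points')
    if n < 2:
        raise StatisticsError('covariance requires at least two data points')
    xbar = fsum(x) / n
    ybar = fsum(y) / n
    sxy = fsum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / (n - 1)

# Decimal inputs no longer raise, but the result is float:
print(covariance_float([Decimal(1), Decimal(2), Decimal(3)],
                       [Decimal(1), Decimal(2), Decimal(3)]))  # 1.0
```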
The first option seems to go against the design rationale of the statistics module, but apparently no one had a problem with float-only until now, so this may be fine. Otherwise, the second would probably be easiest to implement. As for the last option: the original implementation (#16813) almost worked for Decimal. By changing `covariance` to use `statistics._sum` instead of `math.fsum` and converting the result, all three functions would return Decimal for Decimal inputs (`correlation` still returns float for Fraction inputs, because there is no sensible general way to handle Fraction square roots):
```python
def covariance(x, y, /):
    """ doc... """
    n = len(x)
    if len(y) != n:
        raise StatisticsError('covariance requires that both inputs have same number of data points')
    if n < 2:
        raise StatisticsError('covariance requires at least two data points')
    xbar = mean(x)
    ybar = mean(y)
    T, total, _ = _sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return _convert(total / (n - 1), T)
```
However, some optimizations (#26135) have since been made based on the float-only assumption, and these would need to be reevaluated. Applying the same `fsum` <-> `_sum` replacement to the optimized versions leads to a ~10x slowdown on my machine, which makes them only ~3x faster than the original implementation. There is probably a faster way to do it, and I have not investigated potential accuracy differences; this just shows that fully handling Decimal will have a significant, but perhaps acceptable, performance cost:
```python
from collections import namedtuple
from statistics import StatisticsError, _sum, _coerce, _convert
from math import sqrt

def _sqrt(x, /):
    """ special-case square root for Decimal. Note Fraction and int are still converted to float. """
    try:
        return x.sqrt()
    except AttributeError:
        return sqrt(x)

def covariance(x, y, /):
    """ doc... """
    xT, xtotal, xn = _sum(x)
    yT, ytotal, yn = _sum(y)
    if xn != yn:
        raise StatisticsError('covariance requires that both inputs have same number of data points')
    if xn < 2:
        raise StatisticsError('covariance requires at least two data points')
    T = _coerce(xT, yT)
    xbar = _convert(xtotal / xn, T)
    ybar = _convert(ytotal / yn, T)
    _, sxy, _ = _sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return _convert(sxy / (xn - 1), T)

def correlation(x, y, /):
    """ doc... """
    xT, xtotal, xn = _sum(x)
    yT, ytotal, yn = _sum(y)
    if xn != yn:
        raise StatisticsError('correlation requires that both inputs have same number of data points')
    if xn < 2:
        raise StatisticsError('correlation requires at least two data points')
    T = _coerce(xT, yT)
    xbar = _convert(xtotal / xn, T)
    ybar = _convert(ytotal / yn, T)
    _, sxy, _ = _sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    _, sxx, _ = _sum((d := xi - xbar) * d for xi in x)
    _, syy, _ = _sum((d := yi - ybar) * d for yi in y)
    try:
        return _convert(sxy / _sqrt(sxx * syy), T)
    except ZeroDivisionError:
        raise StatisticsError('at least one of the inputs is constant')

LinearRegression = namedtuple('LinearRegression', ('slope', 'intercept'))

def linear_regression(x, y, /, *, proportional=False):
    """ doc... """
    xT, xtotal, xn = _sum(x)
    yT, ytotal, yn = _sum(y)
    if xn != yn:
        raise StatisticsError('linear_regression requires that both inputs have same number of data points')
    if xn < 2:
        raise StatisticsError('linear_regression requires at least two data points')
    T = _coerce(xT, yT)
    if proportional:
        _, sxy, _ = _sum(xi * yi for xi, yi in zip(x, y))
        _, sxx, _ = _sum(xi * xi for xi in x)
    else:
        xbar = _convert(xtotal / xn, T)
        ybar = _convert(ytotal / yn, T)
        _, sxy, _ = _sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        _, sxx, _ = _sum((d := xi - xbar) * d for xi in x)
    try:
        slope = _convert(sxy / sxx, T)  # equivalent to: covariance(x, y) / variance(x)
    except ZeroDivisionError:
        raise StatisticsError('x is constant')
    intercept = _convert(0.0 if proportional else ybar - slope * xbar, T)
    return LinearRegression(slope=slope, intercept=intercept)
```
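The slowdown from the `fsum` <-> `_sum` swap can be reproduced in isolation with a rough sketch like this (absolute numbers and the exact ratio will vary by machine and input size):

```python
# Rough benchmark comparing math.fsum against the private statistics._sum
# on the same float data; _sum does exact Fraction accumulation, so it is
# expected to be roughly an order of magnitude slower.
from timeit import timeit
from statistics import _sum
from math import fsum

data = [float(i) for i in range(1_000)]

t_fsum = timeit(lambda: fsum(data), number=100)
t_sum = timeit(lambda: _sum(data), number=100)
print(f"fsum: {t_fsum:.4f}s  _sum: {t_sum:.4f}s  ratio: {t_sum / t_fsum:.1f}x")
```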
Environment
- CPython versions tested on: 3.10.5, with both the original Lib/statistics.py and the current latest Lib/statistics.py
- Operating system and architecture: 64-bit Ubuntu 20.04.4 on Intel Core i5-7200U (x86)
Bonus Enhancement
If rewriting, maybe make `covariance`, `correlation`, and `linear_regression` accept iterators too, as all other `statistics` functions do even when no single-pass algorithm is available, e.g. `harmonic_mean`:
```python
def harmonic_mean(data, weights=None):
    if iter(data) is data:
        data = list(data)
    ...
```