Skip to content

Inconsistent return type in statistics.median_grouped() #92531

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rhettinger opened this issue May 9, 2022 · 2 comments
Closed

Inconsistent return type in statistics.median_grouped() #92531

rhettinger opened this issue May 9, 2022 · 2 comments
Labels
3.11 only security fixes type-bug An unexpected behavior, bug, or error

Comments

@rhettinger
Copy link
Contributor

The median_grouped() function erroneously returns inconsistent types in its special case path for a single argument. The function should always return a float.

>>> median_grouped([Fraction(2, 3), Fraction(3, 5)])
0.16666666666666663
>>>median_grouped([Fraction(2, 3)])
Fraction(2, 3)

>>> median_grouped([Decimal(5.5), Decimal(2.5)])
5.0
>>> median_grouped([Decimal(5.5)])
Decimal('5.5')
@rhettinger rhettinger added the type-bug An unexpected behavior, bug, or error label May 9, 2022
@rhettinger rhettinger added the 3.11 only security fixes label May 11, 2022
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 11, 2022
…-92531 (pythonGH-92533)

(cherry picked from commit e01eeb7)

Co-authored-by: Raymond Hettinger <rhettinger@users.noreply.github.com>
@stevendaprano
Copy link
Member

The function should always return a float.

@rhettinger I'm not going to ask for this change to be reverted. I think that so long as the function does convert to float, you are right that it should do so consistently. But this function is not intended to always return floats. The documentation says:

CPython implementation detail: Under some circumstances, median_grouped() may coerce data points to floats. This behaviour is likely to change in the future.

The intent is that if your data is homogeneous Decimals or Fractions, it should honour those types, not coerce them into floats. (Obviously the function doesn't do that at the moment.)

The problem is that the data stream (list, iterator etc) may not be homogeneous and I'm not sure what how to handle the case of heterogeneous data. The other statistics functions mostly try very hard to coerce heterogeneous types into a common type, but that has performance costs and I don't think it is appropriate for the median functions.

I have some ideas regarding that, e.g. to single out the first data value as the canonical type, and then go on with assuming the rest of the data values are compatible. But until the behaviour changes, I concede that your change is an improvement.

@rhettinger
Copy link
Contributor Author

Except for the inconsistent special case, the code has always returned a float, regardless of input type for almost a decade. There was even a test to assure that was the case. The comment in the test indicate that the original implementers gave-up on trying to make median_grouped() preserve the input types. Since then, no one has touched the code or tests.

Though it's risky at this point, the implementation note in the docs leaves the door open for change. To follow through with the original intent, make the following edit:

diff --git a/Lib/statistics.py b/Lib/statistics.py
index 2d66b0522f..99c7b8aea8 100644
--- a/Lib/statistics.py
+++ b/Lib/statistics.py
@@ -611,7 +611,7 @@ def median_high(data):
     return data[n // 2]
 
 
-def median_grouped(data, interval=1.0):
+def median_grouped(data, interval=1):
     """Estimates the median for numeric data binned around the midpoints
     of consecutive, fixed-width intervals.
 
@@ -668,19 +668,10 @@ def median_grouped(data, interval=1.0):
     i = bisect_left(data, x)
     j = bisect_right(data, x, lo=i)
 
-    # Coerce to floats, raising a TypeError if not possible
-    try:
-        interval = float(interval)
-        x = float(x)
-    except ValueError:
-        raise TypeError(f'Value cannot be converted to a float')
-
-    # Interpolate the median using the formula found at:
-    # https://www.cuemath.com/data/median-of-grouped-data/
-    L = x - interval / 2.0    # Lower limit of the median interval
-    cf = i                    # Cumulative frequency of the preceding interval
-    f = j - i                 # Number of elements in the median internal
-    return L + interval * (n / 2 - cf) / f
+    L2 = x + x - interval       # Twice the lower limit of the median interval
+    cf = i                      # Cumulative frequency of the preceding interval
+    f = j - i                   # Number of elements in the median internal
+    return (L2 * f + interval * (n - cf - cf)) / (f + f)

 def mode(data):

That gives a run like this:

>>> median_grouped([Fraction(2, 3)])
Fraction(2, 3)
>>> median_grouped([Fraction(2, 3)]*3)
Fraction(2, 3)
>>> median_grouped([Fraction(2, 3), 31]*3)
30.5
>>> median_grouped([Fraction(2, 3), Fraction(3, 5)]*3)
Fraction(1, 6)
>>> median_grouped([Fraction(2, 3)]* 2 + [Fraction(3, 5)]*3)
Fraction(14, 15)
>>> median_grouped([Decimal('2.2')]* 2 + [Decimal('1.1')]*3)
Decimal('1.433333333333333333333333333')

If an inconsistent type is given for interval, then a TypeError is raised:

>>> median_grouped([Decimal('2.2')]* 2 + [Decimal('1.1')]*3, 0.5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/raymond/cpython/Lib/statistics.py", line 671, in median_grouped
    L2 = x + x - interval       # Twice the lower limit of the median interval
         ~~~~~~^~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'decimal.Decimal' and 'float'

If the input data has inconsistent types, that is ignored. Only the value at the midpoint is used (as one would expect from a median calculation).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.11 only security fixes type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

2 participants