Diabetes data should match the original source #31940

@jondo

Describe the bug

When load_diabetes is called with scaled=False, some values of the s5 feature carry floating-point noise instead of the exact values from the source.
Every value should be unchanged by rounding to 4 decimals, but 11 of them are not.

This happens because the unpacked sklearn/datasets/data/diabetes_data_raw.csv.gz differs numerically from the original data.
For example, entry no. 147 is 4.803999999999999 in the bundled file (line 147) but 4.804 in the original (line 148, because of the header row).
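
The two literals are not just two spellings of one number: they parse to distinct doubles. A minimal sketch in plain Python (no scikit-learn needed), using only the two values quoted above; the variable names are chosen for illustration:

```python
# Values quoted in this report: bundled file vs. original source.
bundled = float("4.803999999999999")
original = float("4.804")

# The two strings parse to different IEEE 754 doubles.
values_differ = bundled != original

# Rounding to 4 decimals maps the bundled value back onto the original one.
rounding_repairs = round(bundled, 4) == original

print(values_differ, rounding_repairs)
print(f"{bundled:.17g} vs {original:.17g}")
```

This suggests the bundled file could be repaired by rounding the affected column, although the cleaner fix is probably to regenerate the file from the original source.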

The example under Steps/Code to Reproduce shows how the behavior changes when the data source is toggled with use_internal.

I need the correct data because my code tries to autodetect the precision of the data, and it currently cannot detect the correct precision of s5.
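
To illustrate why such noise defeats precision detection, here is a rough sketch of one possible autodetection approach. The helper `detect_decimals` is hypothetical and is not the code referred to above; it simply looks for the smallest number of decimals at which rounding leaves every value unchanged.

```python
def detect_decimals(values, max_decimals=17):
    """Return the smallest number of decimals that reproduces every value.

    Hypothetical helper for illustration; not the reporter's actual code.
    """
    for decimals in range(max_decimals + 1):
        if all(round(v, decimals) == v for v in values):
            return decimals
    return None  # no precision up to max_decimals fits


# Clean values, written the way the original source stores them.
clean = [4.804, 5.366, 5.451]
# Same list, but the first value as it appears in the bundled file.
noisy = [4.803999999999999, 5.366, 5.451]

clean_decimals = detect_decimals(clean)
noisy_decimals = detect_decimals(noisy)
print(clean_decimals, noisy_decimals)
```

With the clean list the helper should settle on 3 decimals, while a single noisy entry pushes the detected precision far beyond the 4 decimals the source data has.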

Steps/Code to Reproduce

from sklearn.datasets import load_diabetes
import requests
from io import StringIO
import pandas as pd

use_internal = True
if use_internal:
    diabetes = load_diabetes(as_frame=True, scaled=False)
    s5 = diabetes.frame['s5']
else:
    # diabetes.DESCR names https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
    # as source URL.
    # There the following orig_url is linked as the original data set.
    orig_url = 'https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt'
    response = requests.get(orig_url)
    response.raise_for_status()
    data = StringIO(response.text)
    diabetes = pd.read_csv(data, sep='\t')
    s5 = diabetes['S5']

# Any value that changes when rounded to 4 decimals carries spurious precision.
rounded = s5.round(4)
pd.set_option('display.precision', 16)
diff = s5[s5 != rounded]
print(diff)
assert diff.empty

Expected Results

Series([], Name: S5, dtype: float64)

and no AssertionError.

(As with use_internal = False)

Actual Results

146    4.8039999999999994
239    5.3660000000000005
265    4.8039999999999994
303    5.4510000000000005
313    5.2470000000000008
324    5.3660000000000005
359    4.8039999999999994
364    4.8039999999999994
410    5.3660000000000005
415    4.8039999999999994
428    5.3660000000000005
Name: s5, dtype: float64

and an AssertionError, because the diff.empty assertion fails.

(In contrast to use_internal = False)

Versions

System:
    python: 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0]
executable: /home/robert/projects/lcm-stp/fuchs23/env312/bin/python3
   machine: Linux-6.14.0-27-generic-x86_64-with-glibc2.39

Python dependencies:
      sklearn: 1.5.1
          pip: None
   setuptools: 75.8.0
        numpy: 2.1.1
        scipy: 1.14.1
       Cython: None
       pandas: 2.2.2
   matplotlib: 3.9.2
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 32
         prefix: libscipy_openblas
       filepath: /home/robert/projects/lcm-stp/fuchs23/env312/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_-ff651d7f.so
        version: 0.3.27
threading_layer: pthreads
   architecture: SkylakeX

       user_api: blas
   internal_api: openblas
    num_threads: 32
         prefix: libscipy_openblas
       filepath: /home/robert/projects/lcm-stp/fuchs23/env312/lib/python3.12/site-packages/scipy.libs/libscipy_openblas-c128ec02.so
        version: 0.3.27.dev
threading_layer: pthreads
   architecture: Cooperlake

       user_api: openmp
   internal_api: openmp
    num_threads: 32
         prefix: libgomp
       filepath: /home/robert/projects/lcm-stp/fuchs23/env312/lib/python3.12/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
