Description
Describe the bug
When load_diabetes is called with scaled=False, some values in the s5 column are not stored with their original precision:
All values should stay equal when rounded to 4 decimals, but 11 of them do not.
This happens because the unpacked sklearn/datasets/data/diabetes_data_raw.csv.gz differs numerically from the original data in a few places.
For example, entry no. 147 contains 4.803999999999999 in the bundled file (line 147), but 4.804 in the original (line 148, because of the header row).
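To illustrate why the two values are not interchangeable, here is a minimal sketch in plain Python, independent of scikit-learn. It only uses the two literals quoted above; the variable names are chosen for illustration.

```python
# Values quoted in the report: bundled file vs. original data.
bundled = float("4.803999999999999")
original = float("4.804")

# The two strings parse to different IEEE 754 doubles,
# so an exact equality check fails.
print(bundled == original)

# Rounding the bundled value to 4 decimals gives back the original value.
print(round(bundled, 4) == original)

# repr() shows the shortest string that round-trips each double.
print(repr(bundled), repr(original))
```

Rounding hides the difference, but any code that compares the stored values exactly will see it.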
The following example shows the different behavior when the data source is toggled with use_internal.
I need the correct data because I have code that tries to autodetect the precision, and it currently cannot detect the correct precision of s5.
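The precision autodetection itself is not shown in this report. As a rough sketch of how such a check might look, the hypothetical helper detect_decimals below returns the smallest number of decimals at which rounding leaves every value unchanged. The function name, its max_decimals parameter, and the sample value 4.8598 are assumptions made for illustration only.

```python
def detect_decimals(values, max_decimals=17):
    """Return the smallest d such that round(v, d) == v for all values.

    Hypothetical helper for illustration; returns None if no d up to
    max_decimals works.
    """
    for d in range(max_decimals + 1):
        if all(round(v, d) == v for v in values):
            return d
    return None


# Values stored with at most 4 decimals are detected as such.
clean = [4.8598, 4.804]
print(detect_decimals(clean))

# One value that is off by a tiny amount forces a much larger answer.
noisy = [4.8598, float("4.803999999999999")]
print(detect_decimals(noisy))
```

With a check of this kind, a single slightly-off entry is enough to make the detected precision of the whole column far larger than 4.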
Steps/Code to Reproduce
from io import StringIO

import pandas as pd
import requests
from sklearn.datasets import load_diabetes

# Toggle between the data bundled with scikit-learn and the original source.
use_internal = True

if use_internal:
    diabetes = load_diabetes(as_frame=True, scaled=False)
    s5 = diabetes.frame['s5']
else:
    # diabetes.DESCR names https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
    # as source URL.
    # There the following orig_url is linked as the original data set.
    orig_url = 'https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt'
    response = requests.get(orig_url)
    response.raise_for_status()
    data = StringIO(response.text)
    diabetes = pd.read_csv(data, sep='\t')
    s5 = diabetes['S5']

# Values that change when rounded to 4 decimals are not stored exactly.
rounded = s5.round(4)
pd.set_option('display.precision', 16)
diff = s5[s5 != rounded]
print(diff)
assert diff.empty
Expected Results
Series([], Name: S5, dtype: float64)
and no assertion error (as with use_internal = False).
Actual Results
146 4.8039999999999994
239 5.3660000000000005
265 4.8039999999999994
303 5.4510000000000005
313 5.2470000000000008
324 5.3660000000000005
359 4.8039999999999994
364 4.8039999999999994
410 5.3660000000000005
415 4.8039999999999994
428 5.3660000000000005
Name: s5, dtype: float64
and the diff.empty assertion fails (in contrast to use_internal = False).
Versions
System:
    python: 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0]
    executable: /home/robert/projects/lcm-stp/fuchs23/env312/bin/python3
    machine: Linux-6.14.0-27-generic-x86_64-with-glibc2.39

Python dependencies:
    sklearn: 1.5.1
    pip: None
    setuptools: 75.8.0
    numpy: 2.1.1
    scipy: 1.14.1
    Cython: None
    pandas: 2.2.2
    matplotlib: 3.9.2
    joblib: 1.4.2
    threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
    user_api: blas
    internal_api: openblas
    num_threads: 32
    prefix: libscipy_openblas
    filepath: /home/robert/projects/lcm-stp/fuchs23/env312/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_-ff651d7f.so
    version: 0.3.27
    threading_layer: pthreads
    architecture: SkylakeX

    user_api: blas
    internal_api: openblas
    num_threads: 32
    prefix: libscipy_openblas
    filepath: /home/robert/projects/lcm-stp/fuchs23/env312/lib/python3.12/site-packages/scipy.libs/libscipy_openblas-c128ec02.so
    version: 0.3.27.dev
    threading_layer: pthreads
    architecture: Cooperlake

    user_api: openmp
    internal_api: openmp
    num_threads: 32
    prefix: libgomp
    filepath: /home/robert/projects/lcm-stp/fuchs23/env312/lib/python3.12/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
    version: None