Description
Describe the bug
I am currently using the scikit-learn classifier `feature_importances_` attribute in a project to rank important features from my model, and my CI pipeline runs the project test suite against `scikit-learn==1.3.2` and `scikit-learn==1.5.2` on a remote Linux host. I am seeing discrepancies in the output of the relevant test (for which I have provided a minimal reproducer below) across different machines, installations, and sklearn versions.
There are a few specific problems I am experiencing:
1. Locally, the test passes with a binary installation of `scikit-learn==1.3.2` and fails with `scikit-learn==1.5.2`. With the help of my team, we traced the failure back and found the earliest failing version to be `1.4.1.post1`. We suspect that it originates from a change made in "FIX force node values outside of [0, 1] range for monotonically constraints classification trees" (#27639), which switched `tree_.value` from storing absolute counts to storing proportions, but we have not determined a root cause for the discrepancy.
2. As mentioned in (1), when running the test suite locally on my `Mac-ARM64` machine, the test fails as described; however, when running it on a remote Linux machine, the test passes with both sklearn versions.
3. The test fails when I build `scikit-learn==1.3.2` from source rather than installing its binary distribution.

My main question is: what could cause these observed discrepancies between sklearn versions, installation types, and environments, and which output is most "correct"?
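To check which representation a given installation actually uses, one option is to inspect the root node of a single fitted tree. The following is only a minimal sketch: the toy data and the names `root_value` and `storage` are illustrative and not part of the project, and it assumes that the root node of a classification tree stores either per-class sample counts or per-class fractions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): 2 samples of class 0 and 4 samples of class 1.
X = np.array([[0], [0], [1], [1], [1], [1]])
y = np.array([0, 0, 1, 1, 1, 1])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# tree_.value has shape (node_count, n_outputs, max_n_classes);
# node 0 is the root and there is a single output here.
root_value = clf.tree_.value[0, 0]
total = float(root_value.sum())
n_samples = X.shape[0]

if np.isclose(total, 1.0):
    storage = "proportions"   # values sum to 1 at the root
elif np.isclose(total, n_samples):
    storage = "counts"        # values sum to the number of training samples
else:
    storage = "unknown"

print(root_value, storage)
```

Running this in each environment (binary install, source build, macOS, Linux) would at least confirm whether the installations disagree on the storage format, independently of the random forest test.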
Steps/Code to Reproduce
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
from pandas.testing import assert_frame_equal
# This test serves as a minimal reproducer for the
# difference observed in the output of tree values between
# sklearn versions 1.3.2 and 1.4.2. It should pass
# when using sklearn==1.3.2 and fail when using sklearn==1.4.2.
# First, create a minimal dataset of values for training
# a Random Forest Classifier.
random_state = 123
rng = np.random.default_rng(random_state)
X = rng.integers(0, 2, size=(1000, 12))
y = np.asarray([1, 0] * 500)
y[-1] = 1
X[:, 0] = y
X[:, 1] = 1 - y
clfr = RandomForestClassifier(n_estimators=100, random_state=random_state)
clfr.fit(X, y)
# compute the feature importances of the fitted estimator so that their ranking can be checked
importances_out = clfr.feature_importances_
# these are the importance values that are expected from ``sklearn==1.3.2``
importances_exp = np.array(
    [
        0.52090464,
        0.46263368,
        0.00115268,
        0.00179985,
        0.00177495,
        0.00169134,
        0.00157653,
        0.00135364,
        0.00175814,
        0.00169148,
        0.00162767,
        0.00203539,
    ]
)
importances_out = pd.DataFrame(importances_out).sort_values(by=0)
importances_exp = pd.DataFrame(importances_exp).sort_values(by=0)
assert_frame_equal(importances_out, importances_exp)
Expected Results
The expected values above, i.e. `importances_exp`, represent the ranked feature importances of a "cooked" dataset where the input to the model is an array of random values, except for two columns, which are perfectly (inversely) correlated with the target values `y`. As expected, the two highly correlated features show the highest importance and the random features show the lowest importance. The test checks that the ranking of the input features is correct by comparing the `DataFrame`s storing the sorted output values from `clfr.feature_importances_`.
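If the property that matters is only that the two informative features outrank the random ones, a looser check can express that directly. The sketch below is one possible formulation, not what the project test currently does; the helper `top_k_features` is a hypothetical name, and the values are the `scikit-learn==1.3.2` importances from the reproducer. It uses only the standard library so that it runs without any particular sklearn build.

```python
def top_k_features(importances, k):
    """Return the set of indices of the k largest importances."""
    order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return set(order[:k])


# Importances from scikit-learn==1.3.2, in feature order (copied from the reproducer).
importances_exp = [
    0.52090464, 0.46263368, 0.00115268, 0.00179985,
    0.00177495, 0.00169134, 0.00157653, 0.00135364,
    0.00175814, 0.00169148, 0.00162767, 0.00203539,
]

# Features 0 and 1 are the columns derived from y, so they should rank highest.
informative = top_k_features(importances_exp, 2)
print(informative)

# Every random feature should sit well below both informative ones.
largest_noise = max(importances_exp[2:])
smallest_signal = min(importances_exp[:2])
print(largest_noise < smallest_signal)
```

A check of this kind ignores the relative order of the ten random features, which carry nearly identical importances.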
Actual Results
The expected output above, which comes from `feature_importances_` when using `scikit-learn==1.3.2`, differs by some floating point values from the output when using `scikit-learn>=1.4.2`, i.e.:
array([[0.52087185],
[0.46270245],
[0.00203539],
[0.00179985],
[0.00178587],
[0.00177495],
[0.00169148],
[0.00169134],
[0.00159994],
[0.00155658],
[0.00135364],
[0.00113665]])
where the ranking of the values is changed by the discrepancy between floating point values of the lower ranked features:
DataFrame.index values are different (16.66667 %)
[left]: Index([2, 7, 6, 10, 5, 9, 4, 8, 3, 11, 1, 0], dtype='int64')
[right]: Index([2, 7, 6, 10, 5, 9, 8, 4, 3, 11, 1, 0], dtype='int64')
At positional index 6, first diff: 4 != 8
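To quantify the mismatch, the two sets of importances can be compared feature by feature. In the sketch below, `expected` is copied from the reproducer, while `observed` is an assumption-laden reconstruction: the sorted values printed above were put back into feature order by pairing them with the `[left]` index from the assertion message. The names `ranking` and `moved` are illustrative.

```python
import math

# scikit-learn==1.3.2 importances, in feature order.
expected = [
    0.52090464, 0.46263368, 0.00115268, 0.00179985,
    0.00177495, 0.00169134, 0.00157653, 0.00135364,
    0.00175814, 0.00169148, 0.00162767, 0.00203539,
]
# Newer-version importances, reconstructed into feature order (see lead-in).
observed = [
    0.52087185, 0.46270245, 0.00113665, 0.00179985,
    0.00177495, 0.00169134, 0.00155658, 0.00135364,
    0.00178587, 0.00169148, 0.00159994, 0.00203539,
]


def ranking(values):
    """Feature indices sorted by ascending importance."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Largest per-feature difference between the two versions.
max_abs_diff = max(abs(e - o) for e, o in zip(expected, observed))
print(f"max abs diff: {max_abs_diff:.3e}")

# Positions in the ascending ranking where the two versions disagree.
moved = [
    pos
    for pos, (a, b) in enumerate(zip(ranking(observed), ranking(expected)))
    if a != b
]
print("positions that differ:", moved)  # positions 6 and 7 (features 4 and 8 swap)

# The values agree within an absolute tolerance of 1e-4.
within_tol = all(math.isclose(e, o, abs_tol=1e-4) for e, o in zip(expected, observed))
print("within 1e-4:", within_tol)
```

Under this reconstruction, six of the twelve features have identical importances, the rest differ by a few 1e-5, and the only ranking change is the swap of features 4 and 8. With NumPy arrays, `np.testing.assert_allclose(observed, expected, atol=1e-4)` would express the same tolerance check.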
Versions
On `Mac-arm64` with `scikit-learn==1.3.2`:
System:
python: 3.11.7 (main, Dec 4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)]
executable: /Users/awitmer/venv_7/bin/python
machine: macOS-14.7.5-arm64-arm-64bit
Python dependencies:
sklearn: 1.3.2
pip: 23.3.1
setuptools: 69.0.2
numpy: 1.25.0
scipy: 1.14.0
Cython: None
pandas: 2.0.2
matplotlib: 3.7.1
joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
user_api: openmp
internal_api: openmp
prefix: libomp
filepath: /Users/awitmer/venv_7/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
version: None
num_threads: 12
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /Users/awitmer/venv_7/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
version: 0.3.23
threading_layer: pthreads
architecture: armv8
num_threads: 12
On `Mac-arm64` with `scikit-learn==1.5.2`:
System:
python: 3.11.7 (main, Dec 4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)]
executable: /Users/awitmer/venv_7/bin/python
machine: macOS-14.7.5-arm64-arm-64bit
Python dependencies:
sklearn: 1.5.2
pip: 23.3.1
setuptools: 69.0.2
numpy: 1.25.0
scipy: 1.14.0
Cython: None
pandas: 2.0.2
matplotlib: 3.7.1
joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /Users/awitmer/venv_7/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
version: 0.3.23
threading_layer: pthreads
architecture: armv8
num_threads: 12
user_api: openmp
internal_api: openmp
prefix: libomp
filepath: /Users/awitmer/venv_7/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
version: None
num_threads: 12
On linux based server with `scikit-learn==1.3.2`:
System:
python: 3.11.11 (main, Dec 9 2024, 15:32:27) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)]
executable: /usr/bin/python3.11
machine: Linux-4.18.0-553.47.1.el8_10.x86_64-x86_64-with-glibc2.28
Python dependencies:
sklearn: 1.3.2
pip: 25.0.1
setuptools: 65.5.1
numpy: 1.26.4
scipy: 1.15.1
Cython: None
pandas: 2.2.3
matplotlib: 3.9.0
joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
user_api: openmp
internal_api: openmp
prefix: libgomp
filepath: /vast/home/awitmer/.local/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
version: None
num_threads: 40
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /vast/home/awitmer/.local/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
version: 0.3.23.dev
threading_layer: pthreads
architecture: Haswell
num_threads: 40