Discrepancy between output of classifier feature_importances_ with different sklearn installations #31415

Closed
adamwitmer opened this issue May 22, 2025 · 3 comments
Labels
Bug Needs Triage Issue requires triage

Comments

@adamwitmer

Describe the bug

I am currently using the scikit-learn classifier feature_importances_ attribute in a project to rank the important features of my model, and my CI pipeline runs the project test suite against scikit-learn==1.3.2 and scikit-learn==1.5.2 on a remote Linux host. I am experiencing discrepancies in the output of the relevant test (for which I have provided a minimal reproducer below) across machines, installation types, and sklearn versions.

There are a few specific problems I am experiencing:

  1. Locally, the test passes using a binary installation of scikit-learn==1.3.2 and fails using scikit-learn==1.5.2. With the help of my team, we traced this back and found the earliest failing version to be 1.4.1.post1. We suspect the error originates from a change made in FIX force node values outside of [0, 1] range for monotonically constraints classification trees #27639, which switched tree_.value from storing absolute counts to storing proportions (a small sketch of this change is shown after the list below), but we have not determined a root cause for the discrepancy.
  2. As mentioned in (1), when running the test suite locally on my Mac ARM64 machine the test fails as described; however, when running it on a remote Linux machine it passes with both sklearn versions.
  3. The test also fails when I build scikit-learn==1.3.2 from source instead of installing the binary distribution.
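
To illustrate the change referenced in (1): on a single fitted tree, scikit-learn 1.3.x stores absolute class counts in tree_.value, while >=1.4 stores per-class proportions. A minimal sketch, using a made-up toy dataset rather than the data from the reproducer:

from sklearn.tree import DecisionTreeClassifier
import numpy as np

rng_toy = np.random.default_rng(0)
X_toy = rng_toy.integers(0, 2, size=(100, 3)).astype(float)
y_toy = X_toy[:, 0].astype(int)

toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)
# On 1.3.x each node row holds absolute class counts, e.g. [[52., 48.]];
# on >=1.4 the same rows hold proportions, e.g. [[0.52, 0.48]].
print(toy_tree.tree_.value)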

My main question is: what could be the cause of these observed discrepancies between sklearn versions, installation types, and environments, and which output is most "correct"?

Steps/Code to Reproduce

from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
from pandas.testing import assert_frame_equal


# This test serves as a minimal reproducer for the difference observed
# in the output of tree values between sklearn versions 1.3.2 and 1.4.2.
# It should pass when using sklearn==1.3.2 and fail when using sklearn>=1.4.2.

# first create a minimal dataset of values for training
# a Random Forest Classifier
random_state = 123
rng = np.random.default_rng(random_state)
X = rng.integers(0, 2, size=(1000, 12))
y = np.asarray([1, 0] * 500)
y[-1] = 1

X[:, 0] = y
X[:, 1] = 1 - y
clfr = RandomForestClassifier(n_estimators=100, random_state=random_state)
clfr.fit(X, y)

# find the importances of the estimator and check the ranking of the importances
importances_out = clfr.feature_importances_
# these are the importance values that are expected from ``sklearn==1.3.2``
importances_exp = np.array(
    [
        0.52090464,
        0.46263368,
        0.00115268,
        0.00179985,
        0.00177495,
        0.00169134,
        0.00157653,
        0.00135364,
        0.00175814,
        0.00169148,
        0.00162767,
        0.00203539,
    ]
)
importances_out = pd.DataFrame(importances_out).sort_values(by=0)
importances_exp = pd.DataFrame(importances_exp).sort_values(by=0)

assert_frame_equal(importances_out, importances_exp)

Expected Results

The expected values above, i.e. importances_exp, represent the ranked feature importances of a "cooked" dataset where the input to the model is an array of random values, except for two columns, which are perfectly (inversely) correlated with the target values y. As expected, the two highly correlated features show the highest importance and the random features show the lowest. The test checks that the ranking of the input features is correct by comparing the DataFrames storing the sorted output values from clfr.feature_importances_.
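
Note that assert_frame_equal on the sorted DataFrames compares both the index (i.e. the feature ranking after sort_values) and the float values (with pandas' default tolerances), so a swap between two nearly-equal noise features is enough to fail the test even when every individual value is within tolerance. A tiny sketch of that behaviour, with made-up numbers:

import pandas as pd
from pandas.testing import assert_frame_equal

a = pd.DataFrame([0.00155, 0.00160]).sort_values(by=0)  # ranking: 0 then 1
b = pd.DataFrame([0.00160, 0.00155]).sort_values(by=0)  # ranking: 1 then 0
try:
    assert_frame_equal(a, b)
except AssertionError as err:
    print(err)  # reports "DataFrame.index values are different", as in the failure below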

Actual Results

The expected output above, which comes from feature_importances_ when using scikit-learn==1.3.2, differs by small floating point amounts from the output when using scikit-learn>=1.4.2, i.e.:

array([[0.52087185],
       [0.46270245],
       [0.00203539],
       [0.00179985],
       [0.00178587],
       [0.00177495],
       [0.00169148],
       [0.00169134],
       [0.00159994],
       [0.00155658],
       [0.00135364],
       [0.00113665]])

where the ranking changes because of small floating point discrepancies between the lower-ranked features:

DataFrame.index values are different (16.66667 %)
[left]:  Index([2, 7, 6, 10, 5, 9, 4, 8, 3, 11, 1, 0], dtype='int64')
[right]: Index([2, 7, 6, 10, 5, 9, 8, 4, 3, 11, 1, 0], dtype='int64')
At positional index 6, first diff: 4 != 8

Versions

On `Mac-arm64` with `scikit-learn==1.3.2`:

System:
    python: 3.11.7 (main, Dec  4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)]
executable: /Users/awitmer/venv_7/bin/python
   machine: macOS-14.7.5-arm64-arm-64bit

Python dependencies:
      sklearn: 1.3.2
          pip: 23.3.1
   setuptools: 69.0.2
        numpy: 1.25.0
        scipy: 1.14.0
       Cython: None
       pandas: 2.0.2
   matplotlib: 3.7.1
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: /Users/awitmer/venv_7/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None
    num_threads: 12

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/awitmer/venv_7/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23
threading_layer: pthreads
   architecture: armv8
    num_threads: 12

On `Mac-arm64` with `scikit-learn==1.5.2`:

System:
    python: 3.11.7 (main, Dec  4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)]
executable: /Users/awitmer/venv_7/bin/python
   machine: macOS-14.7.5-arm64-arm-64bit

Python dependencies:
      sklearn: 1.5.2
          pip: 23.3.1
   setuptools: 69.0.2
        numpy: 1.25.0
        scipy: 1.14.0
       Cython: None
       pandas: 2.0.2
   matplotlib: 3.7.1
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/awitmer/venv_7/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23
threading_layer: pthreads
   architecture: armv8
    num_threads: 12

       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: /Users/awitmer/venv_7/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None
    num_threads: 12

On linux based server with `scikit-learn==1.3.2`:

System:
    python: 3.11.11 (main, Dec  9 2024, 15:32:27) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)]
executable: /usr/bin/python3.11
   machine: Linux-4.18.0-553.47.1.el8_10.x86_64-x86_64-with-glibc2.28

Python dependencies:
      sklearn: 1.3.2
          pip: 25.0.1
   setuptools: 65.5.1
        numpy: 1.26.4
        scipy: 1.15.1
       Cython: None
       pandas: 2.2.3
   matplotlib: 3.9.0
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /vast/home/awitmer/.local/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 40

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /vast/home/awitmer/.local/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Haswell
    num_threads: 40
@adamwitmer adamwitmer added Bug Needs Triage Issue requires triage labels May 22, 2025
@adamwitmer adamwitmer changed the title Difference between output of classifier feature_importances_ between sklearn 1.3.2 installations Discrepancy between output of classifier feature_importances_ with different sklearn installations May 22, 2025
@tylerjereddy
Contributor

I don't know if sklearn can actually guarantee stability here--if I change the random_state of the estimator only (i.e., use the same input data, just change the estimator seed), the number of index mismatches for the importances is often quite high (almost all of the random data), whether I use Mac or Linux on the provided reproducer.

That somewhat matches the intuition that if the randomly-permuted feature data is all roughly equally unimportant for deciding the response, the tree algorithm doesn't really have a sane way to distinguish their relative importances consistently.
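
A quick sketch of that instability, reusing the data-generation code from the reproducer and only changing the estimator seed between fits:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(123)
X = rng.integers(0, 2, size=(1000, 12))
y = np.asarray([1, 0] * 500)
y[-1] = 1
X[:, 0] = y
X[:, 1] = 1 - y

for seed in (0, 1, 2):
    clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    ranking = np.argsort(clf.feature_importances_)[::-1]
    # features 0 and 1 stay at the top; the order of the ten noise features reshuffles
    print(seed, ranking)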

It may be nice to have a way to bisect this as we discussed--that is currently being hindered by the source builds differing from the binaries.

My immediate intuition is that this could mean that the in-house test may be trying to enforce something that isn't formally guaranteed to be stable in practice (the feature rankings of very noisy/random features).
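
If that is the case, one possible (unofficial) adjustment would be to assert only what is expected to be stable: that the two informative features lead the ranking, and that the values agree within an explicit tolerance. A sketch, assuming importances_out and importances_exp are the 1-D NumPy arrays from the reproducer, before they are wrapped in DataFrames:

import numpy as np

top2 = np.argsort(importances_out)[::-1][:2]
assert set(top2) == {0, 1}  # the two informative features lead the ranking
# compare values only, ignoring the ordering of the near-tied noise features
np.testing.assert_allclose(np.sort(importances_out), np.sort(importances_exp), atol=1e-3)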

Maybe @thomasjpfan or @virchan might have an opinion.

@virchan
Member

virchan commented Jun 4, 2025

I ran the Code-to-Reproduce using scikit-learn 1.3.2 and 1.5.2, with the following library versions fixed:

Python: 3.11.7
numpy: 1.25.0
scipy: 1.14.0
pandas: 2.0.2

The final line

assert_frame_equal(importances_out, importances_exp)

passed successfully in both cases. So, from what I can tell, the difference in scikit-learn versions doesn't seem to be the cause of the issue.

Let me know if you get a different result in this setup, @adamwitmer.

I haven't yet tested the following combination:

Python: 3.11.11
numpy: 1.26.4
scipy: 1.15.1
pandas: 2.2.3

@adrinjalali
Member

We certainly don't support comparing such values across versions.

And if you look at it, you can see the disparity comes from floating point differences. The actual differences in the feature importance values are minute.
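
For a rough sense of scale, comparing the sorted importance values reported above for the two versions (numbers copied from this issue):

import numpy as np

v132 = np.sort([0.52090464, 0.46263368, 0.00115268, 0.00179985, 0.00177495,
                0.00169134, 0.00157653, 0.00135364, 0.00175814, 0.00169148,
                0.00162767, 0.00203539])
v152 = np.sort([0.52087185, 0.46270245, 0.00203539, 0.00179985, 0.00178587,
                0.00177495, 0.00169148, 0.00169134, 0.00159994, 0.00155658,
                0.00135364, 0.00113665])
print(np.max(np.abs(v132 - v152)))  # below 1e-4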

As for which one is "more correct", we hope the latest release, of course. That's why things change: people report issues, or we find them ourselves, and we fix them. That very often results in different outputs if you compare versions.
