scikit-learn 1.6 changed behavior of growing trees #30554


Closed
sebp opened this issue Dec 28, 2024 · 3 comments · Fixed by #30557
Comments

sebp (Contributor) commented Dec 28, 2024

Describe the bug

While porting scikit-survival to support scikit-learn 1.6, I noticed that one test failed due to trees in a random forest having a different structure (see this GitHub Actions log).

Using git bisect, I could determine that #29458 is the culprit.

The PR imports log from libc.math:

from libc.math cimport isnan, log

Previously, log was imported from ._utils:

from ._utils cimport log

which actually implements log2:

cdef inline float64_t log(float64_t x) noexcept nogil:
return ln(x) / ln(2.0)
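The discrepancy can be illustrated with a standalone sketch using Python's standard math module (rather than the Cython code): libc.math's log is the natural logarithm, while the replaced ._utils helper computed ln(x) / ln(2.0), i.e. the base-2 logarithm, so switching imports scaled every value it produced by a factor of ln(2) ≈ 0.693.

```python
import math

x = 8.0
# libc.math's log is the natural logarithm:
print(math.log(x))                  # ≈ 2.0794
# the ._utils helper computed ln(x) / ln(2.0), i.e. log2:
print(math.log(x) / math.log(2.0))  # ≈ 3.0
print(math.log2(x))                 # 3.0
```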

Replacing

from libc.math cimport isnan, log

with

from libc.math cimport isnan, log2 as log

fixes the problem.
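The same aliasing pattern can be sketched in plain Python (a hypothetical illustration, not the actual Cython change): importing log2 under the name log restores base-2 semantics without touching any call site.

```python
# Alias log2 as log, mirroring the fix: existing call sites
# keep writing `log(x)` but get base-2 behavior back.
from math import log2 as log

print(log(1024.0))  # 10.0
```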

Steps/Code to Reproduce

from collections import namedtuple
import numpy as np
from sksurv.datasets import load_whas500
from sksurv.column import standardize, categorical_to_numeric
from sksurv.tree import SurvivalTree

from sklearn.tree import export_graphviz

DataSetWithNames = namedtuple("DataSetWithNames", ["x", "y", "names", "x_data_frame"])


def _make_whas500(with_mean=True, with_std=True, to_numeric=False):
    x, y = load_whas500()
    if with_mean:
        x = standardize(x, with_std=with_std)
    if to_numeric:
        x = categorical_to_numeric(x)
    names = ["(Intercept)"] + x.columns.tolist()
    return DataSetWithNames(x=x.values, y=y, names=names, x_data_frame=x)


whas500 = _make_whas500(to_numeric=True)

rng = np.random.RandomState(42)
mask = rng.binomial(n=1, p=0.15, size=whas500.x.shape)
mask = mask.astype(bool)
X = whas500.x.copy()
X[mask] = np.nan

X_train = X[:400]
y_train = whas500.y[:400]
weights = np.array([
    4,5,1,1,2,1,1,1,2,1,0,1,0,1,4,2,0,0,1,0,1,0,1,2,1,1,1,1,1,0,0,0,1,1,3,1,2,1,2,1,0,3,1,0,0,3,0,1,4,1,0,0,2,1,0,1,0,
    1,0,2,1,0,1,1,4,4,2,1,2,2,4,2,1,1,2,1,0,1,0,1,0,0,1,1,1,1,1,1,1,1,1,0,1,3,0,3,0,1,1,1,3,0,1,2,2,3,0,0,1,1,2,0,0,2,
    0,0,1,0,0,1,2,1,2,0,1,1,0,0,0,2,1,1,2,1,0,1,0,1,1,0,2,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,2,0,0,0,0,0,0,0,2,3,0,0,2,0,
    0,1,2,0,1,0,1,2,0,0,0,0,1,2,0,0,2,0,0,0,1,0,2,2,0,1,0,1,4,1,0,0,3,0,1,1,0,1,1,0,2,0,2,1,4,0,1,0,1,0,2,1,3,1,0,0,2,
    1,1,0,2,1,2,2,0,2,0,0,1,1,1,3,0,2,2,0,0,1,3,0,2,0,0,1,1,4,1,0,0,1,1,2,1,1,1,2,0,2,1,1,1,2,0,0,0,1,0,0,2,0,0,0,0,0,
    0,3,1,2,0,3,1,4,1,2,0,0,1,1,2,2,1,1,3,1,1,1,1,1,0,0,0,0,2,2,1,2,0,2,1,2,0,2,0,1,0,0,1,1,1,1,1,3,1,2,0,2,2,2,1,3,1,
    0,0,0,0,0,1,0,2,2,1,1,2,0,0,0,2,2,1,0,1,0,2,0,1,0,2,2,0,3,2,2,1,0,3,0,0,2,2,0,1,0,2,1,1,0,0,2,1,1,0,0,2,1,0,0,2,2,3
], dtype=float)

t = SurvivalTree(
    low_memory=True,
    max_depth=3,
    max_features='sqrt',
    max_leaf_nodes=None,
    min_samples_leaf=3,
    min_samples_split=6,
    min_weight_fraction_leaf=0.0,
    random_state=1608637542,
    splitter='best',
)
t.fit(X_train, y_train, weights)

export_graphviz(
    t, "tree.dot", label="none", impurity=False
)

Expected Results

(Image: tree structure exported with scikit-learn 1.5)

Actual Results

(Image: tree structure exported with scikit-learn 1.6)

Versions

System:
    python: 3.13.0 (main, Oct  7 2024, 23:47:22) [Clang 18.1.8 ]
executable: /…/.venv/bin/python
   machine: macOS-15.2-arm64-arm-64bit-Mach-O

Python dependencies:
      sklearn: 1.6.0
          pip: None
   setuptools: 75.6.0
        numpy: 2.2.1
        scipy: 1.14.1
       Cython: 3.0.11
       pandas: 2.2.3
   matplotlib: None
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /opt/homebrew/Cellar/libomp/19.1.6/lib/libomp.dylib
        version: None
thomasjpfan (Member) commented

Thank you for the report! I agree this is a bug & opened #30557 to fix it.

@thomasjpfan thomasjpfan added module:tree and removed Needs Triage Issue requires triage labels Dec 30, 2024
lesteve (Member) commented Jan 4, 2025

@sebp thanks a lot for the reproducer and the diagnosis! As a maintainer, I find high-quality reports like yours extremely useful 🙏!

Something I would suggest, if you think it makes sense for your project: test against scikit-learn development wheels, so you can give feedback closer to when a PR is merged rather than after we release.

If you think it is not worth it for your project, that is completely fine, of course! For example, if this is the first time in five years that a new scikit-learn release has broken scikit-survival, then it is probably not worth it.

If you think this is worth it, I would be keen to help a bit on the CI set-up side, so let me know! This would be an opportunity for me to see how much work it is and get a better feel for the trade-offs involved.

The basic idea is to install the scikit-learn development wheel (built daily from main) with a command like this:

pip install --pre --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple scikit-learn

see https://scikit-learn.org/dev/developers/advanced_installation.html#installing-nightly-builds. There are a number of other projects available as well, in case you feel more adventurous and want to test other dependencies (numpy, scipy, matplotlib, etc.); see https://anaconda.org/scientific-python-nightly-wheels/repo for the full list.
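As a sketch, a minimal nightly-wheel CI step might look like the following (a hypothetical config fragment; the pytest invocation and tests/ path are placeholders for the downstream project's own suite):

```shell
# Install the nightly scikit-learn wheel on top of the project's environment.
pip install --pre \
    --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple \
    scikit-learn

# Record which development version was picked up, then run the test suite
# against it.
python -c "import sklearn; print(sklearn.__version__)"
pytest tests/  # placeholder: the downstream project's own tests
```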

Why would you want to do this? (optional reading 😅)

We have been doing something similar in scikit-learn for numpy and scipy for some time, and I find this extremely useful for all the projects involved.

Advantages for downstream (scikit-learn in our case, scikit-survival in your case)

  • this does not break too often, so it is not such a pain for downstream to maintain after the initial CI set-up cost.
  • figuring out which PR may be the culprit is easier, since there are generally only a few commits between two nightly wheels

Advantages for upstream (numpy or scipy in our case, scikit-learn in your case)

  • getting feedback about unwanted side effects. It is easy to miss edge cases or have blind spots in our tests.
  • the feedback upstream gets arrives quite close to when the PR was merged, so PR contributors still have the details fresh in their minds and the fix tends to be less painful. Contrast this with getting the feedback a few months after you worked on something ...

Advantages for the longer-term health of the ecosystem

  • we learn about upstream and downstream projects and understand each other a bit more
  • this strengthens links and builds trust between the projects
  • we make the ecosystem more robust

Feel-good stories about this kind of cross-projects interactions (optional reading 😅)

Here are a few such interactions that I managed to find (mostly because they happened not so long ago) and that are among my personal feel-good open-source stories:

sebp added a commit to sebp/scikit-survival that referenced this issue Jan 11, 2025
sebp (Contributor, Author) commented Mar 15, 2025 via email
