Skip to content

Trees fitted in 1.3.2 produce different outcome when evaluated in 1.4 #28229

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mosalx opened this issue Jan 23, 2024 · 3 comments
Closed

Trees fitted in 1.3.2 produce different outcome when evaluated in 1.4 #28229

mosalx opened this issue Jan 23, 2024 · 3 comments
Labels

Comments

@mosalx
Copy link

mosalx commented Jan 23, 2024

Describe the bug

We have a number of ensemble tree models in production fitted using 1.3.2 or older. Many of these models produce different outcomes when evaluated on the same data in sklearn 1.3.2 and 1.4.

Analysis led me to the change in DecisionTreeClassifier.predict_proba() from this PR: #27639

Based on this conversation I understand that in order to support monotonicity constraint, probabilities were allowed to be outside of [0, 1] bounds as see in this diff:
image

Tree splitting criterion was modified to ensure that probabilities are within [0, 1] bounds. This works only if the user fits the tree in 1.4 and evaluates it in the same version. I was not able to find an example where a tree fitted in 1.4 produces probabilities outside of [0,1]. However, trees fitted in older versions do violate this constraint.

Would you consider rolling back probability normalization for trees that don't have monotonicity constraint so this method produces correctly normalized probabilities on trees fitted in prior versions? I don't think scikit-learn makes any guarantees about compatibility of estimators created in older versions. However, it is quite disruptive for anyone who maintains older tree models essentially forcing people to re-fit them if they switch to 1.4.

Note: I have not provided code to reproduce because it requires 2 different versions of sklearn.

Steps/Code to Reproduce

N/A

Expected Results

probabilities add up to 1

Actual Results

Probabilities do not add up to 1

Versions

System:
    python: 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:27:59) [MSC v.1937 64 bit (AMD64)]
executable: C:\Users\smozharov\AppData\Local\miniconda3\envs\env311_3a\python.exe
   machine: Windows-10-10.0.22631-SP0

Python dependencies:
      sklearn: 1.4.0
          pip: 23.3.2
   setuptools: 69.0.3
        numpy: 1.26.3
        scipy: 1.12.0
       Cython: 3.0.7
       pandas: 2.2.0
   matplotlib: 3.8.2
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libomp
       filepath: C:\Users\smozharov\AppData\Local\miniconda3\envs\env311_3a\Library\bin\libomp.dll
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libblas
       filepath: C:\Users\smozharov\AppData\Local\miniconda3\envs\env311_3a\Library\bin\libblas.dll
        version: 0.3.25
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: vcomp
       filepath: C:\Users\smozharov\AppData\Local\miniconda3\envs\env311_3a\vcomp140.dll
@mosalx mosalx added Bug Needs Triage Issue requires triage labels Jan 23, 2024
@adrinjalali
Copy link
Member

That's expected. We only support loading models with the same scikit-learn version as they were persisted: https://scikit-learn.org/stable/model_persistence.html

@adrinjalali adrinjalali closed this as not planned Won't fix, can't repro, duplicate, stale Jan 23, 2024
@adrinjalali
Copy link
Member

We can reopen the issue if you provide a minimal reproducible with the latest version.

@mosalx
Copy link
Author

mosalx commented Jan 24, 2024

Thanks for the quick response! I looked deeper into the history of changes and understood the reasoning behind the change I suggested to revert. Cython Tree class now normalizes node values when the tree is fitted, so additional normalization in DecisionTreeClassifier is no longer necessary. I now think of this as a good change.
We were able to make trees fitted in previous versions compatible with sklearn 1.4 by normalizing tree values after deserializing them. I know it is not recommended or advised as described in the user guide: https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations. However, in our circumstances keeping an existing model unchanged for as long as possible (by maintaining custom serializers and extensive testing) is more productive than re-fitting it.

@ogrisel ogrisel removed the Needs Triage Issue requires triage label Apr 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants