We have a number of ensemble tree models in production fitted using 1.3.2 or older. Many of these models produce different outcomes when evaluated on the same data in sklearn 1.3.2 and 1.4.
Analysis led me to the change in DecisionTreeClassifier.predict_proba() from this PR: #27639
Based on this conversation I understand that, in order to support monotonicity constraints, probabilities were allowed to be outside of the [0, 1] bounds, as seen in the diff in that PR.
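Roughly, the renormalization that earlier versions applied at prediction time (and that the PR removed) looked like the following paraphrased sketch; this is not the actual diff:

```python
import numpy as np

# Paraphrase of the pre-1.4 behaviour of DecisionTreeClassifier.predict_proba:
# leaf values (per-class counts) were renormalized to probabilities at
# prediction time, so each row summed to 1.
def predict_proba_pre_1_4(leaf_values):
    proba = np.asarray(leaf_values, dtype=float)   # (n_samples, n_classes)
    normalizer = proba.sum(axis=1, keepdims=True)
    normalizer[normalizer == 0.0] = 1.0            # guard all-zero rows
    return proba / normalizer
```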
The tree splitting criterion was modified to ensure that probabilities stay within the [0, 1] bounds, but this only works if the user fits the tree in 1.4 and evaluates it in the same version. I was not able to find an example where a tree fitted in 1.4 produces probabilities outside of [0, 1]; however, trees fitted in older versions do violate this constraint.
Would you consider rolling back this change to probability normalization for trees that don't have monotonicity constraints, so that this method produces correctly normalized probabilities on trees fitted in prior versions? I don't think scikit-learn makes any guarantees about compatibility of estimators created in older versions, but it is quite disruptive for anyone who maintains older tree models, essentially forcing them to re-fit if they switch to 1.4.
Note: I have not provided code to reproduce because it requires 2 different versions of sklearn.
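For illustration, this is the kind of check that surfaces the symptom on a model deserialized from an older release; it is not a self-contained reproduction, since it assumes a pickle produced under <=1.3.2 and some held-out data (the file names are placeholders):

```python
import pickle

import numpy as np

# Placeholders: a model pickled under scikit-learn <= 1.3.2, loaded under 1.4,
# and whatever data that model normally scores.
with open("old_model.pkl", "rb") as f:
    model = pickle.load(f)
X_test = np.load("X_test.npy")

proba = model.predict_proba(X_test)
row_sums = proba.sum(axis=1)
print("min/max row sum:", row_sums.min(), row_sums.max())
print("rows sum to 1:", np.allclose(row_sums, 1.0))  # False for old trees under 1.4
```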
Thanks for the quick response! I looked deeper into the history of changes and now understand the reasoning behind the change I suggested reverting. The Cython Tree class now normalizes node values when the tree is fitted, so the additional normalization in DecisionTreeClassifier is no longer necessary. I now see this as a good change.
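A small sketch of the difference, assuming a 1.4 environment (under 1.3.x the same array holds weighted class counts rather than fractions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# tree_.value has shape (n_nodes, n_outputs, n_classes). Fitted under 1.4,
# each node's class values are already normalized, so predict_proba can
# return them without renormalizing.
print(np.allclose(clf.tree_.value.sum(axis=2), 1.0))  # expected: True on 1.4
```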
We were able to make trees fitted in previous versions compatible with sklearn 1.4 by normalizing the tree values after deserializing them. I know this is not recommended, as described in the user guide: https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations. However, in our circumstances, keeping an existing model unchanged for as long as possible (by maintaining custom serializers and extensive testing) is more productive than re-fitting it.
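A minimal sketch of that normalization step, assuming tree_.value exposes a writable view of the node values (the helper name is ours); if it does not, the same rescaling can be applied to the "values" array in the tree's __getstate__()/__setstate__() round trip:

```python
import numpy as np

def renormalize_node_values(estimator):
    """Rescale each node's class values to sum to 1 so that a tree fitted
    under an older release matches the layout 1.4's predict_proba expects."""
    values = estimator.tree_.value          # (n_nodes, n_outputs, n_classes)
    totals = values.sum(axis=2, keepdims=True)
    # In-place divide; nodes with zero total weight are left untouched.
    np.divide(values, totals, out=values, where=totals > 0)

# For an ensemble loaded from an old pickle, e.g. a RandomForestClassifier:
# for tree in old_forest.estimators_:
#     renormalize_node_values(tree)
```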