FIX force node values outside of [0, 1] range for monotonically constrained classification trees #27639


Conversation

Contributor

@samronsin samronsin commented Oct 23, 2023

Reference Issues/PRs

Follow up on #13649 to fix a bug where classification probability predictions are outside of the [0, 1] range.

What does this implement/fix? Explain your changes.

Currently on the main branch, probabilities predicted by all tree-based models supporting monotonic_cst can be (and often are) outside the expected [0, 1] range when a monotonic_cst is enforced.

from sklearn.datasets import make_classification

n_samples = 1000
X, y = make_classification(
    n_samples=n_samples,
    n_classes=2,
    n_features=5,
    n_informative=5,
    n_redundant=0,
    random_state=1234,
)

from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
clf = DecisionTreeClassifier(monotonic_cst=[-1,1,0,0,0])
clf.fit(X, y)
clf.predict_proba(X)
>>> array([[  0.        ,   1.        ],
       [ 10.        ,  -9.        ],
       [  0.        ,   1.        ],
       ...,
       [ 47.        , -46.        ],
       [  6.        ,  -5.        ],
       [  0.79027778,   0.20972222]])

😱 😱 😱 😱
Same with the random splitter:

clf = ExtraTreeClassifier(monotonic_cst=[-1,1,0,0,0], random_state=1234)
clf.fit(X, y)
clf.predict_proba(X)
>>> array([[   9.,   -8.],
       [ 191., -190.],
       [   9.,   -8.],
       ...,
       [ 191., -190.],
       [  55.,  -54.],
       [ 191., -190.]])

After the fix, all the probabilities lie between 0 and 1 😅.

Any other comments?

I need to investigate further, but the bug seems to be another occurrence (see #27630) of unexpected middle_values that are propagated down and end up producing unexpected node values through clipping.


@samronsin samronsin changed the title from "Fix node values outside of [0, 1] range" to "Fix node values outside of [0, 1] range for monotonically constrained classification trees" on Oct 23, 2023
Member

@jjerphan jjerphan left a comment

Thank you for reporting this bug and for working on a fix, @samronsin!

Comment on lines 589 to 592
if dest[0] < 0.:
    dest[0] = 0
elif dest[0] > 1.:
    dest[0] = 1.
Member

This forces the support of values to be in $[0, 1]$, but is this actually the appropriate implementation? Isn't the problem elsewhere in the implementation?

Member

I don't know. I was trying to understand what the lower and upper bounds stored in the tree builder stack (and then passed to this function) are used for, and how their values are computed, but I'm not sure.

The very first node that gets pushed to the stack uses -inf and +inf for the lower and upper bounds, which would suggest it is possible for the values to be outside the range of zero to one?

Contributor Author

You were right, @jjerphan, the first attempt was a quick and dirty fix. As suspected, there was something else going on... Now with d74b012 we're closer to a real solution, although we still have to understand the remaining floating point precision issues...

Member

I think clipping is a good hotfix for the support, and we may proceed with it first.

Member

But values are not supposed to be in [0, 1].

Contributor Author

But values are not supposed to be in [0, 1].

Exactly! I was not aware of this until your comment #27639 (comment), @glemaitre.

@glemaitre glemaitre self-requested a review October 30, 2023 13:17
@glemaitre glemaitre changed the title from "Fix node values outside of [0, 1] range for monotonically constrained classification trees" to "FIX Fix node values outside of [0, 1] range for monotonically constrained classification trees" on Oct 30, 2023
@glemaitre glemaitre changed the title from "FIX Fix node values outside of [0, 1] range for monotonically constrained classification trees" to "FIX force node values outside of [0, 1] range for monotonically constrained classification trees" on Oct 30, 2023
@glemaitre
Member

glemaitre commented Oct 30, 2023

From your minimal reproducer, I limited the depth to 1. I get the following tree:

[image: tree limited to a depth of 1]

I assume that we don't compute the values properly because we always have -n_class_0 + 1., while expecting something like:

[image: tree with the expected node values]

It has been a while since I last checked the tree code. I will try to have a look, but it seems that we push a wrong count statistic for class 1 from the root node.
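For reference (this snippet is not from the PR), the depth-limited tree from the reproducer above could be inspected along these lines, assuming X and y from the description are still in scope:

from sklearn.tree import DecisionTreeClassifier, export_text

# Mirror the analysis above: limit the depth to 1 and look at the raw node values.
clf = DecisionTreeClassifier(monotonic_cst=[-1, 1, 0, 0, 0], max_depth=1)
clf.fit(X, y)
print(export_text(clf))   # textual rendering of the depth-1 tree
print(clf.tree_.value)    # the node values discussed in this thread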

Comment on lines 594 to 595
# Class proportions for binary classification must sum to 1.
dest[1] = 1 - dest[0]
Member

OK, so the bug is here. We don't expect a sum of 1 but a weighted count.
So we should have something like sum_total - dest[0] instead.
I don't know if we have access to sum_total at this stage, though.

Member

Uhm actually this is not even sum_total because this is grouped by labels.

Member

So, with my limited knowledge of the wider implementation of the constraints, I would come up with the following code:

        cdef float64_t total_weighted_count = dest[0] + dest[1]

        if dest[0] < lower_bound:
            dest[0] = lower_bound
        elif dest[0] > upper_bound:
            dest[0] = upper_bound

        dest[1] = total_weighted_count - dest[0]

Contributor Author

Thanks a lot for looking into this @glemaitre !

We don't expect to have a sum at 1 but a weighted count

I was obviously not aware of this, so this is indeed a big mistake in the monotonic constraints implementation!

Uhm actually this is not even sum_total because this is grouped by labels.

Monotonic constraints are only supported with single label, so this should not be an issue.

Note that lower_bound and upper_bound are positive class frequencies, so we would need to modify your solution to:

        cdef float64_t total_weighted_count = dest[0] + dest[1]

        if dest[0] < lower_bound * total_weighted_count:
            dest[0] = lower_bound * total_weighted_count
        elif dest[0] > upper_bound * total_weighted_count:
            dest[0] = upper_bound * total_weighted_count

        dest[1] = total_weighted_count - dest[0]

Contributor Author

The patch almost works!

test_monotonic_constraints_classifications currently fails due to floating point arithmetic fluctuations. The tests pass if the predicted probabilities are cast to np.float32, so the overall logic is valid, which is good news.

I'm looking into the float64 precision matter.

Member

Nice. Now we only have to deal with the 25 CIs, some on 32-bit architectures ;)

If we have some casting in float32, we can probably play with the relative tolerance.

Contributor Author

@samronsin samronsin Nov 6, 2023

My understanding is that the numerical noise comes from the predictions on clipped leaves:

(lower_bound * total_weighted_count) / total_weighted_count

which does vary with total_weighted_count (in floating point arithmetic) close to lower_bound. So two leaves can have values (before clipping) violating the same constraint, say lower_bound, but with different total_weighted_count; their predictions then fluctuate a bit and sometimes break the monotonicity.
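As an illustration (not from the PR), a quick float64 experiment shows that this multiply-then-divide round-trip is not always exact; the bound value is the one used later in this thread:

import numpy as np

lower_bound = 0.8793444460978181      # a full-precision float64 bound
counts = np.arange(1.0, 100_000.0)    # candidate total_weighted_count values
roundtrip = (lower_bound * counts) / counts
print(np.count_nonzero(roundtrip != lower_bound), "counts do not round-trip exactly")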

I found a solution that works by losing precision on lower_bound (and upper_bound) before performing the scaling:

        cdef float64_t total_weighted_count = dest[0] + dest[1]
        cdef float64_t scaled_lower_bound = (<float32_t>lower_bound) * total_weighted_count
        cdef float64_t scaled_upper_bound = (<float32_t>upper_bound) * total_weighted_count

        if dest[0] < scaled_lower_bound:
            dest[0] = scaled_lower_bound
        elif dest[0] > scaled_upper_bound:
            dest[0] = scaled_upper_bound

        dest[1] = total_weighted_count - dest[0]

However, it does have limitations:

  • sample_weight values with a precision of float32 or more (float16 was tested and is OK)
  • n_samples too close to 2**32, i.e. larger than a few hundred million (not tested for n_samples beyond a few million though...)

Member

This is probably the best that we can do. I don't know if we should then have a check on the sample weight in this specific setting.

@glemaitre glemaitre added this to the 1.4 milestone Oct 30, 2023
@glemaitre
Member

I am adding the 1.4 milestone and blocker tag since we cannot release with this bug.

@glemaitre glemaitre self-requested a review November 9, 2023 10:29
Member

@glemaitre glemaitre left a comment

Uhm apparently I have some comments that I forgot to post.

samronsin and others added 4 commits November 14, 2023 17:04
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Member

@glemaitre glemaitre left a comment

LGTM on my side. Thanks @samronsin

Comment on lines 589 to 590
float64_t scaled_lower_bound = (<float32_t>lower_bound) * total_weighted_count
float64_t scaled_upper_bound = (<float32_t>upper_bound) * total_weighted_count
Member

Suggested change
- float64_t scaled_lower_bound = (<float32_t>lower_bound) * total_weighted_count
- float64_t scaled_upper_bound = (<float32_t>upper_bound) * total_weighted_count
+ float64_t scaled_lower_bound = <float32_t>(lower_bound * total_weighted_count)
+ float64_t scaled_upper_bound = <float32_t>(upper_bound * total_weighted_count)

To avoid early loss of precision.
Why do we need to cast to float32 at all?

Contributor Author

To avoid early loss of precision.
Why do we need to cast to float32 at all?

The precision loss for lower_bound and upper_bound (and not for their scaled counterparts) is by design.

The subtlety here is that, for leaves affected by clipping (say, constrained by lower_bound), the predicted probabilities (without casting) would be (lower_bound * total_weighted_count) / total_weighted_count, which varies with total_weighted_count close to lower_bound.
So if two leaves end up with the same desired prediction lower_bound but two different total_weighted_count values, their predictions might end up being (slightly) different and thus (slightly) break the monotonicity constraint.

By casting lower_bound to float32 we make ((<float32_t>lower_bound) * total_weighted_count) / total_weighted_count effectively constant for a pretty large set of total_weighted_count values, most interestingly integers < 2**30. The reason is that:

  • the multiplication by total_weighted_count in this subset of values can be done without rounding
  • the division by total_weighted_count (the same number) is exact as well, requiring no rounding and giving back the initial <float32_t>lower_bound.

Please let me know if any of this is not clear enough!
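As a quick sanity check (again not from the PR), the exact round-trip after the float32 cast can be verified for small integer counts:

import numpy as np

lb32 = np.float64(np.float32(0.8793444460978181))  # bound after the float32 round-trip
counts = np.arange(1.0, 2.0**20)                   # small integer weighted counts
# 24 significand bits (float32 bound) plus 20 bits (count) fit within float64's
# 53-bit significand, so the product is exact and dividing by the same count
# recovers lb32 exactly.
assert np.all((lb32 * counts) / counts == lb32)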

Member

Thanks for the explanation. It would be good to have a short summary of this as code comment.

Then, about the solution: by casting to float32, we lose 7 significant digits, while the loss of accuracy from a multiplication and a division is roughly 1 ulp. I would therefore prefer to have something like:

float64_t scaled_lower_bound = nextafter(lower_bound, 1) * total_weighted_count
float64_t scaled_upper_bound = nextafter(upper_bound, 0) * total_weighted_count

Contributor Author

@samronsin samronsin Dec 1, 2023

Thanks for the suggestions. Adding an explanatory comment sounds like a good idea, not sure how short I can make it though...

Regarding the precision loss on lower_bound and upper_bound, we want the sequence of multiplication and division by the same number to be exactly neutral, which requires more loss of precision than a single bit.
As you can see

from math import nextafter

def check_nextafter(a, n): return (nextafter(a, 1) * n) / n == nextafter(a, 1)
check_nextafter(0.8793444460978181, 700)
>>> False

whereas

import numpy as np
def check_castfloat32(a, n): return (np.float32(a) * n) / n == np.float32(a)

requires n > 2**29 for returning False, which seems good enough.

Member

That's not a fair comparison and not the one we are interested in. We want scaled_lower_bound / total_weighted_count >= lower_bound, but as close as possible (that's where float32 sucks; in the end, it might not influence the estimator in a noticeable way).

from math import nextafter
import numpy as np

def check_nextafter(a, n):
     """a between 0 and 1, n any number"""
     return (nextafter(a, 1) * n) / n >= a

sum([not check_nextafter(0.8793444460978181, n) for n in range(1, 1000_000)])  # 0
sum([not check_nextafter(0.8793444460978181, n) for n in np.logspace(0, 100,  1_000_000)])  # 0

Member

Which assert in which test fails?
I don't know of a "multi_nextafter", but we could just use lower_bound * (1 + x * np.finfo(float).eps) and lower_bound * (1 - x * np.finfo(float).epsneg) for some x > 1.

Member

for some x > 1

If this is for the steps, the nextafter from Python has a steps parameter.

Contributor Author

Which assert in which test does fail?

In test_monotonic_constraints_classifications:

assert np.all(est.predict_proba(X_test_0incr)[:, 1] >= proba_test[:, 1])

Contributor Author

the nextafter from Python has a steps parameter.

Indeed, this would be useful, although it was only added in Python 3.12...

Member

Arff my bad. I did not check the full documentation. Python 3.12 is quite ahead.

dest[0] = scaled_upper_bound

# Values for binary classification must sum to total_weighted_count.
dest[1] = total_weighted_count - dest[0]
Member

Suggested change
- dest[1] = total_weighted_count - dest[0]
+ dest[1] = max(0, total_weighted_count - dest[0])
Do we need to protect dest[1] to be non-negative?

Contributor Author

lower_bound and upper_bound are computed as positive class frequencies (or averages of positive class frequencies), so they should always be in the [0, 1] range, and thus dest[1] should always be non-negative.

If we decide to protect as you suggest though, note that we should make sure the invariant dest[0] + dest[1] = total_weighted_count holds.
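A hypothetical variant (not in the PR) that keeps both the non-negativity protection and the sum invariant would clamp dest[0] into [0, total_weighted_count] before deriving dest[1], sketched here in plain Python:

def protected_clip(dest0, dest1, scaled_lower_bound, scaled_upper_bound):
    # Hypothetical helper: clip the positive-class count to the scaled bounds,
    # guard it into [0, total], then derive the other class so the sum is preserved.
    total_weighted_count = dest0 + dest1
    dest0 = min(max(dest0, scaled_lower_bound), scaled_upper_bound)
    dest0 = min(max(dest0, 0.0), total_weighted_count)
    return dest0, total_weighted_count - dest0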


Comment on lines +83 to +86
assert np.logical_and(
    proba_test >= 0.0, proba_test <= 1.0
).all(), "Probability should always be in [0, 1] range."
assert_allclose(proba_test.sum(axis=1), 1.0)
Member

Do those tests fail without this PR? (They should!)

Contributor Author

    assert np.logical_and(
        proba_test >= 0.0, proba_test <= 1.0
    ).all(), "Probability should always be in [0, 1] range."

definitely fails.

    assert_allclose(proba_test.sum(axis=1), 1.0)

doesn't, though, as we made sure the sum is 1.

Member

@glemaitre glemaitre left a comment

I'll give another round of review next week. I need to get to the bottom of it.

@glemaitre glemaitre self-requested a review December 1, 2023 23:02
@lorentzenchr
Member

I played a little with this PR and I think the cleanest solution is to change the node values from counts to probabilities (as is done for regression trees and in HGBT):

In cdef class ClassificationCriterion(Criterion):, set

cdef void node_value(self, float64_t* dest) noexcept nogil:
    cdef intp_t k, c

    for k in range(self.n_outputs):
        for c in range(self.n_classes[k]):
            dest[c] = self.sum_total[k, c] / self.weighted_n_node_samples
        dest += self.max_n_classes

Then we don't need any special lower_bound/upper_bound scaling and can have almost the same clipping as for the regression criteria:

cdef void clip_node_value(
    self, float64_t* dest, float64_t lower_bound, float64_t upper_bound
) noexcept nogil:
    cdef:
        float64_t total_weighted_count = dest[0] + dest[1]

    if dest[0] < lower_bound:
        dest[0] = lower_bound
    elif dest[0] > upper_bound:
        dest[0] = upper_bound

    # Values for binary classification must sum to total_weighted_count.
    dest[1] = total_weighted_count - dest[0]

I did not benchmark, and yes, it might have a veeeery minor impact on tree fitting. We could, however, make predict_proba faster in a follow-up PR, as the normalization step then becomes obsolete.
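As an illustrative end-to-end check (assuming a scikit-learn build that includes this fix, e.g. 1.4+), the reproducer from the description should then satisfy both properties:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_classes=2, n_features=5,
                           n_informative=5, n_redundant=0, random_state=1234)
clf = DecisionTreeClassifier(monotonic_cst=[-1, 1, 0, 0, 0]).fit(X, y)
proba = clf.predict_proba(X)
assert np.all((proba >= 0.0) & (proba <= 1.0))   # probabilities back in [0, 1]
# With node values stored as probabilities, each value row sums to 1.
np.testing.assert_allclose(clf.tree_.value.sum(axis=-1), 1.0)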

@samronsin
Contributor Author

I think the cleanest solution is to change the node values from counts to probabilities

This is also my favorite solution. And it actually would require no change in clip_node_value at all, because I wrongly assumed node values were already probabilities in the main PR for monotonic constraints...

I'm actually surprised at how little the rest of the codebase is impacted by changing the decision tree values; the only tests I had to modify were in tree/tests/test_export.py.

@lorentzenchr
Member

lorentzenchr commented Dec 5, 2023

@samronsin Could you push those commits to make the node values probabilities instead of counts?

Fixing the monotonicity in trees is a release blocker for 1.4.

Member

@lorentzenchr lorentzenchr left a comment

LGTM
Only minor changes.

For the second reviewer: this seems like an innocent change, but it may have some yet undiscovered side effects. For instance, I don't know if the tree export/visualization functions need to be adapted, too.

samronsin and others added 2 commits December 6, 2023 09:10
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
tree.n_classes[0] == 1
and isinstance(node_val, Iterable)
and self.colors["bounds"] is not None
):
Member

This is indeed much cleaner.

Member

@glemaitre glemaitre left a comment

Those changes make everything much simpler. I'll take a deeper look at the export function later today, but for the fix on the tree it looks good.

@glemaitre glemaitre self-requested a review December 7, 2023 15:07
Member

@glemaitre glemaitre left a comment

LGTM.


@glemaitre glemaitre merged commit 8b95194 into scikit-learn:main Dec 7, 2023
@glemaitre
Member

Thanks @samronsin and @lorentzenchr for taking the time to address this and converge towards the best solution.

@samronsin
Contributor Author

Thank you @glemaitre and @lorentzenchr !

@jjerphan
Member

jjerphan commented Dec 8, 2023

Thank you for fixing this, @glemaitre, @samronsin and @lorentzenchr.
