FIX force node values outside of [0, 1] range for monotonically constrained classification trees #27639


Conversation

Contributor

@samronsin samronsin commented Oct 23, 2023

Reference Issues/PRs

Follow up on #13649 to fix a bug where classification probability predictions are outside of the [0, 1] range.

What does this implement/fix? Explain your changes.

Currently on the main branch, probabilities predicted by all tree-based models supporting monotonic_cst can be (and often are) outside the expected [0, 1] range when a monotonic_cst is enforced.

from sklearn.datasets import make_classification

n_samples = 1000
X, y = make_classification(
    n_samples=n_samples,
    n_classes=2,
    n_features=5,
    n_informative=5,
    n_redundant=0,
    random_state=1234,
)

from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
clf = DecisionTreeClassifier(monotonic_cst=[-1,1,0,0,0])
clf.fit(X, y)
clf.predict_proba(X)
>>> array([[  0.        ,   1.        ],
       [ 10.        ,  -9.        ],
       [  0.        ,   1.        ],
       ...,
       [ 47.        , -46.        ],
       [  6.        ,  -5.        ],
       [  0.79027778,   0.20972222]])

😱 😱 😱 😱
Same with the random splitter:

clf = ExtraTreeClassifier(monotonic_cst=[-1,1,0,0,0], random_state=1234)
clf.fit(X, y)
clf.predict_proba(X)
>>> array([[   9.,   -8.],
       [ 191., -190.],
       [   9.,   -8.],
       ...,
       [ 191., -190.],
       [  55.,  -54.],
       [ 191., -190.]])

After the fix, all the probabilities lie between 0 and 1 😅.

Any other comments?

I need to investigate further, but the bug seems to be another occurrence (see #27630) of unexpected middle_values that are propagated down and end up producing unexpected node values through clipping.


@samronsin samronsin changed the title from "Fix node values outside of [0, 1] range" to "Fix node values outside of [0, 1] range for monotonically constrained classification trees" on Oct 23, 2023
Member

@jjerphan jjerphan left a comment

Thank you for reporting this bug and for working on a fix, @samronsin!

Comment on lines 589 to 592
if dest[0] < 0.:
    dest[0] = 0
elif dest[0] > 1.:
    dest[0] = 1.
Member

This forces the support of values to be in $[0, 1]$, but is this actually the appropriate implementation? Isn't the problem elsewhere in the implementation?

Member

I don't know. I was trying to understand what the lower and upper bounds stored in the tree builder stack (and then passed to this function) are used for, and how their values are computed, but I'm not sure.

The very first node that gets pushed to the stack uses -inf and +inf for the lower and upper bounds, which would suggest it is possible for the values to be outside the range of zero to one?

Contributor Author

You were right, @jjerphan, the first attempt was a quick and dirty fix. As suspected, there was something else going on... Now with d74b012 we're closer to a real solution, although we still have to understand the remaining floating point precision issues...

Member

I think clipping is a good hotfix for the support, and we may proceed with it first.

Member

But values are not supposed to be in [0, 1].

Contributor Author

But values are not supposed to be in [0, 1].

Exactly! I was not aware of this until your comment #27639 (comment), @glemaitre.

@glemaitre glemaitre self-requested a review October 30, 2023 13:17
@glemaitre glemaitre changed the title from "Fix node values outside of [0, 1] range for monotonically constrained classification trees" to "FIX Fix node values outside of [0, 1] range for monotonically constrained classification trees" on Oct 30, 2023
@glemaitre glemaitre changed the title from "FIX Fix node values outside of [0, 1] range for monotonically constrained classification trees" to "FIX force node values outside of [0, 1] range for monotonically constrained classification trees" on Oct 30, 2023
@glemaitre
Member

glemaitre commented Oct 30, 2023

From your minimal reproducer, I limited the depth to 1. I get the following tree:

[image: tree limited to a depth of 1]

I assume that we don't compute the values properly because we always have -n_class_0 + 1., while expecting something like:

[image: tree with the expected node values]

It has been a while since I last checked the tree code. I will try to have a look, but it seems that we push a wrong count statistic for class 1 from the root node.
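For reference (this snippet is not from the PR), the depth-limited tree from the reproducer above could be inspected along these lines, assuming X and y from the description are still in scope:

from sklearn.tree import DecisionTreeClassifier, export_text

# Mirror the analysis above: limit the depth to 1 and look at the raw node values.
clf = DecisionTreeClassifier(monotonic_cst=[-1, 1, 0, 0, 0], max_depth=1)
clf.fit(X, y)
print(export_text(clf))   # textual rendering of the depth-1 tree
print(clf.tree_.value)    # the node values discussed in this thread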

Comment on lines 594 to 595
# Class proportions for binary classification must sum to 1.
dest[1] = 1 - dest[0]
Member

OK, so the bug is here. We don't expect a sum of 1 but a weighted count.
So we should have something like sum_total - dest[0] instead.
I don't know if we have access to sum_total at this stage, though.

Member

Uhm actually this is not even sum_total because this is grouped by labels.

Member

So, with my limited knowledge of the wider implementation of the constraints, I would come up with the following code:

        cdef float64_t total_weighted_count = dest[0] + dest[1]

        if dest[0] < lower_bound:
            dest[0] = lower_bound
        elif dest[0] > upper_bound:
            dest[0] = upper_bound

        dest[1] = total_weighted_count - dest[0]

Contributor Author

Thanks a lot for looking into this @glemaitre !

We don't expect to have a sum at 1 but a weighted count

I was obviously not aware of this, so this is indeed a big mistake in the monotonic constraints implementation!

Uhm actually this is not even sum_total because this is grouped by labels.

Monotonic constraints are only supported with single label, so this should not be an issue.

Note that lower_bound and upper_bound are positive class frequencies, so we would need to modify your solution to:

        cdef float64_t total_weighted_count = dest[0] + dest[1]

        if dest[0] < lower_bound * total_weighted_count:
            dest[0] = lower_bound * total_weighted_count
        elif dest[0] > upper_bound * total_weighted_count:
            dest[0] = upper_bound * total_weighted_count

        dest[1] = total_weighted_count - dest[0]

Contributor Author

The patch almost works!

test_monotonic_constraints_classifications currently fails due to floating point arithmetic fluctuations. The tests pass if the predicted probabilities are cast to np.float32, so the overall logic is valid, which is good news.

I'm looking into the float64 precision matter.

Member

Nice. Now we only have to deal with the 25 CIs, some on 32-bit architectures ;)

If we have some casting in float32, we can probably play with the relative tolerance.

Contributor Author

@samronsin samronsin Nov 6, 2023

My understanding is that the numerical noise comes from the predictions on clipped leaves:

(lower_bound * total_weighted_count) / total_weighted_count

which does vary with total_weighted_count (in floating point arithmetic) close to lower_bound. So two leaves can have values (before clipping) violating the same constraint, say lower_bound, but with different total_weighted_count; their predictions then fluctuate a bit and sometimes break the monotonicity.
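As an illustration (not from the PR), a quick float64 experiment shows that this multiply-then-divide round-trip is not always exact; the bound value is the one used later in this thread:

import numpy as np

lower_bound = 0.8793444460978181      # a full-precision float64 bound
counts = np.arange(1.0, 100_000.0)    # candidate total_weighted_count values
roundtrip = (lower_bound * counts) / counts
print(np.count_nonzero(roundtrip != lower_bound), "counts do not round-trip exactly")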

I found a solution that works by losing precision on lower_bound (and upper_bound) before performing the scaling:

        cdef float64_t total_weighted_count = dest[0] + dest[1]
        cdef float64_t scaled_lower_bound = (<float32_t>lower_bound) * total_weighted_count
        cdef float64_t scaled_upper_bound = (<float32_t>upper_bound) * total_weighted_count

        if dest[0] < scaled_lower_bound:
            dest[0] = scaled_lower_bound
        elif dest[0] > scaled_upper_bound:
            dest[0] = scaled_upper_bound

        dest[1] = total_weighted_count - dest[0]

However, it does have limitations:

  • sample_weight values with a precision of float32 or more (float16 was tested and is OK)
  • n_samples too close to 2**32, i.e. larger than a few hundred million (not tested for n_samples beyond a few million though...)

Member

This is probably the best that we can do. I don't know if we should then have a check on the sample weight in this specific setting.

@glemaitre glemaitre added this to the 1.4 milestone Oct 30, 2023
@glemaitre
Member

I am adding the 1.4 milestone and blocker tag since we cannot release with this bug.

@glemaitre glemaitre self-requested a review November 9, 2023 10:29
Member

@glemaitre glemaitre left a comment

Uhm apparently I have some comments that I forgot to post.

samronsin and others added 4 commits November 14, 2023 17:04
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Member

@glemaitre glemaitre left a comment

LGTM on my side. Thanks @samronsin

Comment on lines 589 to 590
float64_t scaled_lower_bound = (<float32_t>lower_bound) * total_weighted_count
float64_t scaled_upper_bound = (<float32_t>upper_bound) * total_weighted_count
Member

Suggested change
- float64_t scaled_lower_bound = (<float32_t>lower_bound) * total_weighted_count
- float64_t scaled_upper_bound = (<float32_t>upper_bound) * total_weighted_count
+ float64_t scaled_lower_bound = <float32_t>(lower_bound * total_weighted_count)
+ float64_t scaled_upper_bound = <float32_t>(upper_bound * total_weighted_count)

To avoid early loss of precision.
Why do we need to cast to float32 at all?

Contributor Author

To avoid early loss of precision.
Why do we need to cast to float32 at all?

The precision loss for lower_bound and upper_bound (and not for their scaled counterparts) is by design.

The subtlety here is that, for leaves affected by clipping (say, constrained by lower_bound), the predicted probabilities (without casting) would be (lower_bound * total_weighted_count) / total_weighted_count, which varies with total_weighted_count close to lower_bound.
So if two leaves end up with the same desired prediction lower_bound but two different total_weighted_count values, their predictions might end up being (slightly) different and thus (slightly) break the monotonicity constraint.

By casting lower_bound to float32 we make ((<float32_t>lower_bound) * total_weighted_count) / total_weighted_count effectively constant for a pretty large set of total_weighted_count values, most interestingly integers < 2**30. The reason is that:

  • the multiplication by total_weighted_count in this subset of values can be done without rounding
  • the division by total_weighted_count (the same number) is exact as well, requiring no rounding and giving back the initial <float32_t>lower_bound.

Please let me know if any of this is not clear enough!
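As a quick sanity check (again not from the PR), the exact round-trip after the float32 cast can be verified for small integer counts:

import numpy as np

lb32 = np.float64(np.float32(0.8793444460978181))  # bound after the float32 round-trip
counts = np.arange(1.0, 2.0**20)                   # small integer weighted counts
# 24 significand bits (float32 bound) plus 20 bits (count) fit within float64's
# 53-bit significand, so the product is exact and dividing by the same count
# recovers lb32 exactly.
assert np.all((lb32 * counts) / counts == lb32)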

Member

Thanks for the explanation. It would be good to have a short summary of this as code comment.

Then, about the solution: by casting to float32, we lose 7 significant digits, while the loss of accuracy from a multiplication and a division is roughly 1 ulp. I would therefore prefer to have something like:

float64_t scaled_lower_bound = nextafter(lower_bound, 1) * total_weighted_count
float64_t scaled_upper_bound = nextafter(upper_bound, 0) * total_weighted_count

Contributor Author

@samronsin samronsin Dec 1, 2023

Thanks for the suggestions. Adding an explanatory comment sounds like a good idea, not sure how short I can make it though...

Regarding the precision loss on lower_bound and upper_bound, we want the sequence of multiplication and division by the same number to be exactly neutral, which requires more loss of precision than a single bit.
As you can see

from math import nextafter

def check_nextafter(a, n): return (nextafter(a, 1) * n) / n == nextafter(a, 1)
check_nextafter(0.8793444460978181, 700)
>>> False

whereas

import numpy as np
def check_castfloat32(a, n): return (np.float32(a) * n) / n == np.float32(a)

requires n > 2**29 for returning False, which seems good enough.

Member

That's not a fair comparison and not the one we are interested in. We want scaled_lower_bound / total_weighted_count >= lower_bound, but as close as possible (that's where float32 sucks; in the end, it might not influence the estimator in a noticeable way).

from math import nextafter
import numpy as np

def check_nextafter(a, n):
     """a between 0 and 1, n any number"""
     return (nextafter(a, 1) * n) / n >= a

sum([not check_nextafter(0.8793444460978181, n) for n in range(1, 1000_000)])  # 0
sum([not check_nextafter(0.8793444460978181, n) for n in np.logspace(0, 100,  1_000_000)])  # 0

Member

Which assert in which test fails?
I don't know of a "multi_nextafter", but we could just use lower_bound * (1 + x * np.finfo(float).eps) and lower_bound * (1 - x * np.finfo(float).epsneg) for some x > 1.

Member

for some x > 1

If this is for the steps, the nextafter from Python has a steps parameter.

Contributor Author

Which assert in which test does fail?

In test_monotonic_constraints_classifications:

assert np.all(est.predict_proba(X_test_0incr)[:, 1] >= proba_test[:, 1])

Contributor Author

the nextafter from Python has a steps parameter.

Indeed, this would be useful, although it was only added in Python 3.12...

Member

Arff my bad. I did not check the full documentation. Python 3.12 is quite ahead.

dest[0] = scaled_upper_bound

# Values for binary classification must sum to total_weighted_count.
dest[1] = total_weighted_count - dest[0]
Member

Suggested change
- dest[1] = total_weighted_count - dest[0]
+ dest[1] = max(0, total_weighted_count - dest[0])
Do we need to protect dest[1] to be non-negative?

Contributor Author

lower_bound and upper_bound are computed as positive class frequencies (or averages of positive class frequencies), so they should always be in the [0, 1] range, and thus dest[1] should always be non-negative.

If we decide to protect as you suggest though, note that we should make sure the invariant dest[0] + dest[1] = total_weighted_count holds.
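A hypothetical variant (not in the PR) that keeps both the non-negativity protection and the sum invariant would clamp dest[0] into [0, total_weighted_count] before deriving dest[1], sketched here in plain Python:

def protected_clip(dest0, dest1, scaled_lower_bound, scaled_upper_bound):
    # Hypothetical helper: clip the positive-class count to the scaled bounds,
    # guard it into [0, total], then derive the other class so the sum is preserved.
    total_weighted_count = dest0 + dest1
    dest0 = min(max(dest0, scaled_lower_bound), scaled_upper_bound)
    dest0 = min(max(dest0, 0.0), total_weighted_count)
    return dest0, total_weighted_count - dest0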


Comment on lines +83 to +86
assert np.logical_and(
    proba_test >= 0.0, proba_test <= 1.0
).all(), "Probability should always be in [0, 1] range."
assert_allclose(proba_test.sum(axis=1), 1.0)
Member

Do those tests fail without this PR? (They should!)

Contributor Author

    assert np.logical_and(
        proba_test >= 0.0, proba_test <= 1.0
    ).all(), "Probability should always be in [0, 1] range."

definitely fails.

    assert_allclose(proba_test.sum(axis=1), 1.0)

doesn't, though, as we made sure the sum is 1.

Member

@glemaitre glemaitre left a comment

I'll give another round of review next week. I need to get to the bottom of it.

@glemaitre glemaitre self-requested a review December 1, 2023 23:02
@lorentzenchr
Member

I played a little with this PR and I think the cleanest solution is to change the node values from counts to probabilities (as is done for regression trees and in HGBT):

In cdef class ClassificationCriterion(Criterion):, set

cdef void node_value(self, float64_t* dest) noexcept nogil:
    cdef intp_t k, c

    for k in range(self.n_outputs):
        for c in range(self.n_classes[k]):
            dest[c] = self.sum_total[k, c] / self.weighted_n_node_samples
        dest += self.max_n_classes

Then we don't need any special lower_bound/upper_bound scaling and can have almost the same clipping as for the regression criteria:

cdef void clip_node_value(
    self, float64_t* dest, float64_t lower_bound, float64_t upper_bound
) noexcept nogil:
    cdef:
        float64_t total_weighted_count = dest[0] + dest[1]

    if dest[0] < lower_bound:
        dest[0] = lower_bound
    elif dest[0] > upper_bound:
        dest[0] = upper_bound

    # Values for binary classification must sum to total_weighted_count.
    dest[1] = total_weighted_count - dest[0]

I did not benchmark, and yes, it might have a veeeery minor impact on tree fitting. We could, however, make predict_proba faster in a follow-up PR, as the normalization step then becomes obsolete.
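As an illustrative end-to-end check (assuming a scikit-learn build that includes this fix, e.g. 1.4+), the reproducer from the description should then satisfy both properties:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_classes=2, n_features=5,
                           n_informative=5, n_redundant=0, random_state=1234)
clf = DecisionTreeClassifier(monotonic_cst=[-1, 1, 0, 0, 0]).fit(X, y)
proba = clf.predict_proba(X)
assert np.all((proba >= 0.0) & (proba <= 1.0))   # probabilities back in [0, 1]
# With node values stored as probabilities, each value row sums to 1.
np.testing.assert_allclose(clf.tree_.value.sum(axis=-1), 1.0)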

@samronsin
Contributor Author

I think the cleanest solution is to change the node values from counts to probabilities

This is also my favorite solution. And it actually would require no change in clip_node_value at all, because I wrongly assumed node values were already probabilities in the main PR for monotonic constraints...

I'm actually surprised at how little the rest of the codebase is impacted by changing the decision tree values; the only tests I had to modify were in tree/tests/test_export.py.

@lorentzenchr
Member

lorentzenchr commented Dec 5, 2023

@samronsin Could you push those commits to make the node values probabilities instead of counts?

Fixing the monotonicity in trees is a release blocker for 1.4.

Member

@lorentzenchr lorentzenchr left a comment

LGTM
Only minor changes.

For the second reviewer: this seems like an innocent change, but it may have some yet undiscovered side effects. For instance, I don't know if the tree export/visualization functions need to be adapted, too.

samronsin and others added 2 commits December 6, 2023 09:10
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
tree.n_classes[0] == 1
and isinstance(node_val, Iterable)
and self.colors["bounds"] is not None
):
Member

This is indeed much cleaner.

Member

@glemaitre glemaitre left a comment

Those changes make everything much simpler. I'll take a deeper look at the export function later today, but for the fix on the tree it looks good.

@glemaitre glemaitre self-requested a review December 7, 2023 15:07
Member

@glemaitre glemaitre left a comment

LGTM.


@glemaitre glemaitre merged commit 8b95194 into scikit-learn:main Dec 7, 2023
@glemaitre
Member

Thanks @samronsin and @lorentzenchr for taking the time to address this and converge towards the best solution.

@samronsin
Contributor Author

Thank you @glemaitre and @lorentzenchr !

@jjerphan
Member

jjerphan commented Dec 8, 2023

Thank you for fixing this, @glemaitre, @samronsin and @lorentzenchr.
