FIX node values outside of [0, 1] range for monotonically constrained classification trees #27639
Conversation
Thank you for reporting this bug and for working on a fix, @samronsin!
sklearn/tree/_criterion.pyx
Outdated
if dest[0] < 0.:
    dest[0] = 0
elif dest[0] > 1.:
    dest[0] = 1.
This forces the support of values to be in [0, 1].
I don't know. I was trying to understand what the lower and upper bound stored in the tree builder stack (and then passed to this function) are used for and how their values are computed, but I am not sure.
The very first node that gets pushed to the stack uses -inf and +inf for the lower and upper bounds, which would suggest it is possible for the values to be outside the range of zero to one?
I think clipping is a good hotfix for the support, and we may proceed with it first.
But values are not supposed to be in [0, 1].
"But values are not supposed to be in [0, 1]."
Exactly! I was not aware of this until your comment #27639 (comment), @glemaitre.
From your minimal reproducer, I limited the depth to 1 and I get the following tree (visualization not captured here). I assume that we don't compute the values properly, because we always have … It has been a while since I last checked the code of the tree; I will try to have a look. But it seems that we push the wrong count statistic for class 1 from the root node.
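As an illustration (not from the PR), here is a minimal sketch of that kind of depth-1 reproducer; the data and constraint signs below are placeholders, not the original ones:
# Hedged sketch: fit a monotonically constrained classification tree with
# max_depth=1 and inspect the raw node values and predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

est = DecisionTreeClassifier(
    max_depth=1,
    monotonic_cst=[1, 0, 0, 0, 0],  # increasing constraint on feature 0 only
    random_state=0,
)
est.fit(X, y)

print(export_text(est))          # the depth-1 tree structure
print(est.tree_.value)           # raw per-node values, where the wrong statistics were spotted
print(est.predict_proba(X)[:5])  # predicted probabilities for a few samples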
sklearn/tree/_criterion.pyx
Outdated
# Class proportions for binary classification must sum to 1.
dest[1] = 1 - dest[0]
OK so the bug is here. We don't expect to have a sum at 1 but a weighted count.
So we should have something like sum_total - dest[0] instead. I don't know if we have access to sum_total at this stage though.
Uhm, actually this is not even sum_total, because this is grouped by labels.
So, with my limited knowledge of the wider implementation of the constraints, I would come up with the following code:
cdef float64_t total_weighted_count = dest[0] + dest[1]

if dest[0] < lower_bound:
    dest[0] = lower_bound
elif dest[0] > upper_bound:
    dest[0] = upper_bound

dest[1] = total_weighted_count - dest[0]
Thanks a lot for looking into this @glemaitre!
"We don't expect to have a sum at 1 but a weighted count"
I was not aware of this, obviously, so this is a big mistake in the monotonic constraints implementation indeed!
"Uhm actually this is not even sum_total because this is grouped by labels."
Monotonic constraints are only supported with a single label, so this should not be an issue.
Note that lower_bound and upper_bound are positive class frequencies, so we would need to modify your solution to:
cdef float64_t total_weighted_count = dest[0] + dest[1]

if dest[0] < lower_bound * total_weighted_count:
    dest[0] = lower_bound * total_weighted_count
elif dest[0] > upper_bound * total_weighted_count:
    dest[0] = upper_bound * total_weighted_count

dest[1] = total_weighted_count - dest[0]
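Spelled out in plain Python (an illustration with my own names, not the Cython code), this scaled clipping amounts to:
# Pure-Python sketch of the scaled clipping above: dest holds the two weighted
# class counts of a binary-classification leaf, and the bounds are class
# frequencies in [0, 1] that apply to dest[0].
def clip_leaf_counts(dest, lower_bound, upper_bound):
    total_weighted_count = dest[0] + dest[1]
    if dest[0] < lower_bound * total_weighted_count:
        dest[0] = lower_bound * total_weighted_count
    elif dest[0] > upper_bound * total_weighted_count:
        dest[0] = upper_bound * total_weighted_count
    # Counts for binary classification must still sum to total_weighted_count.
    dest[1] = total_weighted_count - dest[0]
    return dest

# Example: weighted counts [7.0, 3.0] with a frequency window of [0.2, 0.5]
# for dest[0] get clipped to [5.0, 5.0].
print(clip_leaf_counts([7.0, 3.0], 0.2, 0.5))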
The patch almost works! test_monotonic_constraints_classifications currently fails due to floating point arithmetic fluctuations. The tests pass if the predicted probabilities are cast to np.float32, so the overall logic is valid, which is good news.
I'm looking into the float64 precision matter.
Nice. Now we only have to deal with the 25 CIs, some on 32-bit archs ;)
If we have some casting in float32, we can probably play with the relative tolerance.
My understanding is that the numerical noise comes from the predictions on clipped leaves: (lower_bound * total_weighted_count) / total_weighted_count, which does vary with total_weighted_count (in floating point arithmetic) close to lower_bound. So two leaves can have values (before clipping) violating the same constraint, say lower_bound, but with different total_weighted_count, and their predictions fluctuate a bit and sometimes break the monotonicity.
I found a solution that works by losing precision on lower_bound (and upper_bound) before performing the scaling:
cdef float64_t total_weighted_count = dest[0] + dest[1]
cdef float64_t scaled_lower_bound = (<float32_t>lower_bound) * total_weighted_count
cdef float64_t scaled_upper_bound = (<float32_t>upper_bound) * total_weighted_count

if dest[0] < scaled_lower_bound:
    dest[0] = scaled_lower_bound
elif dest[0] > scaled_upper_bound:
    dest[0] = scaled_upper_bound

dest[1] = total_weighted_count - dest[0]
However, it does have limitations:
- sample_weights that have a precision of float32 or more (tested float16 and this is OK)
- n_samples too close to 2**32, so larger than a few hundred million (not tested for n_samples beyond a few millions though...)
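To make this concrete, here is a small standalone check (an illustration, not code from the PR): in float64, the multiply-then-divide round trip is not guaranteed to be exact, whereas a bound rounded to float32 first survives it for the moderate integer totals tried below.
# Illustration: count how many integer totals round-trip exactly through
# (bound * n) / n, for the raw float64 bound versus a float32-rounded bound.
import numpy as np

a = 0.8793444460978181  # the bound value discussed later in this thread

exact_float64 = sum((a * n) / n == a for n in range(1, 10_000))
a32 = float(np.float32(a))  # bound rounded to float32, stored back in float64
exact_float32 = sum((a32 * n) / n == a32 for n in range(1, 10_000))

print(exact_float64, "of 9999 integer totals round-trip exactly with the raw bound")
print(exact_float32, "of 9999 integer totals round-trip exactly with the float32-rounded bound")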
This is probably the best that we can do. I don't know if we should have a check on the sample weight then in this specific setting.
I am adding the 1.4 milestone and blocker tag since we cannot release with this bug.
Uhm apparently I have some comments that I forgot to post.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
LGTM on my side. Thanks @samronsin
sklearn/tree/_criterion.pyx
Outdated
float64_t scaled_lower_bound = (<float32_t>lower_bound) * total_weighted_count
float64_t scaled_upper_bound = (<float32_t>upper_bound) * total_weighted_count
- float64_t scaled_lower_bound = (<float32_t>lower_bound) * total_weighted_count
- float64_t scaled_upper_bound = (<float32_t>upper_bound) * total_weighted_count
+ float64_t scaled_lower_bound = <float32_t>(lower_bound * total_weighted_count)
+ float64_t scaled_upper_bound = <float32_t>(upper_bound * total_weighted_count)
To avoid early loss of precision.
Why do we need to cast to float32 at all?
"To avoid early loss of precision."
"Why do we need to cast to float32 at all?"
The precision loss for lower_bound and upper_bound (and not for their scaled counterparts) is by design.
The subtlety here is that for leaves affected by clipping (say constrained by lower_bound), the predicted probabilities (without casting) would be (lower_bound * total_weighted_count) / total_weighted_count, which varies with total_weighted_count close to lower_bound.
So if two leaves end up having the same lower_bound desired prediction, but two different total_weighted_count, their predictions might end up being (slightly) different and then (slightly) breaking the monotonicity constraint.
By casting lower_bound to float32 we make ((<float32_t>lower_bound) * total_weighted_count) / total_weighted_count effectively constant for a pretty large set of total_weighted_count values, most interestingly integers < 2**30. The reason is that:
- the multiplication by total_weighted_count in this subset of values can be done without rounding
- the division by total_weighted_count (the same number) will then be exact, requiring no rounding either, and equal to the initial <float32_t>lower_bound.
Please let me know if any of this is not clear enough!
Thanks for the explanation. It would be good to have a short summary of this as a code comment.
Then, about the solution: by casting to float32, we lose 7 significant digits, while the loss of accuracy from a multiplication and a division is roughly 1 ulp. I would therefore prefer to have something like
float64_t scaled_lower_bound = nextafter(lower_bound, 1) * total_weighted_count
float64_t scaled_upper_bound = nextafter(upper_bound, 0) * total_weighted_count
Thanks for the suggestions. Adding an explanatory comment sounds like a good idea, not sure how short I can make it though...
Regarding the precision loss on lower_bound and upper_bound, we want the sequence of multiplication and division by the same number to be exactly neutral, which requires more loss of precision than a single bit.
As you can see:
from math import nextafter
import numpy as np

def check_nextafter(a, n):
    return (nextafter(a, 1) * n) / n == nextafter(a, 1)

check_nextafter(0.8793444460978181, 700)
>>> False
whereas
def check_castfloat32(a, n):
    return (np.float32(a) * n) / n == np.float32(a)
requires n > 2**29 for returning False, which seems good enough.
That's not a fair comparison, and not the one we are interested in. We want scaled_lower_bound / total_weighted_count >= lower_bound, but as close as possible (that's where float32 sucks; in the end, it might not influence the estimator in a noticeable way).
from math import nextafter
import numpy as np

def check_nextafter(a, n):
    """a between 0 and 1, n any number"""
    return (nextafter(a, 1) * n) / n >= a

sum([not check_nextafter(0.8793444460978181, n) for n in range(1, 1_000_000)])  # 0
sum([not check_nextafter(0.8793444460978181, n) for n in np.logspace(0, 100, 1_000_000)])  # 0
Which assert in which test fails?
I don't know of a "multi_nextafter", but we could just use lower_bound * (1 + x * np.finfo(float).eps) and lower_bound * (1 - x * np.finfo(float).epsneg) for some x > 1.
"for some x > 1"
If this is for the steps, the nextafter from Python has a steps parameter.
"Which assert in which test fails?"
In test_monotonic_constraints_classifications:
assert np.all(est.predict_proba(X_test_0incr)[:, 1] >= proba_test[:, 1])
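For context, here is a self-contained sketch of that kind of monotonicity check (dataset, split and increment are illustrative; presumably X_test_0incr in the actual test is X_test with the constrained feature increased):
# Hedged sketch: bump the monotonically increasing feature and verify the
# predicted positive-class probability never decreases.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train = X[:400], X[400:], y[:400]

est = DecisionTreeClassifier(monotonic_cst=[1, 0, 0, 0, 0], random_state=0)
est.fit(X_train, y_train)

proba_test = est.predict_proba(X_test)
X_test_0incr = X_test.copy()
X_test_0incr[:, 0] += 0.1  # increase only the constrained feature

# With a correct implementation this must hold; with the bug it could fail.
assert np.all(est.predict_proba(X_test_0incr)[:, 1] >= proba_test[:, 1])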
"the nextafter from Python has a steps parameter."
Indeed, this would be useful, although it was only added in Python 3.12...
Arff my bad. I did not check the full documentation. Python 3.12 is quite ahead.
sklearn/tree/_criterion.pyx
Outdated
    dest[0] = scaled_upper_bound

# Values for binary classification must sum to total_weighted_count.
dest[1] = total_weighted_count - dest[0]
- dest[1] = total_weighted_count - dest[0]
+ dest[1] = max(0, total_weighted_count - dest[0])
Do we need to protect dest[1] to be non-negative?
lower_bound and upper_bound are computed as positive class frequencies (or averages of positive class frequencies), so they should always be in the [0, 1] range, and thus dest[1] should always be non-negative.
If we decide to protect as you suggest though, note that we should make sure the invariant dest[0] + dest[1] = total_weighted_count holds.
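If such a guard were added, one way to keep both properties at once is sketched below in plain Python (my own names, not the PR code): clamp dest[0] into the feasible range before deriving dest[1].
# Sketch: guard dest[0] so that non-negativity of both entries and the
# invariant dest[0] + dest[1] == total_weighted_count hold by construction.
def clip_with_guard(dest, scaled_lower_bound, scaled_upper_bound):
    total_weighted_count = dest[0] + dest[1]
    if dest[0] < scaled_lower_bound:
        dest[0] = scaled_lower_bound
    elif dest[0] > scaled_upper_bound:
        dest[0] = scaled_upper_bound
    # Clamp into [0, total_weighted_count] before deriving dest[1], so dest[1]
    # can never go negative and the sum is preserved.
    dest[0] = min(max(dest[0], 0.0), total_weighted_count)
    dest[1] = total_weighted_count - dest[0]
    return dest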
assert np.logical_and(
    proba_test >= 0.0, proba_test <= 1.0
).all(), "Probability should always be in [0, 1] range."
assert_allclose(proba_test.sum(axis=1), 1.0)
Do those tests fail without this PR? (They should!)
assert np.logical_and(
    proba_test >= 0.0, proba_test <= 1.0
).all(), "Probability should always be in [0, 1] range."
definitely fails.
assert_allclose(proba_test.sum(axis=1), 1.0)
doesn't though, as we made sure the sum is 1.
I'll give another round of review next week. I need to dig deeper to get to the bottom of it.
I played a little with this PR and I think the cleanest solution is to change the node values from counts to probabilities (as is done for regression trees and in HGBT). In node_value:
cdef void node_value(self, float64_t* dest) noexcept nogil:
    cdef intp_t k, c
    for k in range(self.n_outputs):
        for c in range(self.n_classes[k]):
            dest[c] = self.sum_total[k, c] / self.weighted_n_node_samples
        dest += self.max_n_classes
Then we don't need any lower/upper_bound clipping and can have almost the same as for regression criteria:
cdef void clip_node_value(
    self, float64_t * dest, float64_t lower_bound, float64_t upper_bound
) noexcept nogil:
    cdef:
        float64_t total_weighted_count = dest[0] + dest[1]
    if dest[0] < lower_bound:
        dest[0] = lower_bound
    elif dest[0] > upper_bound:
        dest[0] = upper_bound
    # Values for binary classification must sum to total_weighted_count.
    dest[1] = total_weighted_count - dest[0]
I did not benchmark, and yes, it might have a veeeery minor impact on the tree fitting. We could, however, in a follow-up PR, make …
This is also my favorite solution. And it actually would require no change in … I'm actually surprised at how little the rest of the codebase is impacted by changes of the …
@samronsin Could you push those commits to make the node values probabilities instead of counts? Fixing the monotonicity in trees is a release blocker for 1.4.
LGTM
Only minor changes.
For the second reviewer: this seems like an innocent change but it may have some yet undiscovered side effects. For instance, I don't know if the tree export/visualize functions need to be adapted, too.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
tree.n_classes[0] == 1
and isinstance(node_val, Iterable)
and self.colors["bounds"] is not None
):
This is indeed much cleaner.
Those changes make everything much simpler. I'll take a deeper look at the export function later today. But for the fix on the tree it looks good.
… and ExtraTreeClassifier
LGTM.
LGTM.
Thanks @samronsin and @lorentzenchr for having taken the time to address this and converge towards the best solution.
Thank you @glemaitre and @lorentzenchr!
Thank you for fixing this, @glemaitre, @samronsin and @lorentzenchr.
Reference Issues/PRs
Follow-up on #13649 to fix a bug where classification probability predictions are outside of the [0, 1] range.
What does this implement/fix? Explain your changes.
Currently on the main branch, probabilities predicted by all tree-based models supporting monotonic_cst can be (and often are) outside of the expected [0, 1] range when enforcing a monotonic_cst. 😱 😱 😱 😱
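A sketch of the kind of reproducer behind this claim (the data, constraint signs and hyperparameters below are illustrative, not the original script):
# Hedged sketch: on a build of main without this fix, the printed min/max can
# fall outside [0, 1]; after the fix they cannot.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=5, random_state=0)

est = DecisionTreeClassifier(
    monotonic_cst=[1, -1, 0, 0, 0],  # constraints on the first two features
    random_state=0,
)
proba = est.fit(X, y).predict_proba(X)
print(proba.min(), proba.max())
# Flip to splitter="random" to exercise the random-splitter case mentioned below.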
Same with random splitter:
After the fix all the probabilities lie between 0 and 1 😅.
Any other comments?
I need to investigate further, but the bug seems to be another occurrence (see #27630) of unexpected middle_values that are propagated down and end up producing unexpected node values through clipping.