FIX Fix ExtraTreeRegressor missing data handling #30318
Conversation
test_regression_extra_tree_missing_values_toy

```diff
@@ -2689,10 +2689,8 @@ def test_regression_tree_missing_values_toy(Tree, X, criterion):
     impurity = tree.tree_.impurity
     assert all(impurity >= 0), impurity.min()  # MSE should always be positive
 
-    # Note: the impurity matches after the first split only on greedy trees
-    if Tree is DecisionTreeRegressor:
```
With the fix, impurities match even for ExtraTreeRegressor, so I removed the `if` clause.
Uhm, this is actually something that we wondered about with @adam2392 in his original PR. So it might have been this hidden bug.
support missing-values in the data matrix ``X``. Missing-values are handled by
randomly moving all of the samples to the left or right child node as the tree is
traversed.

By :user:`Adam Li <adam2392>` and :user:`Loïc Estève <lesteve>`
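For illustration only (not part of the PR), a minimal usage sketch of the behaviour this changelog entry describes, assuming a scikit-learn build that includes this feature and fix:

```python
import numpy as np
from sklearn.tree import ExtraTreeRegressor

# X contains a missing value; during fit, each split sends all missing
# samples to one randomly chosen child, recorded per node in
# tree_.missing_go_to_left (the same attribute the reproduction snippet
# further down inspects).
X = np.array([[1.0], [2.0], [3.0], [np.nan]])
y = np.array([0.0, 1.0, 2.0, 3.0])

tree = ExtraTreeRegressor(random_state=0).fit(X, y)
print(tree.tree_.missing_go_to_left)  # one flag per node
print(tree.predict([[np.nan]]))       # NaN is routed using those flags
```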
Since the feature has not been released, and assuming we want the fix in 1.6, I used the same content so that towncrier merges them and mentions both PRs. I added the "No Changelog needed" label to make the changelog check green (another instance of #30222).
To be honest, I would be fine with not adding a changelog as well, let me know what you think ...
I don't mind. Maybe this is actually a nice implementation to have.
With the pair-debugging session, I'm convinced that this is the right fix ;)
LGTM. Thanks @lesteve for spotting this one.
Tagging as blocker since we need to backport this one.
Interesting. So is my interpretation correct: there was essentially a hidden bug where the partitioning moved samples left or right, including samples with missing values, when it shouldn't have?
Summary

My understanding was that the split position was wrong because it was taking into account uninitialised values at the end of `feature_values`.

Debugging session

With this diff removing the fix and adding debug code:

```diff
diff --git a/sklearn/tree/_partitioner.pyx b/sklearn/tree/_partitioner.pyx
index 195b7e2caf..d3dddcbebd 100644
--- a/sklearn/tree/_partitioner.pyx
+++ b/sklearn/tree/_partitioner.pyx
@@ -194,10 +194,17 @@ cdef class DensePartitioner:
         """Partition samples for feature_values at the current_threshold."""
         cdef:
             intp_t p = self.start
-            intp_t partition_end = self.end - self.n_missing
+            intp_t partition_end = self.end  # - self.n_missing
             intp_t[::1] samples = self.samples
             float32_t[::1] feature_values = self.feature_values
 
+        with gil:
+            print('current_threshold', current_threshold)
+            print('before partition feature_values', np.asarray(feature_values))
+            print('partition_end', partition_end)
+            print('partition_end - self.n_missing', partition_end - self.n_missing)
+
+
         while p < partition_end:
             if feature_values[p] <= current_threshold:
                 p += 1
@@ -209,6 +216,10 @@ cdef class DensePartitioner:
                 )
                 samples[p], samples[partition_end] = samples[partition_end], samples[p]
 
+        with gil:
+            print('after partition feature_values', np.asarray(feature_values))
+            print('split_pos', partition_end)
+
         return partition_end
 
     cdef inline void partition_samples_final(
```

and this Python snippet:

```python
import numpy as np

from sklearn.tree import ExtraTreeRegressor, DecisionTreeRegressor
from sklearn.tree import export_text

for i in range(2):
    print("-" * 80)
    print(f"{i}")
    print("-" * 80)
    X = np.array([1, 2, 3, 4, np.nan, np.nan])
    criterion = "squared_error"
    Tree = ExtraTreeRegressor

    X = X.reshape(-1, 1)
    y = np.arange(6)

    tree = Tree(criterion=criterion, random_state=0, max_depth=2).fit(X, y)
    print(export_text(tree))

    n_node_samples = tree.tree_.n_node_samples
    missing_go_to_left = tree.tree_.missing_go_to_left
    print(f"{n_node_samples=}")
    print(f"{missing_go_to_left=}")

    impurity = tree.tree_.impurity
    print("impurity", impurity)
    assert all(impurity >= 0), impurity  # MSE should always be positive
```

Output:
Debugging analysis

On iteration 0, you can tell that there are uninitialised values at the end of `feature_values`. On iteration 1, the uninitialised values are different and somehow this ends up creating negative impurities and the assertion error.

For completeness, here is the output with the fix:
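To make the failure mode concrete, here is a small pure-Python sketch of the partition loop (my own simplification of `DensePartitioner.partition_samples`, not code from the PR). The stale `0.0` values stand in for whatever leftover data happens to sit in the trailing `n_missing` slots of `feature_values`:

```python
import numpy as np

def partition_samples(feature_values, samples, start, end, n_missing,
                      threshold, exclude_missing):
    # Mirrors the Cython loop above: advance p past values <= threshold,
    # swap larger values towards the end, return the split position.
    p = start
    partition_end = end - n_missing if exclude_missing else end
    while p < partition_end:
        if feature_values[p] <= threshold:
            p += 1
        else:
            partition_end -= 1
            feature_values[p], feature_values[partition_end] = (
                feature_values[partition_end], feature_values[p])
            samples[p], samples[partition_end] = (
                samples[partition_end], samples[p])
    return partition_end

# 4 finite values, 2 missing samples parked at the end; their slots in
# feature_values hold stale garbage (here 0.0) instead of real data.
fv = np.array([1.0, 2.0, 3.0, 4.0, 0.0, 0.0], dtype=np.float32)
idx = np.arange(6)

# Fixed behaviour: the trailing n_missing slots are never scanned.
print(partition_samples(fv.copy(), idx.copy(), 0, 6, 2, 2.5, True))   # 2

# Buggy behaviour: the garbage 0.0 values compare <= threshold, the
# missing samples get shuffled into the "left" block, and the returned
# split position is wrong (4 instead of 2).
print(partition_samples(fv.copy(), idx.copy(), 0, 6, 2, 2.5, False))  # 4
```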
LGTM. Thanks for catching this and giving it the proper fix, @lesteve!
Thanks for the fix. This bug was probably the cause of one of the failures I observed when randomizing the tests last summer: #29584 (comment)
Originally started from #30258; it appears this is not free-threaded specific. This fails consistently on my machine on iteration 1 on `main`.

Locally the added test fails for 38 out of the 100 seeds. I made sure that the test was passing on all the random seeds in the CI on 54f22cf.

Fix #30258.

Debugged with @glemaitre, cc @adam2392 for his take on the fix.
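For reference, a hedged sketch of the kind of local seed sweep the description alludes to; the "38 out of the 100 seeds" figure comes from scikit-learn's `SKLEARN_TESTS_GLOBAL_RANDOM_SEED` test mechanism, not from this exact loop:

```python
import numpy as np
from sklearn.tree import ExtraTreeRegressor

# The toy input from the reproduction snippet above.
X = np.array([1, 2, 3, 4, np.nan, np.nan]).reshape(-1, 1)
y = np.arange(6)

failing = [
    seed
    for seed in range(100)
    if (ExtraTreeRegressor(random_state=seed, max_depth=2)
        .fit(X, y).tree_.impurity < 0).any()
]
# On a build without the fix, some seeds report negative impurities;
# with the fix this list should be empty.
print(f"{len(failing)} / 100 seeds produce a negative impurity")
```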