FIX Fix ExtraTreeRegressor missing data handling #30318


Merged

Conversation

@lesteve lesteve (Member) commented Nov 21, 2024

Originally started from #30258; it appears this is not free-threaded specific. This fails consistently on my machine on iteration 1 on main.

import numpy as np
from sklearn.tree import ExtraTreeRegressor, DecisionTreeRegressor
from sklearn.tree import export_text


if __name__ == "__main__":
    for i in range(10):
        X = np.array([1, 2, 3, 4, np.nan, np.nan])
        criterion = "squared_error"
        Tree = ExtraTreeRegressor
        X = X.reshape(-1, 1)
        y = np.arange(6)

        tree = Tree(criterion=criterion, random_state=0, max_depth=2).fit(X, y)
        print(export_text(tree))
        print(tree.tree_.children_left)
        print(tree.tree_.children_right)
        print(tree.tree_.n_node_samples)
        print(tree.tree_.missing_go_to_left)

        impurity = tree.tree_.impurity
        print('impurity', impurity)
        assert all(impurity >= 0), impurity  # MSE should always be non-negative

Locally, the added test fails for 38 out of the 100 seeds.

I made sure that the test was passing on all the random seeds in the CI on 54f22cf.

Fix #30258.

Debugged with @glemaitre, cc @adam2392 for his take on the fix.

@lesteve lesteve marked this pull request as draft November 21, 2024 10:17
@glemaitre glemaitre self-requested a review November 21, 2024 10:17

github-actions bot commented Nov 21, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: a2ab1b0.

test_regression_extra_tree_missing_values_toy
@@ -2689,10 +2689,8 @@ def test_regression_tree_missing_values_toy(Tree, X, criterion):
     impurity = tree.tree_.impurity
     assert all(impurity >= 0), impurity.min()  # MSE should always be positive

-    # Note: the impurity matches after the first split only on greedy trees
-    if Tree is DecisionTreeRegressor:
@lesteve lesteve (Member, Author) Nov 21, 2024

With the fix, impurities match even for ExtraTreeRegressor, so I removed the if clause.

Member

Hmm, this is actually something we wondered about with @adam2392 in his PR originally. So it might have been this hidden bug.

@lesteve lesteve changed the title from "Investigate issue with ExtraTree missing values" to "FIX Fix ExtraTreeRegressor missing data handling" Nov 21, 2024
@lesteve lesteve marked this pull request as ready for review November 21, 2024 15:20
support missing values in the data matrix ``X``. Missing values are handled by
randomly moving all of the samples to the left or right child node as the tree is
traversed.
By :user:`Adam Li <adam2392>` and :user:`Loïc Estève <lesteve>`
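
As a hedged illustration of the documented behaviour (a minimal sketch, not from the PR; it assumes scikit-learn >= 1.6 with this fix, and the toy data is made up):

import numpy as np
from sklearn.tree import ExtraTreeRegressor

# NaN is allowed in the training data; at each split, all missing samples
# are routed together to one randomly chosen child, recorded per node in
# tree_.missing_go_to_left.
X = np.array([[1.0], [2.0], [3.0], [np.nan]])
y = np.array([0.0, 1.0, 2.0, 1.5])
reg = ExtraTreeRegressor(random_state=0).fit(X, y)
print(reg.predict(np.array([[np.nan]])))  # NaN follows the recorded routing
print(reg.tree_.missing_go_to_left)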
@lesteve lesteve (Member, Author) Nov 21, 2024

Since the feature has not been released, and assuming we want the fix in 1.6, I used the same content so that towncrier merges the two entries and mentions both PRs. I added the "No Changelog Needed" label to make the changelog check green (another instance of #30222).

To be honest, I would also be fine with not adding a changelog entry; let me know what you think ...

Member

I don't mind. Maybe this is a nice implementation to have actually.

@glemaitre glemaitre (Member) left a comment

After the pair-debugging session, I'm convinced that this is the right fix ;)

LGTM. Thanks @lesteve for spotting this one.

@glemaitre glemaitre added this to the 1.6 milestone Nov 21, 2024
@glemaitre glemaitre (Member)

Tagging as blocker since we need to backport this one.

@glemaitre glemaitre added the To backport label (PR merged in master that needs a backport to a release branch, defined based on the milestone) Nov 21, 2024
@adam2392 adam2392 (Member)

Interesting. So is my interpretation correct: there was essentially a hidden bug where the partitioning was moving samples left or right, including the ones with missing values, when it shouldn't have?

@lesteve lesteve (Member, Author) commented Nov 22, 2024

Summary

My understanding was that the split position was wrong because it was taking into account uninitialised values at the end of feature_values [1] that correspond to NaN values in X. By using partition_end = self.end - self.n_missing we ignore these uninitialised values, as we should, when computing the split position.
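
To make this concrete, here is a minimal pure-Python sketch of the partition loop (mirroring the logic of DensePartitioner.partition_samples, not the actual Cython code), fed with the iteration-0 values from the debug output below:

def partition_samples(feature_values, n_missing, threshold):
    # NaN samples have already been moved to the tail of feature_values;
    # their slots may hold arbitrary stale data. Subtracting n_missing
    # (the fix) keeps the tail out of the partition.
    p, partition_end = 0, len(feature_values) - n_missing
    while p < partition_end:
        if feature_values[p] <= threshold:
            p += 1
        else:
            partition_end -= 1
            feature_values[p], feature_values[partition_end] = (
                feature_values[partition_end], feature_values[p])
    return partition_end  # split position

# Tail slots hold stale values standing in for the two NaNs in X.
vals = [1.0, 2.0, 3.0, 4.0, 4.5e-44, 0.0]
print(partition_samples(list(vals), n_missing=2, threshold=1.1365261627996928))  # 1 (fixed)
print(partition_samples(list(vals), n_missing=0, threshold=1.1365261627996928))  # 3 (the bug)

With n_missing=0 (the old behaviour), the stale tail values 4.5e-44 and 0.0 compare below the threshold and inflate the left partition, which is exactly what the debug output below shows.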

Debugging session

With this diff removing the fix and adding debug code:

diff --git a/sklearn/tree/_partitioner.pyx b/sklearn/tree/_partitioner.pyx
index 195b7e2caf..d3dddcbebd 100644
--- a/sklearn/tree/_partitioner.pyx
+++ b/sklearn/tree/_partitioner.pyx
@@ -194,10 +194,17 @@ cdef class DensePartitioner:
         """Partition samples for feature_values at the current_threshold."""
         cdef:
             intp_t p = self.start
-            intp_t partition_end = self.end - self.n_missing
+            intp_t partition_end = self.end # - self.n_missing
             intp_t[::1] samples = self.samples
             float32_t[::1] feature_values = self.feature_values
 
+        with gil:
+            print('current_threshold', current_threshold)
+            print('before partition feature_values', np.asarray(feature_values))
+            print('partition_end', partition_end)
+            print('partition_end - self.n_missing', partition_end - self.n_missing)
+
+
         while p < partition_end:
             if feature_values[p] <= current_threshold:
                 p += 1
@@ -209,6 +216,10 @@ cdef class DensePartitioner:
                 )
                 samples[p], samples[partition_end] = samples[partition_end], samples[p]
 
+        with gil:
+            print('after partition feature_values', np.asarray(feature_values))
+            print('split_pos', partition_end)
+
         return partition_end
 
     cdef inline void partition_samples_final(

and this Python snippet:

import numpy as np
from sklearn.tree import ExtraTreeRegressor, DecisionTreeRegressor
from sklearn.tree import export_text


for i in range(2):
    print("-" * 80)
    print(f"{i}")
    print("-" * 80)
    X = np.array([1, 2, 3, 4, np.nan, np.nan])
    criterion = "squared_error"
    Tree = ExtraTreeRegressor
    X = X.reshape(-1, 1)
    y = np.arange(6)

    tree = Tree(criterion=criterion, random_state=0, max_depth=2).fit(X, y)
    print(export_text(tree))
    n_node_samples = tree.tree_.n_node_samples
    missing_go_to_left = tree.tree_.missing_go_to_left
    print(f"{n_node_samples=}")
    print(f"{missing_go_to_left=}")

    impurity = tree.tree_.impurity
    print("impurity", impurity)
    assert all(impurity >= 0), impurity  # MSE should always be positive

Output:

--------------------------------------------------------------------------------
0
--------------------------------------------------------------------------------
current_threshold 1.1365261627996928
before partition feature_values [1.0e+00 2.0e+00 3.0e+00 4.0e+00 4.5e-44 0.0e+00]
partition_end 6
partition_end - self.n_missing 4
after partition feature_values [1.0e+00 0.0e+00 4.5e-44 4.0e+00 3.0e+00 2.0e+00]
split_pos 3
|--- feature_0 <= 1.14
|   |--- value: [3.00]
|--- feature_0 >  1.14
|   |--- value: [2.00]

n_node_samples=array([6, 3, 3], dtype=int64)
missing_go_to_left=array([0, 0, 0], dtype=uint8)
impurity [2.91666667 4.66666667 0.66666667]
--------------------------------------------------------------------------------
1
--------------------------------------------------------------------------------
current_threshold 1.1365261627996928
before partition feature_values [1.0000000e+00 2.0000000e+00 3.0000000e+00 4.0000000e+00 1.4660158e+13
 1.7916666e+00]
partition_end 6
partition_end - self.n_missing 4
after partition feature_values [1.0000000e+00 3.0000000e+00 4.0000000e+00 1.4660158e+13 1.7916666e+00
 2.0000000e+00]
split_pos 1
current_threshold 3.1701456155488943
before partition feature_values [1.        3.        4.        2.        1.7916666 2.       ]
partition_end 6
partition_end - self.n_missing 4
after partition feature_values [1.        3.        2.        2.        1.7916666 4.       ]
split_pos 5
|--- feature_0 <= 1.14
|   |--- value: [0.00]
|--- feature_0 >  1.14
|   |--- feature_0 <= 3.17
|   |   |--- value: [3.00]
|   |--- feature_0 >  3.17
|   |   |--- value: [3.00]

n_node_samples=array([6, 1, 5, 4, 1], dtype=int64)
missing_go_to_left=array([0, 0, 0, 0, 0], dtype=uint8)
impurity [  2.91666667   0.           2.           9.88888889 -11.5       ]
Traceback (most recent call last):
  File "/tmp/test_tree.py", line 25, in <module>
    assert all(impurity >= 0), impurity  # MSE should always be positive
           ~~~^^^^^^^^^^^^^^^
AssertionError: [  2.91666667   0.           2.           9.88888889 -11.5       ]

Debugging Analysis

On iteration 0, you can tell that the values 1, 2, 3, 4 from X come first and the NaNs are at the end, but their slots hold uninitialised values in feature_values (4.5e-44 and 0.0 here). The threshold is 1.13, so the split position should be 1, since the last two values correspond to NaN and should be ignored; instead the uninitialised values compare below the threshold and split_pos comes out as 3. You can also tell something is off because n_node_samples is [6, 3, 3] and missing_go_to_left is [0, 0, 0], which does not make sense with a threshold of 1.13: the left child should have 1 sample.

On iteration 1, the uninitialised values are different (1.4660158e13 and 1.7916666) and somehow this ends up creating negative impurities, so the assert fails (I haven't fully tracked down why) ...
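
As a quick sanity check of that expectation (a throwaway sketch, not part of the PR): with the threshold of ~1.14 and missing_go_to_left == 0, the single non-missing value below the threshold goes left and the two NaN samples are routed right, so the root split should be 1 left / 5 right, matching the fixed n_node_samples below.

import numpy as np

X = np.array([1, 2, 3, 4, np.nan, np.nan])
threshold = 1.1365261627996928
n_left = int(np.sum(X <= threshold))                      # 1
n_right = int(np.sum(X > threshold) + np.isnan(X).sum())  # 3 + 2 = 5
print(n_left, n_right)  # matches n_node_samples=array([6, 1, 5, ...])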

For completeness, here is the output with the fix:

--------------------------------------------------------------------------------
0
--------------------------------------------------------------------------------
current_threshold 1.1365261627996928
before partition feature_values [1.0e+00 2.0e+00 3.0e+00 4.0e+00 4.5e-44 0.0e+00]
partition_end 4
partition_end - self.n_missing 2
after partition feature_values [1.0e+00 3.0e+00 4.0e+00 2.0e+00 4.5e-44 0.0e+00]
split_pos 1
current_threshold 3.1701456155488943
before partition feature_values [1.0e+00 3.0e+00 4.0e+00 2.0e+00 4.5e-44 0.0e+00]
partition_end 4
partition_end - self.n_missing 2
after partition feature_values [1.0e+00 3.0e+00 2.0e+00 4.0e+00 4.5e-44 0.0e+00]
split_pos 3
|--- feature_0 <= 1.14
|   |--- value: [0.00]
|--- feature_0 >  1.14
|   |--- feature_0 <= 3.17
|   |   |--- value: [1.50]
|   |--- feature_0 >  3.17
|   |   |--- value: [4.00]

n_node_samples=array([6, 1, 5, 2, 3], dtype=int64)
missing_go_to_left=array([0, 0, 0, 0, 0], dtype=uint8)
impurity [2.91666667 0.         2.         0.25       0.66666667]
--------------------------------------------------------------------------------
1
--------------------------------------------------------------------------------
current_threshold 1.1365261627996928
before partition feature_values [ 1.  2.  3.  4. nan nan]
partition_end 4
partition_end - self.n_missing 2
after partition feature_values [ 1.  3.  4.  2. nan nan]
split_pos 1
current_threshold 3.1701456155488943
before partition feature_values [ 1.  3.  4.  2. nan nan]
partition_end 4
partition_end - self.n_missing 2
after partition feature_values [ 1.  3.  2.  4. nan nan]
split_pos 3
|--- feature_0 <= 1.14
|   |--- value: [0.00]
|--- feature_0 >  1.14
|   |--- feature_0 <= 3.17
|   |   |--- value: [1.50]
|   |--- feature_0 >  3.17
|   |   |--- value: [4.00]

n_node_samples=array([6, 1, 5, 2, 3], dtype=int64)
missing_go_to_left=array([0, 0, 0, 0, 0], dtype=uint8)
impurity [2.91666667 0.         2.         0.25       0.66666667]

Footnotes

  1. find_min_max is the method that moves the uninitialised values to the end of feature_values.

@adam2392 adam2392 (Member) left a comment

LGTM. Thanks for catching this and giving it the proper fix, @lesteve!

@adam2392 adam2392 merged commit 27a903b into scikit-learn:main Nov 22, 2024
42 of 43 checks passed
@lesteve lesteve deleted the investigate-extratree-missing-values branch November 23, 2024 09:51
@ogrisel ogrisel (Member) commented Nov 25, 2024

Thanks for the fix. This bug was probably the cause of one of the failures I observed when randomizing the tests last summer: #29584 (comment)

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Dec 4, 2024
virchan pushed a commit to virchan/scikit-learn that referenced this pull request Dec 9, 2024
Labels
Blocker, cython, module:tree, No Changelog Needed, To backport
Development

Successfully merging this pull request may close these issues.

⚠️ CI failed on Linux_free_threaded.pylatest_pip_free_threaded (last failure: Nov 10, 2024) ⚠️