FIX Fix ExtraTreeRegressor missing data handling #30318


Merged

Conversation

@lesteve lesteve (Member) commented Nov 21, 2024

Originally started from #30258; it appears this is not free-threaded specific. This fails consistently on my machine on iteration 1 on main.

import numpy as np
from sklearn.tree import ExtraTreeRegressor, DecisionTreeRegressor
from sklearn.tree import export_text


if __name__ == "__main__":
    for i in range(10):
        X = np.array([1, 2, 3, 4, np.nan, np.nan])
        criterion = "squared_error"
        Tree = ExtraTreeRegressor
        X = X.reshape(-1, 1)
        y = np.arange(6)

        tree = Tree(criterion=criterion, random_state=0, max_depth=2).fit(X, y)
        print(export_text(tree))
        print(tree.tree_.children_left)
        print(tree.tree_.children_right)
        print(tree.tree_.n_node_samples)
        print(tree.tree_.missing_go_to_left)

        impurity = tree.tree_.impurity
        print('impurity', impurity)
        assert all(impurity >= 0), impurity  # MSE should always be non-negative

Locally, the added test fails for 38 out of the 100 seeds.

I made sure that the test was passing on all the random seeds in the CI on 54f22cf.

Fix #30258.

Debugged with @glemaitre, cc @adam2392 for his take on the fix.

@lesteve lesteve marked this pull request as draft November 21, 2024 10:17
@glemaitre glemaitre self-requested a review November 21, 2024 10:17

github-actions bot commented Nov 21, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: a2ab1b0.

test_regression_extra_tree_missing_values_toy
@@ -2689,10 +2689,8 @@ def test_regression_tree_missing_values_toy(Tree, X, criterion):
     impurity = tree.tree_.impurity
     assert all(impurity >= 0), impurity.min()  # MSE should always be positive

-    # Note: the impurity matches after the first split only on greedy trees
-    if Tree is DecisionTreeRegressor:
@lesteve lesteve (Member, Author) Nov 21, 2024

With the fix, impurities match even for ExtraTreeRegressor, so I removed the if clause.

Member

Hmm, this is actually something we wondered about with @adam2392 in his PR originally. So it might have been this hidden bug.

@lesteve lesteve changed the title from "Investigate issue with ExtraTree missing values" to "FIX Fix ExtraTreeRegressor missing data handling" Nov 21, 2024
@lesteve lesteve marked this pull request as ready for review November 21, 2024 15:20
support missing values in the data matrix ``X``. Missing values are handled by
randomly moving all of the samples to the left or right child node as the tree is
traversed.
By :user:`Adam Li <adam2392>` and :user:`Loïc Estève <lesteve>`
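
As a hedged illustration of the documented behaviour (a minimal sketch, not from the PR; it assumes scikit-learn >= 1.6 with this fix, and the toy data is made up):

import numpy as np
from sklearn.tree import ExtraTreeRegressor

# NaN is allowed in the training data; at each split, all missing samples
# are routed together to one randomly chosen child, recorded per node in
# tree_.missing_go_to_left.
X = np.array([[1.0], [2.0], [3.0], [np.nan]])
y = np.array([0.0, 1.0, 2.0, 1.5])
reg = ExtraTreeRegressor(random_state=0).fit(X, y)
print(reg.predict(np.array([[np.nan]])))  # NaN follows the recorded routing
print(reg.tree_.missing_go_to_left)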
@lesteve lesteve (Member, Author) Nov 21, 2024

Since the feature has not been released, and assuming we want the fix in 1.6, I used the same content so that towncrier merges the two entries and mentions both PRs. I added the "No Changelog Needed" label to make the changelog check green (another instance of #30222).

To be honest, I would also be fine with not adding a changelog entry; let me know what you think ...

Member

I don't mind. Maybe this is a nice implementation to have actually.

@glemaitre glemaitre (Member) left a comment

After the pair-debugging session, I'm convinced that this is the right fix ;)

LGTM. Thanks @lesteve for spotting this one.

@glemaitre glemaitre added this to the 1.6 milestone Nov 21, 2024
@glemaitre glemaitre (Member)

Tagging as blocker since we need to backport this one.

@glemaitre glemaitre added the To backport label (PR merged in master that needs a backport to a release branch, defined based on the milestone) Nov 21, 2024
@adam2392 adam2392 (Member)

Interesting. So is my interpretation correct: there was essentially a hidden bug where the partitioning was moving samples left or right, including the ones with missing values, when it shouldn't have?

@lesteve lesteve (Member, Author) commented Nov 22, 2024

Summary

My understanding was that the split position was wrong because it was taking into account uninitialised values at the end of feature_values [1] that correspond to NaN values in X. By using partition_end = self.end - self.n_missing we ignore these uninitialised values, as we should, when computing the split position.
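
To make this concrete, here is a minimal pure-Python sketch of the partition loop (mirroring the logic of DensePartitioner.partition_samples, not the actual Cython code), fed with the iteration-0 values from the debug output below:

def partition_samples(feature_values, n_missing, threshold):
    # NaN samples have already been moved to the tail of feature_values;
    # their slots may hold arbitrary stale data. Subtracting n_missing
    # (the fix) keeps the tail out of the partition.
    p, partition_end = 0, len(feature_values) - n_missing
    while p < partition_end:
        if feature_values[p] <= threshold:
            p += 1
        else:
            partition_end -= 1
            feature_values[p], feature_values[partition_end] = (
                feature_values[partition_end], feature_values[p])
    return partition_end  # split position

# Tail slots hold stale values standing in for the two NaNs in X.
vals = [1.0, 2.0, 3.0, 4.0, 4.5e-44, 0.0]
print(partition_samples(list(vals), n_missing=2, threshold=1.1365261627996928))  # 1 (fixed)
print(partition_samples(list(vals), n_missing=0, threshold=1.1365261627996928))  # 3 (the bug)

With n_missing=0 (the old behaviour), the stale tail values 4.5e-44 and 0.0 compare below the threshold and inflate the left partition, which is exactly what the debug output below shows.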

Debugging session

With this diff removing the fix and adding debug code:

diff --git a/sklearn/tree/_partitioner.pyx b/sklearn/tree/_partitioner.pyx
index 195b7e2caf..d3dddcbebd 100644
--- a/sklearn/tree/_partitioner.pyx
+++ b/sklearn/tree/_partitioner.pyx
@@ -194,10 +194,17 @@ cdef class DensePartitioner:
         """Partition samples for feature_values at the current_threshold."""
         cdef:
             intp_t p = self.start
-            intp_t partition_end = self.end - self.n_missing
+            intp_t partition_end = self.end # - self.n_missing
             intp_t[::1] samples = self.samples
             float32_t[::1] feature_values = self.feature_values
 
+        with gil:
+            print('current_threshold', current_threshold)
+            print('before partition feature_values', np.asarray(feature_values))
+            print('partition_end', partition_end)
+            print('partition_end - self.n_missing', partition_end - self.n_missing)
+
+
         while p < partition_end:
             if feature_values[p] <= current_threshold:
                 p += 1
@@ -209,6 +216,10 @@ cdef class DensePartitioner:
                 )
                 samples[p], samples[partition_end] = samples[partition_end], samples[p]
 
+        with gil:
+            print('after partition feature_values', np.asarray(feature_values))
+            print('split_pos', partition_end)
+
         return partition_end
 
     cdef inline void partition_samples_final(

and this Python snippet:

import numpy as np
from sklearn.tree import ExtraTreeRegressor, DecisionTreeRegressor
from sklearn.tree import export_text


for i in range(2):
    print("-" * 80)
    print(f"{i}")
    print("-" * 80)
    X = np.array([1, 2, 3, 4, np.nan, np.nan])
    criterion = "squared_error"
    Tree = ExtraTreeRegressor
    X = X.reshape(-1, 1)
    y = np.arange(6)

    tree = Tree(criterion=criterion, random_state=0, max_depth=2).fit(X, y)
    print(export_text(tree))
    n_node_samples = tree.tree_.n_node_samples
    missing_go_to_left = tree.tree_.missing_go_to_left
    print(f"{n_node_samples=}")
    print(f"{missing_go_to_left=}")

    impurity = tree.tree_.impurity
    print("impurity", impurity)
    assert all(impurity >= 0), impurity  # MSE should always be positive

Output:

--------------------------------------------------------------------------------
0
--------------------------------------------------------------------------------
current_threshold 1.1365261627996928
before partition feature_values [1.0e+00 2.0e+00 3.0e+00 4.0e+00 4.5e-44 0.0e+00]
partition_end 6
partition_end - self.n_missing 4
after partition feature_values [1.0e+00 0.0e+00 4.5e-44 4.0e+00 3.0e+00 2.0e+00]
split_pos 3
|--- feature_0 <= 1.14
|   |--- value: [3.00]
|--- feature_0 >  1.14
|   |--- value: [2.00]

n_node_samples=array([6, 3, 3], dtype=int64)
missing_go_to_left=array([0, 0, 0], dtype=uint8)
impurity [2.91666667 4.66666667 0.66666667]
--------------------------------------------------------------------------------
1
--------------------------------------------------------------------------------
current_threshold 1.1365261627996928
before partition feature_values [1.0000000e+00 2.0000000e+00 3.0000000e+00 4.0000000e+00 1.4660158e+13
 1.7916666e+00]
partition_end 6
partition_end - self.n_missing 4
after partition feature_values [1.0000000e+00 3.0000000e+00 4.0000000e+00 1.4660158e+13 1.7916666e+00
 2.0000000e+00]
split_pos 1
current_threshold 3.1701456155488943
before partition feature_values [1.        3.        4.        2.        1.7916666 2.       ]
partition_end 6
partition_end - self.n_missing 4
after partition feature_values [1.        3.        2.        2.        1.7916666 4.       ]
split_pos 5
|--- feature_0 <= 1.14
|   |--- value: [0.00]
|--- feature_0 >  1.14
|   |--- feature_0 <= 3.17
|   |   |--- value: [3.00]
|   |--- feature_0 >  3.17
|   |   |--- value: [3.00]

n_node_samples=array([6, 1, 5, 4, 1], dtype=int64)
missing_go_to_left=array([0, 0, 0, 0, 0], dtype=uint8)
impurity [  2.91666667   0.           2.           9.88888889 -11.5       ]
Traceback (most recent call last):
  File "/tmp/test_tree.py", line 25, in <module>
    assert all(impurity >= 0), impurity  # MSE should always be positive
           ~~~^^^^^^^^^^^^^^^
AssertionError: [  2.91666667   0.           2.           9.88888889 -11.5       ]

Debugging Analysis

On iteration 0, you can tell that the values 1, 2, 3, 4 from X come first and the NaNs are at the end, but their slots hold uninitialised values in feature_values (4.5e-44 and 0.0 here). The threshold is 1.13, so the split position should be 1, since the last two values correspond to NaN and should be ignored; instead the uninitialised values compare below the threshold and split_pos comes out as 3. You can also tell something is off because n_node_samples is [6, 3, 3] and missing_go_to_left is [0, 0, 0], which does not make sense with a threshold of 1.13: the left child should have 1 sample.

On iteration 1, the uninitialised values are different (1.4660158e13 and 1.7916666) and somehow this ends up creating negative impurities, so the assert fails (I haven't fully tracked down why) ...
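
As a quick sanity check of that expectation (a throwaway sketch, not part of the PR): with the threshold of ~1.14 and missing_go_to_left == 0, the single non-missing value below the threshold goes left and the two NaN samples are routed right, so the root split should be 1 left / 5 right, matching the fixed n_node_samples below.

import numpy as np

X = np.array([1, 2, 3, 4, np.nan, np.nan])
threshold = 1.1365261627996928
n_left = int(np.sum(X <= threshold))                      # 1
n_right = int(np.sum(X > threshold) + np.isnan(X).sum())  # 3 + 2 = 5
print(n_left, n_right)  # matches n_node_samples=array([6, 1, 5, ...])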

For completeness, here is the output with the fix:

--------------------------------------------------------------------------------
0
--------------------------------------------------------------------------------
current_threshold 1.1365261627996928
before partition feature_values [1.0e+00 2.0e+00 3.0e+00 4.0e+00 4.5e-44 0.0e+00]
partition_end 4
partition_end - self.n_missing 2
after partition feature_values [1.0e+00 3.0e+00 4.0e+00 2.0e+00 4.5e-44 0.0e+00]
split_pos 1
current_threshold 3.1701456155488943
before partition feature_values [1.0e+00 3.0e+00 4.0e+00 2.0e+00 4.5e-44 0.0e+00]
partition_end 4
partition_end - self.n_missing 2
after partition feature_values [1.0e+00 3.0e+00 2.0e+00 4.0e+00 4.5e-44 0.0e+00]
split_pos 3
|--- feature_0 <= 1.14
|   |--- value: [0.00]
|--- feature_0 >  1.14
|   |--- feature_0 <= 3.17
|   |   |--- value: [1.50]
|   |--- feature_0 >  3.17
|   |   |--- value: [4.00]

n_node_samples=array([6, 1, 5, 2, 3], dtype=int64)
missing_go_to_left=array([0, 0, 0, 0, 0], dtype=uint8)
impurity [2.91666667 0.         2.         0.25       0.66666667]
--------------------------------------------------------------------------------
1
--------------------------------------------------------------------------------
current_threshold 1.1365261627996928
before partition feature_values [ 1.  2.  3.  4. nan nan]
partition_end 4
partition_end - self.n_missing 2
after partition feature_values [ 1.  3.  4.  2. nan nan]
split_pos 1
current_threshold 3.1701456155488943
before partition feature_values [ 1.  3.  4.  2. nan nan]
partition_end 4
partition_end - self.n_missing 2
after partition feature_values [ 1.  3.  2.  4. nan nan]
split_pos 3
|--- feature_0 <= 1.14
|   |--- value: [0.00]
|--- feature_0 >  1.14
|   |--- feature_0 <= 3.17
|   |   |--- value: [1.50]
|   |--- feature_0 >  3.17
|   |   |--- value: [4.00]

n_node_samples=array([6, 1, 5, 2, 3], dtype=int64)
missing_go_to_left=array([0, 0, 0, 0, 0], dtype=uint8)
impurity [2.91666667 0.         2.         0.25       0.66666667]

Footnotes

  1. find_min_max is the method that moves the uninitialised values to the end of feature_values.

@adam2392 adam2392 (Member) left a comment

LGTM. Thanks for catching this and giving it the proper fix, @lesteve!

@adam2392 adam2392 merged commit 27a903b into scikit-learn:main Nov 22, 2024
42 of 43 checks passed
@lesteve lesteve deleted the investigate-extratree-missing-values branch November 23, 2024 09:51
@ogrisel ogrisel (Member) commented Nov 25, 2024

Thanks for the fix. This bug was probably the cause of one of the failures I observed when randomizing the tests last summer: #29584 (comment)

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Dec 4, 2024
virchan pushed a commit to virchan/scikit-learn that referenced this pull request Dec 9, 2024
Labels
Blocker, cython, module:tree, No Changelog Needed, To backport
Development

Successfully merging this pull request may close these issues.

⚠️ CI failed on Linux_free_threaded.pylatest_pip_free_threaded (last failure: Nov 10, 2024) ⚠️