ENH support for missing values in ExtraTrees #27931


Closed
mglowacki100 opened this issue Dec 10, 2023 · 8 comments · Fixed by #28268

Comments

@mglowacki100

Describe the workflow you want to enable

Inspired by #26391, I think that support for missing values should also be provided for the ExtraTrees regressor and classifier.

Describe your proposed solution

I think the foundational work has already been done by @thomasjpfan in #26391. Besides tests and documentation, enabling NaN handling should only require modifying sklearn/tree/_classes.py:
For ExtraTreeRegressor, add the method:

    def _more_tags(self):
        # XXX: NaN is only supported for dense arrays, but we set this for the
        # common tests to pass, specifically: check_estimators_nan_inf
        allow_nan = self.criterion in {
            "squared_error",
            "friedman_mse",
            "poisson",
        }
        return {"allow_nan": allow_nan}

For ExtraTreeClassifier, add the method:

    def _more_tags(self):
        # XXX: NaN is only supported for dense arrays, but we set this for the
        # common tests to pass, specifically: check_estimators_nan_inf
        allow_nan = self.criterion in {
            "gini",
            "log_loss",
            "entropy",
        }
        return {"multilabel": True, "allow_nan": allow_nan}

I've run the code locally, and it appears to work as expected. However, I must emphasize that my testing was not exhaustive, and I may have overlooked something obvious.

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@mglowacki100 mglowacki100 added Needs Triage Issue requires triage New Feature labels Dec 10, 2023
@glemaitre
Member

This is not that simple: I don't think the random splitter currently supports missing values, e.g.:

    # TODO: Pass in best.n_missing when random splitter supports missing values.
    partitioner.partition_samples_final(
        best_split.pos, best_split.threshold, best_split.feature, 0
    )

So we need to change the splitter, not only the tags.

@glemaitre glemaitre removed the Needs Triage Issue requires triage label Dec 11, 2023
@adam2392
Member

Out of curiosity, is there a desire to add missing-value support to the random splitter? I am familiar with the codebase and would love to help move that along if there's interest from the core devs. Or is @thomasjpfan already working on it?

@betatim
Member

betatim commented Dec 13, 2023

I think extending missing-value support to more tree-based estimators is something we'd like to see in the library. So a PR would be great! (There is of course the usual caveat: someone has to have time to review it, and other things might get reviewed first. But you've contributed to scikit-learn before, so I think you know that :D)

@thomasjpfan
Member

I'll be happy to review a PR adding missing-value support to the random splitter. I recommend starting by enabling missing-value support for the decision tree, with a follow-up PR that adds ExtraTrees.

As with the original missing values PR, we need to be careful and not introduce performance regressions.

There is one design decision to make that will influence the implementation: Do we want DecisionTreeClassifier(splitter="random", random_state=0) to give the same model with different scikit-learn versions?

@adam2392
Member

There is one design decision to make that will influence the implementation: Do we want DecisionTreeClassifier(splitter="random", random_state=0) to give the same model with different scikit-learn versions?

My opinion is that giving the same model sounds reasonable. I'm not too familiar with the intricacies here. Besides the random number generation, is there anything else we need to control?

@thomasjpfan
Member

Besides the random number generation, is there anything else we need to control?

The random number generation is the only thing to keep in mind.

My opinion is giving the same model sounds reasonable.

If we want to keep the same model, then I think the implementation becomes more complicated. We need to be very careful about drawing random values when considering the missing values: any additional interaction with the original random state object will change the model. We would likely need to create another RNG object for any randomness used by missing-value splitting.

Overall, I think it would be simpler to not have the same model. But it is a decision we need to make.
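The trade-off above can be illustrated with a small, self-contained sketch (the names `thresholds`, `handle_missing`, and `separate_rng` are hypothetical stand-ins, not the actual Cython splitter): any extra draw made for missing-value handling shifts every subsequent draw from a shared stream, while routing those draws through a dedicated second generator leaves the main stream, and hence the old model, untouched.

```python
import random

def thresholds(seed, handle_missing, separate_rng=False):
    """Toy stand-in for the random splitter's sequence of threshold draws."""
    rng = random.Random(seed)
    # Hypothetical second generator dedicated to missing-value randomness.
    missing_rng = random.Random(seed + 1) if separate_rng else rng
    out = []
    for _ in range(3):
        if handle_missing:
            # e.g. randomly decide which side the NaNs are sent to
            missing_rng.random()
        out.append(rng.random())
    return out

baseline = thresholds(0, handle_missing=False)
shared = thresholds(0, handle_missing=True)                # extra draws consumed
separate = thresholds(0, handle_missing=True, separate_rng=True)
print(baseline == shared, baseline == separate)
```

With a shared stream the thresholds diverge from the baseline (a different model despite the same `random_state`); with the dedicated generator they match, which is exactly the bookkeeping burden being weighed here.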

@adam2392
Member

I see. Maintaining a separate random number generator purely for backwards-compatibility's sake seems quite cumbersome. As a user, I would be okay with not having the same model, given this cost.

@adam2392
Member

@mglowacki100 perhaps you might be interested in taking a look at #27966 and seeing if it meets your interests?

@thomasjpfan I followed your workflow from the missing-value decision-tree PRs and think it might be ready for review if you end up having time.

Thanks!
