ENH support for missing values in ExtraTrees #27931


Closed
mglowacki100 opened this issue Dec 10, 2023 · 8 comments · Fixed by #28268

Comments

@mglowacki100

Describe the workflow you want to enable

Inspired by #26391, I think that support for missing values should also be provided for the ExtraTrees regressor and classifier.

Describe your proposed solution

I think the foundational work has already been done by @thomasjpfan in #26391. Besides tests and documentation, enabling NaN handling should only require modifying sklearn/tree/_classes.py:
For ExtraTreeRegressor, add the method:

    def _more_tags(self):
        # XXX: NaN is only supported for dense arrays, but we set this for the
        # common tests to pass, specifically: check_estimators_nan_inf
        allow_nan = self.criterion in {
            "squared_error",
            "friedman_mse",
            "poisson",
        }
        return {"allow_nan": allow_nan}

For ExtraTreeClassifier, add the method:

    def _more_tags(self):
        # XXX: NaN is only supported for dense arrays, but we set this for the
        # common tests to pass, specifically: check_estimators_nan_inf
        allow_nan = self.criterion in {
            "gini",
            "log_loss",
            "entropy",
        }
        return {"multilabel": True, "allow_nan": allow_nan}

I've run the code locally, and it appears to work as expected. However, I must emphasize that my testing was not exhaustive, and I may have overlooked something obvious.

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@mglowacki100 mglowacki100 added Needs Triage Issue requires triage New Feature labels Dec 10, 2023
@glemaitre
Member

This is not that simple: I don't think the random splitter currently supports missing values, e.g.:

    # TODO: Pass in best.n_missing when random splitter supports missing values.
    partitioner.partition_samples_final(
        best_split.pos, best_split.threshold, best_split.feature, 0
    )

So we need to change the splitter, not only the tags.

@glemaitre glemaitre removed the Needs Triage Issue requires triage label Dec 11, 2023
@adam2392
Member

Out of curiosity, is there a desire to add missing-value support to the random splitter? I am familiar with the codebase and would love to help move that along if there's interest from the core devs. Or is @thomasjpfan already working on it?

@betatim
Member

betatim commented Dec 13, 2023

I think extending missing-value support to more tree-based estimators is something we'd like to see in the library. So a PR would be great! (There is of course the usual caveat: someone has to have time to review it, and other things might get reviewed first. But you've contributed to scikit-learn before, so I think you know that :D)

@thomasjpfan
Member

I'll be happy to review a PR adding missing-value support to the random splitter. I recommend starting by enabling missing-value support for the decision tree, with a follow-up PR that adds ExtraTrees.

As with the original missing values PR, we need to be careful and not introduce performance regressions.

There is one design decision to make that will influence the implementation: Do we want DecisionTreeClassifier(splitter="random", random_state=0) to give the same model with different scikit-learn versions?

@adam2392
Member

There is one design decision to make that will influence the implementation: Do we want DecisionTreeClassifier(splitter="random", random_state=0) to give the same model with different scikit-learn versions?

My opinion is that giving the same model sounds reasonable. I'm not too familiar with the intricacies here. Besides the random number generation, is there anything else we need to control?

@thomasjpfan
Member

Besides the random number generation, is there anything else we need to control?

The random number generation is the only thing to keep in mind.

My opinion is giving the same model sounds reasonable.

If we want to keep the same model, then I think the implementation becomes more complicated. We need to be very careful about drawing random values when considering the missing values: any additional interaction with the original random state object will change the model. We would likely need to create another RNG object for any randomness used by missing-value splitting.

Overall, I think it would be simpler to not have the same model. But it is a decision we need to make.
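The trade-off above can be illustrated with a small, self-contained sketch (the names `thresholds`, `handle_missing`, and `separate_rng` are hypothetical stand-ins, not the actual Cython splitter): any extra draw made for missing-value handling shifts every subsequent draw from a shared stream, while routing those draws through a dedicated second generator leaves the main stream, and hence the old model, untouched.

```python
import random

def thresholds(seed, handle_missing, separate_rng=False):
    """Toy stand-in for the random splitter's sequence of threshold draws."""
    rng = random.Random(seed)
    # Hypothetical second generator dedicated to missing-value randomness.
    missing_rng = random.Random(seed + 1) if separate_rng else rng
    out = []
    for _ in range(3):
        if handle_missing:
            # e.g. randomly decide which side the NaNs are sent to
            missing_rng.random()
        out.append(rng.random())
    return out

baseline = thresholds(0, handle_missing=False)
shared = thresholds(0, handle_missing=True)                # extra draws consumed
separate = thresholds(0, handle_missing=True, separate_rng=True)
print(baseline == shared, baseline == separate)
```

With a shared stream the thresholds diverge from the baseline (a different model despite the same `random_state`); with the dedicated generator they match, which is exactly the bookkeeping burden being weighed here.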

@adam2392
Member

I see. Maintaining a separate random number generator purely for backwards-compatibility's sake seems quite cumbersome. As a user, I would be okay with not having the same model, given this cost.

@adam2392
Member

@mglowacki100 perhaps you might be interested in taking a look at #27966 and seeing if it meets your interests?

@thomasjpfan I followed your workflow from the missing-value decision-tree PRs and think it might be ready for review if you end up having time.

Thanks!
