ENH support for missing values in ExtraTrees #27931
Comments
This is not as simple: I don't think the random splitter currently supports missing values, see e.g. scikit-learn/sklearn/tree/_splitter.pyx, lines 829 to 832 in 0496f48.
So we need to change the splitter, not only the tags.
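For context, an extra-trees split draws a uniform random threshold between a feature's observed minimum and maximum; supporting missing values means the splitter must additionally decide which side NaN samples are routed to. A toy sketch of that idea (illustrative only, not the actual Cython splitter in `_splitter.pyx`; the function name and `missing_go_left` parameter are made up for this example):

```python
import numpy as np

def random_split_with_missing(x, rng, missing_go_left=True):
    """Toy extra-trees-style split on one feature.

    Draws a uniform threshold between the observed (non-NaN) min and max,
    then routes NaN samples to one chosen side.
    """
    is_missing = np.isnan(x)
    finite = x[~is_missing]
    threshold = rng.uniform(finite.min(), finite.max())
    goes_left = x <= threshold       # NaN <= t is False, so NaNs fall right...
    if missing_go_left:
        goes_left |= is_missing      # ...unless we explicitly send them left
    return threshold, goes_left

rng = np.random.RandomState(42)
x = np.array([0.0, 1.0, np.nan, 3.0, np.nan])
threshold, goes_left = random_split_with_missing(x, rng)
```

In the real splitter the "which side do NaNs go" decision is itself part of the search, which is exactly where the extra randomness discussed below comes from.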
Out of curiosity, is there interest in adding missing-value support to the random splitter? I am familiar with the codebase and would love to help move that along if there's interest from the core devs. Or is @thomasjpfan already working on it?
I think extending missing-value support to more tree-based estimators is something we'd like to see in the library. So a PR would be great! (There is of course always the caveat: someone has to have time to review it, and other things might get reviewed first. But you've contributed to scikit-learn before, so I think you know that :D)
I'll be happy to review a PR for adding missing-value support to the random splitter. I recommend starting with enabling missing-value support for the decision tree, with a follow-up PR that adds ExtraTrees. As with the original missing-values PR, we need to be careful not to introduce performance regressions. There is one design decision to make that will influence the implementation: do we want the splitter, given the same random_state, to produce the same model as before on data that has no missing values?
My opinion is that giving the same model sounds reasonable. I'm not too familiar with the intricacies here. Besides the random number generation, is there anything else we need to control?
The random number generation is the only thing to keep in mind.
If we want to keep the same model, then I think the implementation becomes more complicated. We need to be very careful about drawing random values when considering the missing values: any additional interaction with the original random state object will change the model. We would likely need to create another rng object for any randomness used by missing-value splitting. Overall, I think it would be simpler not to keep the same model, but it is a decision we need to make.
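The point about the random stream can be shown in a few lines of NumPy. A single extra draw for missing-value handling shifts every subsequent draw, changing the resulting model; a dedicated generator (an assumption sketched here, not scikit-learn's actual design) leaves the main stream untouched:

```python
import numpy as np

rng_old = np.random.RandomState(0)  # stands in for the current splitter
rng_new = np.random.RandomState(0)  # same seed, but makes one extra draw

_ = rng_new.uniform()               # extra draw, e.g. for a missing-value split
draws_old = rng_old.uniform(size=3)
draws_new = rng_new.uniform(size=3)
same_model = bool(np.allclose(draws_old, draws_new))   # False: streams diverged

# A separate rng dedicated to missing-value logic keeps the main stream intact
rng_main = np.random.RandomState(0)
rng_missing = np.random.RandomState(1)  # dedicated generator (assumption)
_ = rng_missing.uniform()
draws_main = rng_main.uniform(size=3)
same_again = bool(np.allclose(draws_old, draws_main))  # True: main stream unchanged
```

This is the trade-off under discussion: exact backwards-compatible models require isolating all new randomness in a second generator.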
I see. Maintaining a separate random number generator purely for backwards compatibility's sake seems quite cumbersome. As a user, I would be okay with not having the same model, given that cost.
@mglowacki100 perhaps you might be interested in taking a look at #27966 and seeing if it meets your needs? @thomasjpfan I followed your workflow from the missing-value DecisionTree PRs, and I think it might be ready for review if you end up having time. Thanks!
Describe the workflow you want to enable
Inspired by #26391, I think support for missing values should also be provided for the ExtraTrees regressor and classifier.
Describe your proposed solution
I think the foundational work is already provided by @thomasjpfan in #26391; besides tests and documentation, enabling NaN handling should only require modifying `sklearn/tree/_classes.py`:

- For `ExtraTreeRegressor`, add a method:
- For `ExtraTreeClassifier`, add a method:

I've run the code locally, and it appears to function as expected. However, I must emphasize that my testing was not exhaustive, and I might have overlooked some obvious aspects.