[MRG] ENH Add support for missing values to Tree based Classifiers #5974
Conversation
@agramfort @glouppe Apologies for the ridiculous delay! Training is now done taking the missing values into account. Could you please take a look and tell me if my approach is correct?
sklearn/tree/_criterion.pyx (outdated)

```diff
@@ -66,12 +71,16 @@ cdef class Criterion:
         The total weight of the samples being considered
     samples: array-like, dtype=DOUBLE_t
         Indices of the samples in X and y, where samples[start:end]
-        correspond to the samples in this node
+        correspond to the non-missing valued samples alone.
```
"non-missing valued" sounds weird. 'correspond to values which are not missing'
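To make the convention under discussion concrete, here is a hedged plain-Python illustration (the actual code is Cython and works in place over an index array; `present` and `missing` are illustrative names, not identifiers from the PR):

```python
import numpy as np

# Within a node, indices of rows whose value for the current feature is
# present are packed into samples[start:end]; missing-valued rows are
# kept apart so the criterion only sees the non-missing samples.
X_col = np.array([0.3, np.nan, 0.7, np.nan, 0.1])
samples = np.arange(len(X_col))
present = samples[~np.isnan(X_col)]
missing = samples[np.isnan(X_col)]
samples = np.concatenate([present, missing])
start, end = 0, len(present)
print(samples[start:end])  # indices of the samples whose values are not missing
```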
@raghavrv I'm curious what the status of the smaller PRs is. Which of your branches do those PRs correspond to? I've been depending on this branch in my current work to handle missing values in my random forest training. It would be nice to see this fully merged into master, and I'd be willing to help out to get it there.
Oh really? Nice to hear! :) How did you find the performance and usability? Do you have any benchmarks, comments, or shortcomings to share? I'd be really interested... I could update it with current master if you want me to. I got busy with another thing, but the state of this PR is: done and waiting for reviews. Except we have planned not to support all cases in the first go, and instead to add a small subset of what is here (say, only dense matrices, the best splitter, and depth-first tree building) to make the review a less daunting task...
Ah sorry, that was a specific question... :) I've not made those smaller branches yet... I'll try to make them soon. Next week? (It's always motivating if someone finds your work useful ;))
I've only been using this at a basic level so far. I would link to an example of my code, but unfortunately we've had to close-source the repo for political reasons until the work is published. Instead, I can link a gist that shows a toy example I made when testing out this PR: https://gist.github.com/Erotemic/c9532be23e21a44a63d0cc77ed2e65fd As for rebasing the branch on master, that would be extremely helpful. I've had to write a script to do the rebase and recover from conflicts, because I need to integrate this functionality into sklearn whenever I install a Python environment on a new machine.
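For readers who want the shape of that basic usage without the gist, here is a minimal sketch, assuming (as this branch intends, and unlike the scikit-learn releases of the time) that NaN-encoded missing values are accepted by fit and predict directly:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data; labels are derived before dropping values so y has no NaN.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)
X[rng.rand(*X.shape) < 0.1] = np.nan  # ~10% of entries go missing (MCAR)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)              # stock releases of that era would raise here
print(clf.predict(X[:5]))  # ...NaNs are handled at prediction time too
```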
@raghavrv the latest rebase lives here: https://github.com/unravelin/scikit-learn/tree/rf_missing. I need this branch to have better control over handling missing values at the company I work for.
@Erotemic and others - @glemaitre and I are working on a rewrite of the tree code to make it parallel (and, if possible, faster even in single-threaded mode)... I'm unable to spare the bandwidth to keep this branch rebased :/ @afiodorov Thanks for sharing the rebase! Much appreciated... If you could keep it rebased and synced with latest master for another month or two, that would be awesome! I'll take it from there... possibly by introducing the missing-value functionality into the new rewrite...
@raghavrv I completely understand the scarcity of bandwidth. It's much more important to have the rewrite of the branch (and I'm very much hoping it includes missing-value support).
Will this PR eventually support a random-forest-based missing-data imputer in the vein of missForest? A quick read through the discussion suggests that it will not, and that it is mostly focused on fitting decision trees / RFs in the presence of NaN rather than imputing them. Is that correct, or did I miss something? I ask because I was thinking about working on a missForest-type Imputer and just wanted to make sure I was not duplicating any work. Would much appreciate your feedback. Thanks!
@ashimb9 this branch will allow feature vectors passed to fit and predict to contain missing values; it does not add an imputer. I'm not familiar with what a missForest is, though.
@Erotemic Yeah, I did not think an Imputer was part of this, but just wanted to make sure. Anyway, thank you for responding. PS: In case you were wondering, missForest is an R package for imputing missing values using random forests.
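As an aside for later readers: a missForest-style procedure can be roughly approximated in current scikit-learn with the (experimental) IterativeImputer driven by a random forest. This is a sketch of the idea, not the R package's implementation:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy matrix with ~20% of entries missing completely at random.
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
X[rng.rand(*X.shape) < 0.2] = np.nan

# Iteratively regress each feature on the others, missForest-style.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=30, random_state=0),
    max_iter=10,
    random_state=0,
)
X_filled = imputer.fit_transform(X)
assert not np.isnan(X_filled).any()
```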
Hey @raghavrv and others, do you know when this PR will make it into a release? It's definitely helpful. Thanks!
Raghav won't be completing this. I think the main concern here was that it was hard to show that it was helpful in practice, to the extent of justifying the added complexity. Do you have example datasets available to show that this is better than imputation (with the algorithms of https://github.com/hammerlab//fancyimpute, for example)?
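A rough sketch of the kind of comparison being requested: an impute-then-fit baseline scored by cross-validation, with SimpleImputer standing in for fancyimpute's algorithms (which could be slotted into the same pipeline). The native-NaN arm assumes a tree implementation that accepts NaN:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Inject ~10% MCAR missingness into a real dataset.
X, y = load_breast_cancer(return_X_y=True)
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.1] = np.nan

impute_then_fit = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
print(cross_val_score(impute_then_fit, X, y, cv=5).mean())
# With native missing-value support, the bare classifier would be scored
# the same way, with the imputer removed from the pipeline.
```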
What is the status of this? Like many people, I would be interested in having a surrogate-splitting method in sklearn's trees.
@DrEhrfurchtgebietend perhaps see #5870. But note that raghavrv is no longer with us.
Thank you for having explored this support. In the meantime, the codebase has evolved and #23595 is now more relevant than this PR, which I think can be closed.
Fixes #5870 (adds support to tree-based classifiers, excluding ensemble methods).
For current status, notes and references, see https://github.com/raghavrv/sklearn_dev_sandbox/tree/master/tree_methods_missing_val_support
TODO:

- Tree(s)Classifier
- ForestClassifier
- DepthFirstTreeBuilder
- BestFirstTreeBuilder
- ClassificationCriterion
- BestSplitter
- RandomSplitter
- BestSparseSplitter
- RandomSparseSplitter
- apply_dense
- apply_sparse_csc
- drop_values - function to generate missing values; see [MRG+2-1] ENH add a ValueDropper to artificially insert missing values (NMAR or MCAR) to the dataset (#7084). A rough MCAR sketch follows the list below.
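A hedged sketch of the MCAR value dropping that #7084 proposes (`drop_values_mcar` is an illustrative name, not the ValueDropper API, which also covers NMAR missingness and a proper estimator interface):

```python
import numpy as np

def drop_values_mcar(X, missing_rate=0.1, random_state=0):
    """Return a copy of X with entries set to NaN completely at random.

    Illustrative only; not the ValueDropper from #7084.
    """
    rng = np.random.RandomState(random_state)
    X = np.asarray(X, dtype=float).copy()
    mask = rng.rand(*X.shape) < missing_rate
    X[mask] = np.nan
    return X

X = drop_values_mcar(np.arange(20.0).reshape(5, 4), missing_rate=0.25)
print(np.isnan(X).mean())  # roughly the requested missing rate
```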
NOTE on rpart's approach:
- Seems promising with respect to the relative accuracy scores reported in Ding and Simonoff's paper
- Needs some refactoring of our API for this to work
- Widely used; importantly, this will work even if the training data had no missing values

CC: @agramfort @glouppe @jmschrei @arjoly @tguillemot
Thanks a lot to @glouppe, @agramfort, @TomDLT & @vighneshbirodkar for all the patience and help (in and out of GitHub)!