
[MRG] ENH Add support for missing values to Tree based Classifiers #5974


Closed
wants to merge 8 commits

Conversation

raghavrv
Member

@raghavrv raghavrv commented Dec 7, 2015

Fixes #5870 (Adds support to tree based classifiers, excluding ensemble methods)


For current status, notes, references - https://github.com/raghavrv/sklearn_dev_sandbox/tree/master/tree_methods_missing_val_support


TODO:

  • Support missing_values in *Tree(s)Classifier / *ForestClassifier
    • Train with missing values
      • DepthFirstTreeBuilder
      • BestFirstTreeBuilder
      • ClassificationCriterion
      • BestSplitter
      • RandomSplitter
      • BestSparseSplitter
      • RandomSparseSplitter
    • Predict with missing values
      • Send missing-valued samples to a random child if the missing direction is undefined
      • apply_dense
      • apply_sparse_csc
  • Make this work with all the ensemble methods
  • Add drop_values function to generate missing values. - [MRG+2-1] ENH add a ValueDropper to artificially insert missing values (NMAR or MCAR) to the dataset #7084
  • Add Example 1 illustrating the method in comparison with imputer
  • Add Example 2 comparing MCAR and MNAR.
  • Add a section for narrative docs
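The prediction-time rule above ("send randomly if the missing direction is undefined") can be sketched in plain Python. This is only an illustration of the idea; the node layout and all names here are hypothetical, and the PR's actual implementation lives in the Cython tree code (`apply_dense` etc.):

```python
import random

# Hypothetical node layout: a dict with "feature", "threshold",
# "missing_dir" ("left", "right", or None if no missing values were
# seen for this feature during training), and child nodes. Leaves are
# plain class labels.
def route(node, x, rng):
    if not isinstance(node, dict):       # leaf: return its prediction
        return node
    value = x[node["feature"]]
    if value != value:                   # NaN check (NaN != NaN)
        direction = node["missing_dir"]
        if direction is None:            # undefined: send randomly
            direction = rng.choice(["left", "right"])
    else:
        direction = "left" if value <= node["threshold"] else "right"
    return route(node[direction], x, rng)

tree = {"feature": 0, "threshold": 0.5, "missing_dir": "left",
        "left": 0, "right": 1}
rng = random.Random(0)
print(route(tree, [0.2], rng))           # 0: value <= threshold, goes left
print(route(tree, [float("nan")], rng))  # 0: missing, follows missing_dir
```

A sample with an undefined missing direction is dispatched by the RNG, so repeated predictions on such a sample need not agree.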

NOTE:

  • The 2 other promising alternative methods are
    • Surrogate splits, as done in rpart - Seems promising with respect to the relative accuracy scores reported in Ding and Simonoff's paper - Needs some refactoring of our API for this to work - Widely used - Importantly, this will work even if the training data had no missing values.
    • Probabilistic split - This sends the missing-valued samples to both the left and right children, but scales each such sample's weight in a child by the fraction of non-missing weight sent to that side, so that the total weight at the parent node is preserved. This seems to be the most widely used method apart from imputation. Gilles feels this cannot be easily accomplished with our current API.
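The probabilistic split mentioned above can be sketched with NumPy. This is a minimal illustration of the weighting scheme, not the PR's implementation; the function name and signature are hypothetical:

```python
import numpy as np

def probabilistic_split(X_col, weights, threshold):
    """Split one feature column; missing-valued samples go to BOTH
    children, with their weight scaled by the observed left fraction."""
    missing = np.isnan(X_col)
    go_left = ~missing & (X_col <= threshold)
    go_right = ~missing & (X_col > threshold)
    # fraction of the observed (non-missing) weight sent left
    p_left = weights[go_left].sum() / weights[~missing].sum()
    left_idx = np.where(go_left | missing)[0]
    right_idx = np.where(go_right | missing)[0]
    w_left = weights[left_idx].copy()
    w_right = weights[right_idx].copy()
    # down-weight the duplicated missing samples on each side
    w_left[np.isnan(X_col[left_idx])] *= p_left
    w_right[np.isnan(X_col[right_idx])] *= 1.0 - p_left
    return (left_idx, w_left), (right_idx, w_right)

X_col = np.array([0.1, 0.4, 0.9, np.nan])
w = np.ones(4)
(li, wl), (ri, wr) = probabilistic_split(X_col, w, threshold=0.5)
print(wl.sum() + wr.sum())  # 4.0: total child weight == parent weight
```

Note how each missing-valued sample appears in both children, which is why Gilles suspects this does not fit the current single-destination splitter API without deeper changes.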

CC: @agramfort @glouppe @jmschrei @arjoly @tguillemot

Thanks a lot to @glouppe, @agramfort, @TomDLT & @vighneshbirodkar for all the patience and help (in and out of github)!

@raghavrv raghavrv force-pushed the missing_values_rf branch 7 times, most recently from b5e8502 to e31c913 Compare December 8, 2015 02:18
@raghavrv raghavrv force-pushed the missing_values_rf branch 5 times, most recently from 737e500 to 4149080 Compare December 21, 2015 17:26
@raghavrv raghavrv force-pushed the missing_values_rf branch 3 times, most recently from 9784fec to f0b1a51 Compare January 3, 2016 17:06
@raghavrv raghavrv mentioned this pull request Jan 4, 2016
@raghavrv raghavrv changed the title [WIP][ENH] Random forests - Support missing values [WIP][ENH] Add support for missing values to Tree Classifiers/Regressors Jan 5, 2016
@raghavrv raghavrv force-pushed the missing_values_rf branch 2 times, most recently from 29ef725 to 015c242 Compare January 11, 2016 03:30
@raghavrv
Member Author

@agramfort @glouppe Apologies for the ridiculous delay! Training now takes the missing values into account. Could you please take a look and tell me if my approach is correct?

@glouppe
Contributor

glouppe commented Jan 11, 2016

I'll try to give you some feedback ASAP. Don't hesitate to ping me if you don't see anything from me in the next few days.

Also cc: @jmschrei @arjoly @pprett

@@ -66,12 +71,16 @@ cdef class Criterion:
The total weight of the samples being considered
samples: array-like, dtype=DOUBLE_t
Indices of the samples in X and y, where samples[start:end]
correspond to the samples in this node
correspond to the non-missing valued samples alone.

"non-missing valued" sounds weird. 'correspond to values which are not missing'

@Erotemic
Contributor

Erotemic commented Mar 7, 2017

@raghavrv I'm curious what the status of the smaller PRs is. Which of your branches do those PRs correspond to?

I've been depending on this branch in my current work to handle missing values in my random forest training. It would be nice to see this fully merged into master, and I'd be willing to help out to get it there.

@raghavrv
Member Author

raghavrv commented Mar 7, 2017

> I've been depending on this branch in my current work to handle missing values in my random forest training. It would be nice to see this fully merged into master, and I'd be willing to help out to get it there

Oh really? Nice to hear! :) How did you find the performance and usability? Do you have any benchmarks, comments, or shortcomings to share? I'd be really interested... I could update it with current master if you want me to.

I got busy with something else, but the state of this PR is: done and waiting for reviews. Except we have planned not to support all cases in the first go, and instead to add a small subset of what is here (say, only dense matrices, the best splitter, and depth-first tree building) to make the review a less daunting task...

@raghavrv
Member Author

raghavrv commented Mar 7, 2017

> I'm curious what the status of the smaller PRs are. Which of your branches those PRs correspond to?

Ah sorry, that was a specific question... :) I've not made those smaller branches yet... I'll try to make them soon. Next week? (It's always motivating if someone finds your work useful ;))

@Erotemic
Contributor

Erotemic commented Mar 7, 2017

I've only been using this at a basic level, setting missing_values=np.nan when I create a RandomForestClassifier. I have a special hand-crafted dense feature vector on which I need to perform probabilistic 3-state classification (determine whether two images show the same individual animal: yes, no, or not-enough-information). Some of the measurements, like time and GPS on the photos, are unreliable, so when building my feature vector some dimensions will be NaN; these dimensions will be MAR. In some cases the fact that a dimension is NaN might even have significance (MNAR), which is what led me to this PR.

I would link to an example of my code, but unfortunately we've had to close-source the repo for political reasons until the work is published. Instead I can link a gist that shows a toy example I made when testing out this PR: https://gist.github.com/Erotemic/c9532be23e21a44a63d0cc77ed2e65fd

As far as rebasing the branch on master goes, that would be extremely helpful. I've had to write a script to do this and recover from conflicts, because I need to integrate this functionality into sklearn whenever I install a Python environment on a new machine.

@afiodorov

@raghavrv the latest rebase lives here: https://github.com/unravelin/scikit-learn/tree/rf_missing. I need this branch to have better control over handling missing values at the company I work for.

@raghavrv
Member Author

raghavrv commented May 2, 2017

@Erotemic and others - @glemaitre and I are working on a rewrite of the tree code to make it parallel (and, if possible, faster even in single-threaded mode)... I'm unable to spare the bandwidth to keep this branch rebased :/

@afiodorov Thanks for sharing the rebase! Much appreciated... If you could keep it rebased and synced with latest master for another month or two, it would be awesome! I'll take it from then... Possibly by introducing the missing value functionality into the new rewrite...

@Erotemic
Contributor

Erotemic commented May 5, 2017

@raghavrv I completely understand the scarcity of bandwidth. It's much more important to have the rewrite of the branch (and I'm very much hoping it includes missing value support).

@ashimb9
Contributor

ashimb9 commented Aug 20, 2017

Will this PR eventually support a random-forest-based missing data imputer in the vein of missForest? A quick read through the discussion suggests that it will not, and that it is mostly focused on fitting decision trees / RFs in the presence of NaN rather than imputing them. Is that correct, or did I miss something? I ask because I was thinking about working on a missForest-type Imputer and just wanted to make sure I was not duplicating any work. Would much appreciate your feedback. Thanks!
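For reference, a missForest-style imputer (iteratively regressing each feature on the others with a forest) can be approximated with scikit-learn's experimental IterativeImputer. This is a rough sketch, separate from this PR, and the small data set here is made up for illustration:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(50, 3)
X[rng.rand(50, 3) < 0.1] = np.nan  # drop ~10% of entries at random (MCAR)

# missForest-style: each feature with missing entries is iteratively
# predicted from the other features by a random forest regressor
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    max_iter=5, random_state=0)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).any())  # no NaNs remain after imputation
```

This imputes NaNs before fitting any downstream model, which is a different approach from the present PR's goal of handling NaNs inside the tree fitting itself.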

@Erotemic
Contributor

@ashimb9 this branch will allow the feature vectors passed to the fit method of the random forest to contain NaN values.

I'm not familiar with missForest, but I can tell you that this PR does not define any new Imputers.

@ashimb9
Contributor

ashimb9 commented Aug 21, 2017

@Erotemic Yeah, I did not think an Imputer was part of things but just wanted to make sure. Anyway, thank you for responding.

PS: In case you were wondering, missForest is an R package for imputing missing values using random forests.

@Yashg19

Yashg19 commented Oct 9, 2017

Hey @raghavrv and others, do you know when this PR will be available in production? It's definitely helpful. Thanks!

@jnothman
Member

jnothman commented Oct 9, 2017 via email

@Gitman-code

What is the status of this? Like many people, I would be interested in having the surrogate splitting method in sklearn's trees.

@jnothman
Member

@DrEhrfurchtgebietend perhaps see #5870. But note that raghavrv is no longer with us.

@jjerphan
Member

jjerphan commented Mar 8, 2023

Thank you for having explored this support.

In the meantime, the codebase evolved and #23595 is now more relevant than this PR, which I think can be closed.

@jjerphan jjerphan closed this Mar 8, 2023

Successfully merging this pull request may close these issues.

[RFC] Missing values in RandomForest