
[MRG] ENH Add support for missing values to Tree based Classifiers #5974


Closed
wants to merge 8 commits

Conversation

raghavrv
Member

@raghavrv raghavrv commented Dec 7, 2015

Fixes #5870 (Adds support to tree based classifiers, excluding ensemble methods)


For current status, notes, references - https://github.com/raghavrv/sklearn_dev_sandbox/tree/master/tree_methods_missing_val_support


TODO:

  • Support missing_values in *Tree(s)Classifier / *ForestClassifier
    • Train with missing values
      • DepthFirstTreeBuilder
      • BestFirstTreeBuilder
      • ClassificationCriterion
      • BestSplitter
      • RandomSplitter
      • BestSparseSplitter
      • RandomSparseSplitter
    • Predict with missing values
      • Send missing-valued samples to a random child if the missing direction is undefined
      • apply_dense
      • apply_sparse_csc
  • Make this work with all the ensemble methods
  • Add drop_values function to generate missing values. - [MRG+2-1] ENH add a ValueDropper to artificially insert missing values (NMAR or MCAR) to the dataset #7084
  • Add Example 1 illustrating the method in comparison with imputer
  • Add Example 2 comparing MCAR and MNAR.
  • Add a section for narrative docs
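The prediction-time rule above ("send randomly if the missing direction is undefined") can be sketched in plain Python. This is only an illustration of the idea; the node layout and all names here are hypothetical, and the PR's actual implementation lives in the Cython tree code (`apply_dense` etc.):

```python
import random

# Hypothetical node layout: a dict with "feature", "threshold",
# "missing_dir" ("left", "right", or None if no missing values were
# seen for this feature during training), and child nodes. Leaves are
# plain class labels.
def route(node, x, rng):
    if not isinstance(node, dict):       # leaf: return its prediction
        return node
    value = x[node["feature"]]
    if value != value:                   # NaN check (NaN != NaN)
        direction = node["missing_dir"]
        if direction is None:            # undefined: send randomly
            direction = rng.choice(["left", "right"])
    else:
        direction = "left" if value <= node["threshold"] else "right"
    return route(node[direction], x, rng)

tree = {"feature": 0, "threshold": 0.5, "missing_dir": "left",
        "left": 0, "right": 1}
rng = random.Random(0)
print(route(tree, [0.2], rng))           # 0: value <= threshold, goes left
print(route(tree, [float("nan")], rng))  # 0: missing, follows missing_dir
```

A sample with an undefined missing direction is dispatched by the RNG, so repeated predictions on such a sample need not agree.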

NOTE:

  • The 2 other promising alternative methods are
    • Surrogate splits, as done in rpart - Seems promising with respect to the relative accuracy scores reported in Ding and Simonoff's paper - Needs some refactoring of our API for this to work - Widely used - Importantly, this will work even if the training data had no missing values.
    • Probabilistic split - This sends the missing-valued samples to both the left and right children, but scales each such sample's weight in a child by the fraction of non-missing weight sent to that side, so that the total weight at the parent node is preserved. This seems to be the most widely used method apart from imputation. Gilles feels this cannot be easily accomplished with our current API.
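The probabilistic split mentioned above can be sketched with NumPy. This is a minimal illustration of the weighting scheme, not the PR's implementation; the function name and signature are hypothetical:

```python
import numpy as np

def probabilistic_split(X_col, weights, threshold):
    """Split one feature column; missing-valued samples go to BOTH
    children, with their weight scaled by the observed left fraction."""
    missing = np.isnan(X_col)
    go_left = ~missing & (X_col <= threshold)
    go_right = ~missing & (X_col > threshold)
    # fraction of the observed (non-missing) weight sent left
    p_left = weights[go_left].sum() / weights[~missing].sum()
    left_idx = np.where(go_left | missing)[0]
    right_idx = np.where(go_right | missing)[0]
    w_left = weights[left_idx].copy()
    w_right = weights[right_idx].copy()
    # down-weight the duplicated missing samples on each side
    w_left[np.isnan(X_col[left_idx])] *= p_left
    w_right[np.isnan(X_col[right_idx])] *= 1.0 - p_left
    return (left_idx, w_left), (right_idx, w_right)

X_col = np.array([0.1, 0.4, 0.9, np.nan])
w = np.ones(4)
(li, wl), (ri, wr) = probabilistic_split(X_col, w, threshold=0.5)
print(wl.sum() + wr.sum())  # 4.0: total child weight == parent weight
```

Note how each missing-valued sample appears in both children, which is why Gilles suspects this does not fit the current single-destination splitter API without deeper changes.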

CC: @agramfort @glouppe @jmschrei @arjoly @tguillemot

Thanks a lot to @glouppe, @agramfort, @TomDLT & @vighneshbirodkar for all the patience and help (in and out of github)!

@raghavrv raghavrv force-pushed the missing_values_rf branch 7 times, most recently from b5e8502 to e31c913 Compare December 8, 2015 02:18
@raghavrv raghavrv force-pushed the missing_values_rf branch 5 times, most recently from 737e500 to 4149080 Compare December 21, 2015 17:26
@raghavrv raghavrv force-pushed the missing_values_rf branch 3 times, most recently from 9784fec to f0b1a51 Compare January 3, 2016 17:06
@raghavrv raghavrv mentioned this pull request Jan 4, 2016
@raghavrv raghavrv changed the title [WIP][ENH] Random forests - Support missing values [WIP][ENH] Add support for missing values to Tree Classifiers/Regressors Jan 5, 2016
@raghavrv raghavrv force-pushed the missing_values_rf branch 2 times, most recently from 29ef725 to 015c242 Compare January 11, 2016 03:30
@raghavrv
Member Author

@agramfort @glouppe Apologies for the ridiculous delay! Training now takes the missing values into account. Could you please take a look and tell me if my approach is correct?

@glouppe
Contributor

glouppe commented Jan 11, 2016

I'll try to give you some feedback ASAP. Don't hesitate to ping me if you don't see anything from me in the next few days.

Also cc: @jmschrei @arjoly @pprett

@@ -66,12 +71,16 @@ cdef class Criterion:
The total weight of the samples being considered
samples: array-like, dtype=DOUBLE_t
Indices of the samples in X and y, where samples[start:end]
correspond to the samples in this node
correspond to the non-missing valued samples alone.

"non-missing valued" sounds weird. 'correspond to values which are not missing'

@Erotemic
Contributor

Erotemic commented Mar 7, 2017

@raghavrv I'm curious what the status of the smaller PRs is. Which of your branches do those PRs correspond to?

I've been depending on this branch in my current work to handle missing values in my random forest training. It would be nice to see this fully merged into master, and I'd be willing to help out to get it there.

@raghavrv
Member Author

raghavrv commented Mar 7, 2017

> I've been depending on this branch in my current work to handle missing values in my random forest training. It would be nice to see this fully merged into master, and I'd be willing to help out to get it there

Oh really? Nice to hear! :) How did you find the performance and usability? Do you have any benchmarks, comments, or shortcomings to share? I'd be really interested... I could update it with current master if you want me to.

I got busy with something else, but the state of this PR is: done and waiting for reviews. Except we have planned not to support all cases in the first go, and instead to add a small subset of what is here (say, only dense matrices, the best splitter, and depth-first tree building) to make the review a less daunting task...

@raghavrv
Member Author

raghavrv commented Mar 7, 2017

> I'm curious what the status of the smaller PRs are. Which of your branches those PRs correspond to?

Ah sorry, that was a specific question... :) I've not made those smaller branches yet... I'll try to make them soon. Next week? (It's always motivating if someone finds your work useful ;))

@Erotemic
Contributor

Erotemic commented Mar 7, 2017

I've only been using this at a basic level, setting missing_values=np.nan when I create a RandomForestClassifier. I have a special hand-crafted dense feature vector on which I need to perform probabilistic 3-state classification (determine whether two images show the same individual animal: yes, no, or not-enough-information). Some of the measurements, like time and GPS on the photos, are unreliable, so when building my feature vector some dimensions will be NaN; these dimensions will be MAR. In some cases the fact that a dimension is NaN might even have significance (MNAR), which is what led me to this PR.

I would link to an example of my code, but unfortunately we've had to close-source the repo for political reasons until the work is published. Instead I can link a gist that shows a toy example I made when testing out this PR: https://gist.github.com/Erotemic/c9532be23e21a44a63d0cc77ed2e65fd

As far as rebasing the branch on master goes, that would be extremely helpful. I've had to write a script to do this and recover from conflicts, because I need to integrate this functionality into sklearn whenever I install a Python environment on a new machine.

@afiodorov

@raghavrv the latest rebase lives here: https://github.com/unravelin/scikit-learn/tree/rf_missing. I need this branch to have better control over handling missing values at the company I work for.

@raghavrv
Member Author

raghavrv commented May 2, 2017

@Erotemic and others - @glemaitre and I are working on a rewrite of the tree code to make it parallel (and, if possible, faster even in single-threaded mode)... I'm unable to spare the bandwidth to keep this branch rebased :/

@afiodorov Thanks for sharing the rebase! Much appreciated... If you could keep it rebased and synced with latest master for another month or two, it would be awesome! I'll take it from then... Possibly by introducing the missing value functionality into the new rewrite...

@Erotemic
Contributor

Erotemic commented May 5, 2017

@raghavrv I completely understand the scarcity of bandwidth. It's much more important to have the rewrite of the branch (and I'm very much hoping it includes missing value support).

@ashimb9
Contributor

ashimb9 commented Aug 20, 2017

Will this PR eventually support a random-forest-based missing data imputer in the vein of missForest? A quick read through the discussion suggests that it will not, and that it is mostly focused on fitting decision trees / RFs in the presence of NaN rather than imputing them. Is that correct, or did I miss something? I ask because I was thinking about working on a missForest-type Imputer and just wanted to make sure I was not duplicating any work. Would much appreciate your feedback. Thanks!
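For reference, a missForest-style imputer (iteratively regressing each feature on the others with a forest) can be approximated with scikit-learn's experimental IterativeImputer. This is a rough sketch, separate from this PR, and the small data set here is made up for illustration:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(50, 3)
X[rng.rand(50, 3) < 0.1] = np.nan  # drop ~10% of entries at random (MCAR)

# missForest-style: each feature with missing entries is iteratively
# predicted from the other features by a random forest regressor
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    max_iter=5, random_state=0)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).any())  # no NaNs remain after imputation
```

This imputes NaNs before fitting any downstream model, which is a different approach from the present PR's goal of handling NaNs inside the tree fitting itself.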

@Erotemic
Contributor

@ashimb9 this branch will allow the feature vectors passed to the fit method of the random forest to contain NaN values.

I'm not familiar with missForest, but I can tell you that this PR does not define any new Imputers.

@ashimb9
Contributor

ashimb9 commented Aug 21, 2017

@Erotemic Yeah, I did not think an Imputer was part of things but just wanted to make sure. Anyway, thank you for responding.

PS: In case you were wondering, missForest is an R package for imputing missing values using random forests.

@Yashg19

Yashg19 commented Oct 9, 2017

Hey @raghavrv and others, do you know when this PR will be available in production? It's definitely helpful. Thanks!

@jnothman
Member

jnothman commented Oct 9, 2017 via email

@Gitman-code

What is the status of this? Like many people, I would be interested in having the surrogate splitting method in sklearn's trees.

@jnothman
Member

@DrEhrfurchtgebietend perhaps see #5870. But note that raghavrv is no longer with us.

@jjerphan
Member

jjerphan commented Mar 8, 2023

Thank you for having explored this support.

In the meantime, the codebase evolved and #23595 is now more relevant than this PR, which I think can be closed.

@jjerphan jjerphan closed this Mar 8, 2023

Successfully merging this pull request may close these issues.

[RFC] Missing values in RandomForest