[RFC] Missing values in RandomForest #5870
Am I wrong, or could random forests (or trees in general) be made to
naturally deal with missing data in a sample by ignoring the missing
data when computing the split criterion?
|
Sorry for my lack of clarity... that is what I meant by the second approach... :) |
Sorry for my lack of clarity... that is what I meant by the second approach...
:)
OK. That's the one that I am interested in.
|
Indeed, @GaelVaroquaux is correct. There are built-in ways to handle missing values when building a tree. I am -1 for the proposed approaches. In my opinion, a nice way of handling missing values is to find the best child into which the missing values should be put. More specifically, a split would be composed of three pieces of information (instead of two as currently in master): the feature to split on, the threshold, and the child (left or right) to which the missing-valued samples are sent.
However, to implement this, the search for the best split has to be extended to handle missing values (by considering all splits when missing values are put left + all splits when missing values are put right), which is a non-trivial change. |
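A minimal sketch of that extended search for a single feature, in plain Python/NumPy (the function names and the exhaustive double loop are purely illustrative, not the Cython splitter):

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer label array."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split_with_missing(x, y):
    """For a single feature x (with NaNs) and labels y, try every threshold
    and both choices of where the missing-valued samples go; return
    (threshold, send_missing_left, weighted_impurity)."""
    missing = np.isnan(x)
    x_filled = np.where(missing, 0.0, x)   # placeholder values, masked out below
    best = (None, None, np.inf)
    for t in np.unique(x[~missing]):
        obs_left = ~missing & (x_filled <= t)
        obs_right = ~missing & (x_filled > t)
        for send_missing_left in (True, False):
            if send_missing_left:
                left, right = obs_left | missing, obs_right
            else:
                left, right = obs_left, obs_right | missing
            score = (left.sum() * gini(y[left]) +
                     right.sum() * gini(y[right])) / len(y)
            if score < best[2]:
                best = (t, send_missing_left, score)
    return best
```

Recording which side the missing values were sent to for the winning threshold is exactly the third piece of information the split would have to store.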
This would indeed work for finding a split, but it would still not work at prediction time: you need to decide where to go when you encounter a split defined on a feature that is missing for a sample. |
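At prediction time the direction stored with the split resolves that: a minimal sketch, assuming a hypothetical node structure that carries a missing_goes_left flag:

```python
import math

def predict_one(node, sample):
    """Walk a (hypothetical) tree whose internal nodes store
    feature, threshold, missing_goes_left, left, right,
    and whose leaves store value."""
    while node.left is not None:          # internal node
        v = sample[node.feature]
        if math.isnan(v):
            node = node.left if node.missing_goes_left else node.right
        elif v <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value
```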
I don't know if it might be helpful to compare how [...]. Slightly more up to date and still relevant is http://www.jmlr.org/papers/volume11/ding10a/ding10a.pdf . |
@lesshaste Thanks a lot for the links... I'll look into it :) |
A concern I have is how to handle missing values themselves. How do you handle missing values in a buffer of doubles? |
np.nan ?
|
How is np.nan represented at the buffer level, when the gil is released? |
I don't think that it works, as there are several ways a nan can be represented:
In [5]: a = np.zeros(1)*np.nan
In [6]: a
Out[6]: array([ nan])
In [7]: str(a.data)
Out[7]: '\x00\x00\x00\x00\x00\x00\xf8\x7f'
In [8]: a = np.zeros(1) / 0
/usr/bin/ipython:1: RuntimeWarning: invalid value encountered in divide
  #! /usr/bin/python
In [9]: a
Out[9]: array([ nan])
In [10]: str(a.data)
Out[10]: '\x00\x00\x00\x00\x00\x00\xf8\xff'
Thus testing for nans is expensive. |
So, boolean masks? |
how about |
I think a boolean mask would have to be the way to go. Testing at the python level seems simpler than trying to test at the cython level. I'm concerned this will bog down the tree implementation, though. |
What is the drawback of checking for nan at cython level? https://github.com/astropy/astropy/blob/master/astropy/convolution/boundary_extend.pyx#L13 |
I suppose I'm wary about adding an is_nan check at every comparison. I wasn't aware that you could pass along nan values in a type-cast buffer, but you should speed test that and see how it performs. |
Or, as an initial speed test, see what happens when you add that in without the rest of the tree functionality, just to see how much of a cost it incurs. |
That is a great idea... I'll raise a PR for this issue |
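Something along these lines could serve as that standalone speed test (a rough sketch only; array sizes and repetition counts are arbitrary, and timings are machine-dependent):

```python
import timeit

setup = "import numpy as np; x = np.random.rand(1_000_000)"
plain = "np.sum(x < 0.5)"                       # plain threshold comparison
guarded = "np.sum((x < 0.5) & ~np.isnan(x))"    # same comparison plus a NaN check

print("plain comparison :", timeit.timeit(plain, setup=setup, number=100))
print("with NaN check   :", timeit.timeit(guarded, setup=setup, number=100))
```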
Note that, assuming the same interface as ... |
What is the drawback of checking for nan at cython level?
It's probably slow at the CPU level.
|
ok but nan is the standard way of coding for missing values. How about
handling it internally? Possibly with a data copy: the user passes nan and
then in fit the nans are replaced with a specific code that is fast while
building the tree?
|
ok but nan is the standard way of coding for missing values. How about
handling it internally?
What do you mean by standard? Missing data are in no recognized standard (e.g.
IEEE). It would be great to get them there, but currently there is no
standard, and that's what we are fighting with. NaN encodes numerical errors,
not missing data. R has a specific nan to encode missing data (there are
many different nans possible in the IEEE standard).
Possibly with a data copy: the user passes nan and then in fit the
nans are replaced with a specific code that is fast while building the
tree?
That's OK with me. I would even say that, as in the Imputer, we should
have an argument called "missing_values", that can be None (no
imputation), or a value, or "NaN". Internally we transform it into
something more convenient.
|
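A rough sketch of what such a missing_values argument could translate to before the tree-building code runs (the helper name encode_missing and the mask-based internal representation are assumptions, not an agreed design):

```python
import numpy as np

def encode_missing(X, missing_values="NaN"):
    """Map the user-facing missing marker to one internal representation.

    missing_values=None  -> refuse missing values entirely
    missing_values="NaN" -> treat NaN entries as missing
    otherwise            -> treat entries equal to the given value as missing
    """
    X = np.asarray(X, dtype=np.float64).copy()
    if missing_values is None:
        if np.isnan(X).any():
            raise ValueError("missing values are not allowed")
        return X, None
    if missing_values == "NaN":
        mask = np.isnan(X)
    else:
        mask = X == missing_values
    # Hand the splitter a boolean mask so it never has to test for NaN itself.
    return X, mask
```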
ok with me.
how does pandas handle it?
|
http://pandas.pydata.org/pandas-docs/stable/missing_data.html discusses how pandas handles missing values from a user point of view. The page is very informative and starts with a "Note: ..." |
The most important question is: how do they do it internally? AFAIK,
testing for nan is much more costly than testing for equality. Thus for
an implementation like trees, which is very tuned for speed, we might
want to avoid that.
|
@GaelVaroquaux It seems that pandas always uses IEEE 754 NaNs. The missing value for numeric types is always the float NaN. For boolean and integer types, introducing a NaN value will promote the column to float type. I think comparisons should be very fast in that case. In C, testing if a float is NaN is done using the isnan() function. You can alternatively just test if x != x, since NaN is the only value that compares unequal to itself. |
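For illustration, the self-inequality property that such a check relies on (plain Python, outside the tree code):

```python
import math

x = float("nan")
# NaN is the only float value that compares unequal to itself,
# so `x != x` is equivalent to an isnan() test.
assert math.isnan(x)
assert x != x
assert not (0.5 != 0.5)
```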
@GaelVaroquaux I think there are only 2 representations for NaN: sNaN (signalling NaN, for unexpected results from computations), which is your second example, and qNaN (quiet NaN), your first example... If what I understand from the tests below is correct, I don't think there will be a performance issue:
>>> nan_places = [(np.random.randint(1000), np.random.randint(5)) for i in range(100)]
>>> a = np.random.random_sample((1000, 5))
>>> a[nan_places] = np.inf
>>> %timeit np.isinf(a)
100000 loops, best of 3: 7.94 µs per loop
>>> %timeit np.isnan(a)
100000 loops, best of 3: 2.99 µs per loop
>>> a = np.random.random_sample((1000, 5))
>>> a[nan_places] = np.nan
>>> %timeit np.isinf(a)
100000 loops, best of 3: 7.93 µs per loop
>>> %timeit np.isnan(a)
100000 loops, best of 3: 3 µs per loop |
have a look at:
scikit-learn-contrib/py-earth#88
|
Well, it's a way of implementing #3 that I quoted, so I believe it is
studied in their paper.
…On Tue, 18 Jul 2017, 16:25 Jon Crall, ***@***.***> wrote:
@PetterS <https://github.com/petters> From a theoretical perspective
globally picking +/- inf seems to be an ad-hoc approach to addressing the
issue of missing data. While it is not tested in Ding and Simonoff's paper,
my intuition says it would not be competitive with the strategies they
evaluate.
@raghavrv <https://github.com/raghavrv> I have some real data that
contains missing values that I'm sending to a RandomForestClassifier that
uses your implementation of the separate class method. Is there an
experiment I could do to help show this method is useful?
For more details on my data and objective: The problem is individual
animal verification. The goal is to classify a pair of images (cropped to a
single animal) in one of three categories: the same *individual* animal,
different animals, or if there is not enough information to tell. The
feature vector I'm using is constructed to represent a pair of images. Part
of this feature is derived from the image EXIF data, which is often missing.
|
@PetterS Ah, I misunderstood your first post. I thought you were replacing all nans in a column with either + or - inf. You're actually duplicating the feature dimension (potentially doubling the dimensionality of the feature vector) and, as you said, simultaneously replacing nan with +inf in one dimension (call it A) and -inf in the other (call it B). If you could enforce that when dimension sampling chooses B it also chooses A and vice versa, I believe it would be equivalent to the separate-class method. At each node the test would amount to either choosing A as the feature dimension and sending all nans to the right, or choosing B and sending all nans to the left. Different choices could be made at different nodes, so I'm fairly convinced you are correct. However, doubling the size of your feature vector is potentially a high price to pay. In addition, you mentioned the problem that A and B will not always be available to a node simultaneously. So, I still think the nan check at every node is the more appropriate implementation, but I would be interested in seeing these two implementations compared. |
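For concreteness, a small sketch of the duplication trick being discussed (the helper name duplicate_with_inf is made up for this example):

```python
import numpy as np

def duplicate_with_inf(X):
    """For every feature, add a copy; encode missing values as +inf in the
    original column and -inf in the copy, so a split on either column can
    push all missing samples to one side or the other."""
    X = np.asarray(X, dtype=np.float64)
    A, B = X.copy(), X.copy()
    nan_mask = np.isnan(X)
    A[nan_mask] = np.inf    # missing samples fall above any finite threshold
    B[nan_mask] = -np.inf   # missing samples fall below any finite threshold
    return np.hstack([A, B])  # doubles the number of features
```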
Is there any update on this open issue? |
Any update on this? This is a crucial feature imo |
Yeah, I also believe this is important. I'm no longer working on the project that required this, but I'm still interested in seeing it added to sklearn for feature completeness. |
Agree with above. All the boosting libraries have the ability to handle nulls natively. This should be a feature in sklearn for 2021. There should be at least the option to handle them. |
Currently our boosting estimators, HistGradientBoostingClassifier and HistGradientBoostingRegressor, already handle missing values natively. |
Agree with folks above, this is a critical feature! |
Does anyone know what is the status of this? |
It's abandoned. But I would love to see it picked up again. |
Agree. Missing value handling has been available in gradient boosting libraries for years. LightGBM's random forest implementation has also allowed for missing values. Missing values should go to the node that reduces impurity etc. the most. |
I opened a PR #23595 as the first step to getting missing value support in random forest. The PR adds missing value support for decision trees and for ... |
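For reference, a minimal usage sketch, assuming a scikit-learn release in which that PR has been merged so the dense splitter accepts NaN in the input:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [np.nan], [2.0], [np.nan], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fitting and predicting with NaN entries, no Imputer step required.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[np.nan], [2.5]]))
```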
So this would apply to AdaBoosted Trees and GradientBoosting models too? That's awesome |
First and foremost, @thomasjpfan, a huge thumbs up for initiating this PR. It's invigorating to see someone finally addressing an issue that has lingered in discussions for years. Your courage and initiative in taking the first step towards incorporating missing value support in scikit-learn trees, starting with decision trees, is much appreciated. My professor and I have had extensive conversations about the implementation details, and we are both a little confused about certain aspects. Therefore, we definitely may be in error; hence, clarifications are greatly appreciated:
Your clarifications would be extremely valuable, as they would provide a clearer picture and possibly foster a wider community consensus. This feature's absence from scikit-learn has been a long-standing deficiency, as others have mentioned, and while your contribution is a monumental step forward (from my personal point of view), a deeper comprehension of the implementation choices would aid in gaining broader support and confidence in the approach (more precisely in the research direction) 🔬 Thank you so much for your dedication and the tremendous effort you've put into this initiative. Looking forward to your insights! Cheers! 🚀 |
1. Choice of Technique: Your chosen procedure appears to be consistent with "Find the optimal split by sending the missing-valued samples to either side and selecting the direction that results in the greatest reduction in entropy (impurity)". Could you explain the rationale or the empirical or theoretical foundations that led to the selection of this method? Are there any particular papers or resources we could consult to better comprehend the reasoning and evidence supporting this decision?
This is the approach used by XGBoost.
In terms of theoretical analysis, you can find one in section 5 of https://arxiv.org/abs/1902.06931, and section 6 compares multiple split strategies empirically.
Another, more extensive, empirical study of missing values shows the benefit of handling them in trees: https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac013/6568998
|
Closing this issue since the feature has been implemented. |
FYI #27966 will introduce native support for missing values in the ... One thing I noticed though as I went through the PR is that the current codebase still does not support missing values in the sparse splitter. I think this might be pretty easy to add, but should we technically re-open this issue? Xref: #5870 (comment) |
I like your optimism :) I assume that we can open a new issue instead of reopening this old one where the discussion could be outdated. |
So far, we have been using our preprocessing.Imputer to take care of missing values before fitting the RandomForest. Ref: this example

Proposal: It would be beneficial to have the missing values taken care of by all the tree-based classifiers / regressors natively...

Have a missing_values variable which will be either None (to raise an error when we run into MVs) or an int/nan value that will act as the placeholder for missing values.

Different ways in which missing values can be handled (I might have naively added duplicates) -
1. (-1-1) Add an optional imputation variable, which can be either 'mean', 'median', or 'most_frequent' (or missing_value?) and let the clf construct the Imputer on the fly... or pass an Imputer object (like we do for scoring or cv).
This was the simplest approach. Variants are 5 and 6. Note that we can do this already using a pipeline of an imputer followed by the random forest.

2. (-1+1) Ignore the missing values at the time of generating the splits.

3. (+1+1) Find the best split by sending the missing-valued samples to either side and choosing the direction that brings about the maximum reduction in the entropy (impurity).
This is Gilles' suggestion. This is conceptually the same as the "separate-class method", where the missing values are considered as a separate categorical value. This is considered by Ding and Simonoff's paper to be the best method in different situations.

4. As done in rpart, we could use surrogate variables, where the strategy is to basically use the other features to decide the split if one feature goes missing...

5. Probabilistic method where the missing values are sent to both children, but are weighted with a proportion of the number of non-missing values in each split. Ref Ding and Simonoff's paper.
I think this goes something along the lines of this example (see the sketch after this list): [1, 2, 3, nan, nan, nan, nan], [0, 0, 1, 1, 1, 1, 0] -
* Split with the available values is L --> [1, 2], R --> [3]
* Weights for the last 4 missing-valued samples are L --> 2/3, R --> 1/3

6. Do imputation considering it as a supervised learning problem in itself, as done in MissForest. Build a model using the available data --> predict the missing values using this built model.

7. Impute the missing values using an inaccurate estimate (say using the median imputation strategy). Build a RF on the completed data and update the missing values of each sample by the weighted mean value using proximity-based methods. Repeat this until convergence. (Refer Gilles' PhD Section 4.4.4)

8. Similar to 6, but a one-step method, where the imputation is done using the median of the k-nearest neighbors. Refer this airbnb blog.

9. Use ternary trees instead of binary trees, with one branch dedicated to missing values? (Refer Gilles' PhD Section 4.4.4)
This, I think, is conceptually similar to 4.
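A minimal sketch of the fractional weights in option 5's toy example above (illustrative only):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, np.nan, np.nan, np.nan, np.nan])
threshold = 2.5
observed = ~np.isnan(x)

left_obs = np.sum(x[observed] <= threshold)    # 2 samples: [1, 2]
right_obs = np.sum(x[observed] > threshold)    # 1 sample:  [3]

# Each missing-valued sample is sent to both children with these weights.
w_left = left_obs / (left_obs + right_obs)     # 2/3
w_right = right_obs / (left_obs + right_obs)   # 1/3
print(w_left, w_right)
```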
NOTE: [Figure taken from Ding and Simonoff's paper, showing the performance of various missing-value methods.]
CC: @agramfort @GaelVaroquaux @glouppe @arjoly