[RFC] Missing values in RandomForest #5870


Closed
raghavrv opened this issue Nov 17, 2015 · 53 comments

@raghavrv
Member

So far, we have been using our preprocessing.Imputer to take care of missing values before fitting the RandomForest. Ref: this example


Proposal: It would be beneficial to have missing values handled natively by all the tree-based classifiers / regressors...


Have a missing_values parameter which will be either None (to raise an error when we run into missing values) or an int/NaN value that will act as the placeholder for missing values.

Different ways in which missing values can be handled (I might have naively added duplicates) -

  1. (-1-1) Add an optional imputation variable, where we can either

    • Specify the strategy 'mean', 'median', 'most_frequent' (or missing_value?) and let the clf construct the Imputer on the fly...
    • Pass a built Imputer object (like we do for scoring or cv)
      This is the simplest approach. Variants are 5 and 6.

    Note that we can already do this with a pipeline of an imputer followed by the random forest (see the sketch after this list).

  2. (-1+1) Ignore the missing values at the time of generating the splits.

  3. (+1+1) Find the best split by sending the missing-valued samples to either side and choosing the direction that brings about the maximum reduction in entropy (impurity).

    This is Gilles' suggestion. It is conceptually the same as the "separate-class method", where the missing values are considered as a separate categorical value. Ding and Simonoff's paper considers this to be the best method across a range of situations.

  4. As done in rpart, we could use surrogate variables, where the strategy is basically to use the other features to decide the split if one feature goes missing...

  5. Probabilistic method where the missing values are sent to both children, but are weighted in proportion to the number of non-missing values in each child. Ref Ding and Simonoff's paper.

    I think this goes something along the lines of this example:
    [1, 2, 3, nan, nan, nan, nan], [0, 0, 1, 1, 1, 1, 0] -
    * Split with the available values is L --> [1, 2], R --> [3]
    * Weights for the last 4 missing-valued samples are L --> 2/3, R --> 1/3

  6. Do imputation considering it as a supervised learning problem in itself, as done in MissForest. Build using available data --> Predict the missing values using this built model.

  7. Impute the missing values using an inaccurate estimate (say, using the median imputation strategy). Build a RF on the completed data and update the missing values of each sample by the weighted mean value using proximity-based methods. Repeat this until convergence. (Refer to Gilles' PhD, Section 4.4.4.)

  8. Similar to 6, but a one-step method where the imputation is done using the median of the k-nearest neighbors. Refer to this Airbnb blog.

  9. Use ternary trees instead of binary trees, with one branch dedicated to missing values? (Refer to Gilles' PhD, Section 4.4.4.)

    This, I think, is conceptually similar to 4.
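
A minimal sketch of the pipeline route mentioned in 1, written against the current API (sklearn.impute.SimpleImputer is the present-day equivalent of the preprocessing.Imputer referenced above):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])
y = np.array([0, 0, 1, 1])

# Impute the missing values (here with the per-column median), then fit the forest.
clf = make_pipeline(
    SimpleImputer(missing_values=np.nan, strategy="median"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(X, y)
print(clf.predict([[np.nan, 4.0]]))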

NOTE:

  • 4, 7, 8, 9 are computationally intensive.
  • 5 is not easy to do with our current API
  • 3, 6 seem promising. I will implement 3 and see if I can extend that to 6 later
  • Gilles' -1s were for 1 and 2 (the rest were added later)
  • Ding and Simonoff's paper, which compares various methods and their relative accuracy, is a good reference.

Taken from Ding and Simonoff's paper: the performance of various missing-value methods.


CC: @agramfort @GaelVaroquaux @glouppe @arjoly

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 17, 2015 via email

@raghavrv
Member Author

Sorry for my lack of clarity... that is what I meant by the second approach... :)

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 17, 2015 via email

@glouppe
Contributor

glouppe commented Nov 17, 2015

Indeed, @GaelVaroquaux is correct. There are built-in ways to handle missing values when building a tree. I am -1 for the imputation-based proposals (1 and 2).

In my opinion, a nice way of handling missing values is to find the best child to which the missing values should be sent. More specifically, a split would be composed of three pieces of information (instead of two as currently in master):

  • feature id
  • cut-off threshold
  • direction for missing values (i.e. left or right).

However, to implement this, the search for the best split has to be extended to handle missing values (by considering all splits when missing values are put left + all splits when missing values are put right), which is a non-trivial change.
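
A rough, pure-Python sketch of that extended search, for illustration only (it is not the actual Cython splitter, and the helper names here are made up): each candidate threshold is evaluated twice, once sending the missing-valued samples left and once right, and the better direction is recorded alongside the feature and threshold.

import numpy as np

def gini(y):
    # Gini impurity of a label array (an empty child counts as pure).
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split_with_missing(x, y):
    # Return (threshold, missing_go_left, weighted_impurity) for one feature.
    missing = np.isnan(x)
    x_obs, y_obs, y_mis = x[~missing], y[~missing], y[missing]
    best = (None, None, np.inf)
    for thr in np.unique(x_obs)[:-1]:          # candidate cut-offs
        left, right = y_obs[x_obs <= thr], y_obs[x_obs > thr]
        for go_left in (True, False):          # try both directions for the NaNs
            l = np.concatenate([left, y_mis]) if go_left else left
            r = right if go_left else np.concatenate([right, y_mis])
            imp = (len(l) * gini(l) + len(r) * gini(r)) / len(y)
            if imp < best[2]:
                best = (thr, go_left, imp)
    return best

print(best_split_with_missing(np.array([1., 2., 3., np.nan, np.nan]),
                              np.array([0, 0, 1, 1, 1])))   # -> (2.0, False, 0.0)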

@glouppe
Contributor

glouppe commented Nov 17, 2015

Am I wrong, but I thought that random forest (or trees in general) could
be made to naturally deal with missing data in one sample by ignoring
this missing data when doing the split criterion.

This would indeed work for finding a split, but it would still not work at prediction time: you need to decide where to go when you encounter a split defined on a feature that is missing for a given sample.

@lesshaste

It might be helpful to compare how rpart from R supports missing values: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf . An explicit (if perhaps not that up-to-date) discussion of how to handle missing values at prediction time can also be found at https://archive.nyu.edu/bitstream/2451/27813/2/CPP-06-07.pdf .

Slightly more up to date and still relevant is http://www.jmlr.org/papers/volume11/ding10a/ding10a.pdf .

@raghavrv
Member Author

@lesshaste Thanks a lot for the links... I'll look into it :)

@jmschrei
Member

A concern I have is how to handle missing values themselves. How do you handle missing values in a buffer of doubles?

@agramfort
Member

np.nan ?

On 24 nov. 2015, at 21:31, Jacob Schreiber notifications@github.com wrote:

A concern I have is how to handle missing values themselves. How do you handle missing values in a buffer of doubles?



@jmschrei
Member

How is np.nan represented at the buffer level, when the GIL is released?

@GaelVaroquaux
Member

np.nan ?

I don't think that it works, as there are several ways a NaN can be represented as a floating-point value (that's the IEEE standard):

In [5]: a = np.zeros(1)*np.nan
In [6]: a
Out[6]: array([ nan])
In [7]: str(a.data)
Out[7]: '\x00\x00\x00\x00\x00\x00\xf8\x7f'
In [8]: a = np.zeros(1) / 0
/usr/bin/ipython:1: RuntimeWarning: invalid value encountered in divide
  #! /usr/bin/python
In [9]: a
Out[9]: array([ nan])
In [10]: str(a.data)
Out[10]: '\x00\x00\x00\x00\x00\x00\xf8\xff'

Thus testing for nans is expensive.

@vene
Member

vene commented Nov 24, 2015

So, boolean masks?

@raghavrv
Member Author

how about np.inf?

@jmschrei
Member

I think a boolean mask would have to be the way to go. Testing at the python level seems simpler than trying to test at the cython level. I'm concerned this will bog down the tree implementation, though.

@raghavrv
Member Author

What is the drawback of checking for nan at cython level?

https://github.com/astropy/astropy/blob/master/astropy/convolution/boundary_extend.pyx#L13

@jmschrei
Member

I suppose I'm wary about adding an is_nan check at every comparison. I wasn't aware that you could pass along NaN values in a type-cast buffer, but you should speed-test that and see how it performs.

@jmschrei
Member

Or, as an initial speed test, see what happens when you add that in without the rest of the tree functionality, just to see how much of a cost it incurs.

@raghavrv
Member Author

see what happens when you add that in without the rest of the tree functionality, just to see how much of a cost it incurs.

That is a great idea... I'll raise a PR for this issue in a day or two (nope, I never do it when I say I will ;( ) and will add a benchmark for that...

@glouppe
Contributor

glouppe commented Nov 25, 2015

Note that, assuming the same interface as Imputer, NaN may not necessarily be the placeholder for missing values, but something that the user defines (e.g., -1).
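
In other words, the mask construction has to follow the user-defined placeholder, roughly along these lines (missing_mask is a hypothetical helper, not scikit-learn API):

import numpy as np

def missing_mask(X, missing_values=np.nan):
    # Boolean mask of the missing entries for a user-defined placeholder.
    if np.isnan(missing_values):
        return np.isnan(X)        # NaN != NaN, so an equality test cannot be used here
    return X == missing_values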

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 25, 2015 via email

@agramfort
Member

agramfort commented Nov 25, 2015 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 25, 2015 via email

@agramfort
Member

agramfort commented Nov 25, 2015 via email

@lesshaste

http://pandas.pydata.org/pandas-docs/stable/missing_data.html discusses how pandas handles missing values from a user point of view. The page is very informative and starts with

"Note:
The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons. It differs from the MaskedArray approach of, for example, scikits.timeseries. We are hopeful that NumPy will soon be able to provide a native NA type solution (similar to R) performant enough to be used in pandas. "

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 25, 2015 via email

@lesshaste

@GaelVaroquaux It seems that pandas always uses IEEE 754 NaNs. The missing value for numeric types is always the float NaN. For boolean and integer types, introducing a NaN value will promote the column to float type. I think comparisons should be very fast in that case.

In C, testing whether a float is NaN is done using the isnan() function. You can alternatively just test if x != x, assuming you are using a C99-compliant compiler.
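
The same check expressed with NumPy, just for illustration (NaN is the only float value that compares unequal to itself):

import numpy as np

x = np.array([1.0, np.nan, 3.0])
print(x != x)       # [False  True False] -- True exactly where the NaNs are
print(np.isnan(x))  # the same mask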

@raghavrv
Member Author

there are many different nans possible in the IEEE standard

@GaelVaroquaux I think there are only 2 kinds of NaN: sNaN (signaling NaN, for unexpected results from computations), which is your second example, and qNaN (quiet NaN), your first example...

If what I understand from the tests below is correct, I don't think there will be a performance issue, as np.isnan is much faster than np.isinf?

>>> nan_places = [(np.random.randint(1000), np.random.randint(5)) for i in range(100)]
>>> a = np.random.random_sample((1000, 5))
>>> a[nan_places] = np.inf
>>> %timeit np.isinf(a)
100000 loops, best of 3: 7.94 µs per loop
>>> %timeit np.isnan(a)
100000 loops, best of 3: 2.99 µs per loop
>>> a = np.random.random_sample((1000, 5))
>>> a[nan_places] = np.nan
>>> %timeit np.isinf(a)
100000 loops, best of 3: 7.93 µs per loop
>>> %timeit np.isnan(a)
100000 loops, best of 3: 3 µs per loop

@agramfort
Member

agramfort commented Nov 28, 2015 via email

@PetterS

PetterS commented Jul 18, 2017 via email

@Erotemic
Contributor

@PetterS Ah, I misunderstood your first post. I thought you were replacing all NaNs in a column with either +inf or -inf. You're actually duplicating the feature dimension (potentially doubling the dimensionality of the feature vector) and, as you said, simultaneously replacing NaN with +inf in one dimension (call it A) and -inf in the other (call it B).

If you could enforce that when dimension sampling chooses B it also chooses A and vice versa, I believe it would be equivalent to the separate-class method. At each node the test would amount to either choosing A as the feature dimension and sending all NaNs to the right, or choosing B and sending all NaNs to the left. Different choices could be made at different nodes, so I'm fairly convinced you are correct.

However, doubling the size of your feature vector is potentially a high price to pay. In addition, you mentioned the problem that A and B will not always be available to a node simultaneously. So I still think the NaN check at every node is the more appropriate implementation, but I would be interested in seeing these two implementations compared.
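
A small sketch of the feature-duplication trick being discussed (duplicate_with_inf is a made-up helper, not part of scikit-learn): each column is duplicated, with NaN replaced by +inf in copy A and by -inf in copy B, so any threshold on A sends the missing samples right while any threshold on B sends them left.

import numpy as np

def duplicate_with_inf(X):
    # Duplicate every column: NaN -> +inf in copy A, NaN -> -inf in copy B.
    A = np.where(np.isnan(X), np.inf, X)
    B = np.where(np.isnan(X), -np.inf, X)
    return np.hstack([A, B])

X = np.array([[1.0, np.nan],
              [np.nan, 3.0],
              [2.0, 4.0]])
print(duplicate_with_inf(X))
# [[  1.  inf   1. -inf]
#  [ inf   3. -inf   3.]
#  [  2.   4.   2.   4.]]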

@franktoffel

Is there any update on this open issue?

@Laksh1997

Any update on this? This is a crucial feature imo

@Erotemic
Contributor

Erotemic commented Dec 2, 2020

Yeah, I also believe this is important. I'm no longer working on the project that required this, but I'm still interested in seeing it added to sklearn for feature completeness.

@onacrame

onacrame commented Jan 10, 2021

Agree with the above. All the boosting libraries have the ability to handle nulls natively; this should be a feature in sklearn for 2021. There should at least be the option to handle them.

@thomasjpfan
Member

Currently, our boosting estimators HistGradientBoostingRegressor and HistGradientBoostingClassifier can handle missing values natively.
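
For reference, a minimal example of that native support; NaN is the value these estimators treat as missing by default (this assumes scikit-learn >= 0.24; earlier releases needed the sklearn.experimental enable import):

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([0, 0, 1, 1])

clf = HistGradientBoostingClassifier().fit(X, y)  # no imputation step needed
print(clf.predict([[np.nan, 2.5]]))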

@graham-pendragon

Agree with folks above, this is a critical feature!

@kennethleungty

Does anyone know what is the status of this?

@Erotemic
Contributor

It's abandoned. But I would love to see it picked up again.

@mik3githubber

Agree. Missing-value handling has been available in gradient boosting libraries for years. LightGBM's random forest implementation has also allowed for missing values. Missing values should go to the child that reduces impurity the most, etc.

@thomasjpfan
Member

thomasjpfan commented Jun 26, 2022

I opened a PR #23595 as the first step to getting missing value support in random forest. The PR adds missing value support for decision trees and for splitter="best" and dense data to keep the PR smaller and easier to review. Once that PR is merged, follow up PRs will add missing value support for the random splitter, sparse data and ultimately enabling it for random forest.

@Permafacture

So this would apply to AdaBoosted Trees and GradientBoosting models too? That's awesome

@simonprovost

I opened a PR #23595 as the first step to getting missing value support in random forest. The PR adds missing value support for decision trees and for splitter="best" and dense data to keep the PR smaller and easier to review. Once that PR is merged, follow up PRs will add missing value support for the random splitter, sparse data and ultimately enabling it for random forest.

First and foremost, @thomasjpfan, a huge thumbs up for initiating this PR! It's invigorating to see someone finally addressing an issue that has lingered in discussions for years. Your courage and initiative in taking the first step towards incorporating missing-value support in scikit-learn trees, starting with decision trees and splitter="best" for dense data, is highly commendable and much appreciated!

My professor and I have had extensive conversations about the implementation details, and we are both a little confused about certain aspects. We may well be in error, so clarifications are greatly appreciated:

  1. Choice of technique: Your chosen procedure appears to be consistent with option 3 above ("find the best split by sending the missing-valued samples to either side and selecting the direction that results in the greatest reduction in entropy/impurity"). Could you explain the rationale, or the empirical or theoretical foundations, that led to the selection of this method? Are there any particular papers or resources we could consult to better understand the reasoning and evidence supporting this decision?

  2. Efficiency concerns: The decision to compute the information gain by grouping the values below the splitting point together with the missing ones against the values above the split (a two-group splitting decision) raises questions about computational efficiency. Although I believe this approach is more computationally efficient than other techniques discussed previously (e.g., imputation using supervised methods), I am curious whether that was the predominant motivation. In the meantime, has your implementation shown significant increases in training time so far, or relatively little?

Your clarifications would be extremely valuable, as they would provide a clearer picture and possibly foster a wider community consensus. This feature's absence from scikit-learn has been a long-standing deficiency, as others have noted, and while your contribution is a monumental step forward (from my personal point of view), a deeper understanding of the implementation choices would help build broader support and confidence in the approach (more precisely, in the research direction) 🔬

Thank you so much for your dedication and the tremendous effort you've put into this initiative. Looking forward to your insights!

Cheers! 🚀

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 16, 2023 via email

@glemaitre
Member

Closing this issue since the feature has been implemented.
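
For readers arriving here now, a minimal sketch of what the implemented support looks like, assuming a recent scikit-learn release (missing-value support landed for decision trees around 1.3 and for random forests around 1.4, for dense input and the default splitter):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])
y = np.array([0, 0, 1, 1])

# NaNs are handled natively at both fit and predict time; no imputation step needed.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[np.nan, 4.0]]))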

@adam2392
Member

adam2392 commented Jul 8, 2024

FYI #27966 will introduce native support for missing values in the ExtraTree* models (i.e. random splitter).

One thing I noticed though as I went through the PR is that the current codebase still does not support missing values in the sparse splitter. I think this might be pretty easy to add, but should we re-open this issue technically?

Xref: #5870 (comment)

@glemaitre
Member

I think this might be pretty easy to add, but should we re-open this issue technically?

I like your optimism :)

I assume that we can open a new issue instead of reopening this old one where the discussion could be outdated.
