[RFC] Missing values in RandomForest #5870


Closed
raghavrv opened this issue Nov 17, 2015 · 53 comments

@raghavrv
Member

So far, we have been using our preprocessing.Imputer to take care of missing values before fitting the RandomForest. Ref: this example


Proposal: It would be beneficial to have missing values handled natively by all the tree-based classifiers / regressors...


Have a missing_values parameter which will be either None (to raise an error when we run into missing values) or an int/NaN value that will act as the placeholder for missing values.

Different ways in which missing values can be handled (I might have naively added duplicates) -

  1. (-1-1) Add an optional imputation variable, where we can either

    • Specify the strategy 'mean', 'median', 'most_frequent' (or missing_value?) and let the clf construct the Imputer on the fly...
    • Pass a built Imputer object (like we do for scoring or cv)
      This is the simplest approach. Variants are 5 and 6.

    Note that we can already do this with a pipeline of an imputer followed by the random forest (see the sketch after this list).

  2. (-1+1) Ignore the missing values at the time of generating the splits.

  3. (+1+1) Find the best split by sending the missing-valued samples to either side and choosing the direction that brings about the maximum reduction in entropy (impurity).

    This is Gilles' suggestion. It is conceptually the same as the "separate-class method", where the missing values are considered as a separate categorical value. Ding and Simonoff's paper considers this to be the best method across a range of situations.

  4. As done in rpart, we could use surrogate variables, where the strategy is basically to use the other features to decide the split if one feature goes missing...

  5. Probabilistic method where the missing values are sent to both children, but are weighted in proportion to the number of non-missing values in each child. Ref Ding and Simonoff's paper.

    I think this goes something along the lines of this example:
    [1, 2, 3, nan, nan, nan, nan], [0, 0, 1, 1, 1, 1, 0] -
    * Split with the available values is L --> [1, 2], R --> [3]
    * Weights for the last 4 missing-valued samples are L --> 2/3, R --> 1/3

  6. Do imputation considering it as a supervised learning problem in itself, as done in MissForest. Build using available data --> Predict the missing values using this built model.

  7. Impute the missing values using an inaccurate estimate (say, using the median imputation strategy). Build a RF on the completed data and update the missing values of each sample by the weighted mean value using proximity-based methods. Repeat this until convergence. (Refer to Gilles' PhD, Section 4.4.4.)

  8. Similar to 6, but a one-step method where the imputation is done using the median of the k-nearest neighbors. Refer to this Airbnb blog.

  9. Use ternary trees instead of binary trees, with one branch dedicated to missing values? (Refer to Gilles' PhD, Section 4.4.4.)

    This, I think, is conceptually similar to 4.
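
A minimal sketch of the pipeline route mentioned in 1, written against the current API (sklearn.impute.SimpleImputer is the present-day equivalent of the preprocessing.Imputer referenced above):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])
y = np.array([0, 0, 1, 1])

# Impute the missing values (here with the per-column median), then fit the forest.
clf = make_pipeline(
    SimpleImputer(missing_values=np.nan, strategy="median"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(X, y)
print(clf.predict([[np.nan, 4.0]]))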

NOTE:

  • 4, 7, 8, 9 are computationally intensive.
  • 5 is not easy to do with our current API
  • 3, 6 seem promising. I will implement 3 and see if I can extend that to 6 later
  • Gilles' -1s were for 1 and 2 (the rest were added later)
  • Ding and Simonoff's paper, which compares various methods and their relative accuracy, is a good reference.

Taken from Ding and Simonoff's paper: the performance of various missing-value methods.


CC: @agramfort @GaelVaroquaux @glouppe @arjoly

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 17, 2015 via email

@raghavrv
Member Author

Sorry for my lack of clarity... that is what I meant by the second approach... :)

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 17, 2015 via email

@glouppe
Contributor

glouppe commented Nov 17, 2015

Indeed, @GaelVaroquaux is correct. There are built-in ways to handle missing values when building a tree. I am -1 for the imputation-based proposals (1 and 2).

In my opinion, a nice way of handling missing values is to find the best child to which the missing values should be sent. More specifically, a split would be composed of three pieces of information (instead of two as currently in master):

  • feature id
  • cut-off threshold
  • direction for missing values (i.e. left or right).

However, to implement this, the search for the best split has to be extended to handle missing values (by considering all splits when missing values are put left + all splits when missing values are put right), which is a non-trivial change.
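
A rough, pure-Python sketch of that extended search, for illustration only (it is not the actual Cython splitter, and the helper names here are made up): each candidate threshold is evaluated twice, once sending the missing-valued samples left and once right, and the better direction is recorded alongside the feature and threshold.

import numpy as np

def gini(y):
    # Gini impurity of a label array (an empty child counts as pure).
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split_with_missing(x, y):
    # Return (threshold, missing_go_left, weighted_impurity) for one feature.
    missing = np.isnan(x)
    x_obs, y_obs, y_mis = x[~missing], y[~missing], y[missing]
    best = (None, None, np.inf)
    for thr in np.unique(x_obs)[:-1]:          # candidate cut-offs
        left, right = y_obs[x_obs <= thr], y_obs[x_obs > thr]
        for go_left in (True, False):          # try both directions for the NaNs
            l = np.concatenate([left, y_mis]) if go_left else left
            r = right if go_left else np.concatenate([right, y_mis])
            imp = (len(l) * gini(l) + len(r) * gini(r)) / len(y)
            if imp < best[2]:
                best = (thr, go_left, imp)
    return best

print(best_split_with_missing(np.array([1., 2., 3., np.nan, np.nan]),
                              np.array([0, 0, 1, 1, 1])))   # -> (2.0, False, 0.0)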

@glouppe
Contributor

glouppe commented Nov 17, 2015

Am I wrong, but I thought that random forest (or trees in general) could
be made to naturally deal with missing data in one sample by ignoring
this missing data when doing the split criterion.

This would indeed work for finding a split, but it would still not work at prediction time: you need to decide where to go when you encounter a split defined on a feature that is missing for a given sample.

@lesshaste

It might be helpful to compare how rpart from R supports missing values: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf . An explicit (if perhaps not that up-to-date) discussion of how to handle missing values at prediction time can also be found at https://archive.nyu.edu/bitstream/2451/27813/2/CPP-06-07.pdf .

Slightly more up to date and still relevant is http://www.jmlr.org/papers/volume11/ding10a/ding10a.pdf .

@raghavrv
Member Author

@lesshaste Thanks a lot for the links... I'll look into it :)

@jmschrei
Member

A concern I have is how to handle missing values themselves. How do you handle missing values in a buffer of doubles?

@agramfort
Member

np.nan ?

On 24 nov. 2015, at 21:31, Jacob Schreiber notifications@github.com wrote:

A concern I have is how to handle missing values themselves. How do you handle missing values in a buffer of doubles?



@jmschrei
Member

How is np.nan represented at the buffer level, when the GIL is released?

@GaelVaroquaux
Member

np.nan ?

I don't think that it works, as there are several ways a NaN can be represented as a floating-point value (that's the IEEE standard):

In [5]: a = np.zeros(1)*np.nan
In [6]: a
Out[6]: array([ nan])
In [7]: str(a.data)
Out[7]: '\x00\x00\x00\x00\x00\x00\xf8\x7f'
In [8]: a = np.zeros(1) / 0
/usr/bin/ipython:1: RuntimeWarning: invalid value encountered in divide
  #! /usr/bin/python
In [9]: a
Out[9]: array([ nan])
In [10]: str(a.data)
Out[10]: '\x00\x00\x00\x00\x00\x00\xf8\xff'

Thus testing for nans is expensive.

@vene
Member

vene commented Nov 24, 2015

So, boolean masks?

@raghavrv
Member Author

how about np.inf?

@jmschrei
Member

I think a boolean mask would have to be the way to go. Testing at the python level seems simpler than trying to test at the cython level. I'm concerned this will bog down the tree implementation, though.

@raghavrv
Member Author

What is the drawback of checking for nan at cython level?

https://github.com/astropy/astropy/blob/master/astropy/convolution/boundary_extend.pyx#L13

@jmschrei
Member

I suppose I'm wary about adding an is_nan check at every comparison. I wasn't aware that you could pass along NaN values in a type-cast buffer, but you should speed-test that and see how it performs.

@jmschrei
Member

Or, as an initial speed test, see what happens when you add that in without the rest of the tree functionality, just to see how much of a cost it incurs.

@raghavrv
Member Author

see what happens when you add that in without the rest of the tree functionality, just to see how much of a cost it incurs.

That is a great idea... I'll raise a PR for this issue in a day or two (nope, I never do it when I say I will ;( ) and will add a benchmark for that...

@glouppe
Contributor

glouppe commented Nov 25, 2015

Note that, assuming the same interface as Imputer, NaN may not necessarily be the placeholder for missing values, but something that the user defines (e.g., -1).
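
In other words, the mask construction has to follow the user-defined placeholder, roughly along these lines (missing_mask is a hypothetical helper, not scikit-learn API):

import numpy as np

def missing_mask(X, missing_values=np.nan):
    # Boolean mask of the missing entries for a user-defined placeholder.
    if np.isnan(missing_values):
        return np.isnan(X)        # NaN != NaN, so an equality test cannot be used here
    return X == missing_values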

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 25, 2015 via email

@agramfort
Member

agramfort commented Nov 25, 2015 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 25, 2015 via email

@agramfort
Member

agramfort commented Nov 25, 2015 via email

@lesshaste

http://pandas.pydata.org/pandas-docs/stable/missing_data.html discusses how pandas handles missing values from a user point of view. The page is very informative and starts with

"Note:
The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons. It differs from the MaskedArray approach of, for example, scikits.timeseries. We are hopeful that NumPy will soon be able to provide a native NA type solution (similar to R) performant enough to be used in pandas. "

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 25, 2015 via email

@lesshaste

@GaelVaroquaux It seems that pandas always uses IEEE 754 NaNs. The missing value for numeric types is always the float NaN. For boolean and integer types, introducing a NaN value will promote the column to float type. I think comparisons should be very fast in that case.

In C, testing whether a float is NaN is done using the isnan() function. You can alternatively just test if x != x, assuming you are using a C99-compliant compiler.
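
The same check expressed with NumPy, just for illustration (NaN is the only float value that compares unequal to itself):

import numpy as np

x = np.array([1.0, np.nan, 3.0])
print(x != x)       # [False  True False] -- True exactly where the NaNs are
print(np.isnan(x))  # the same mask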

@raghavrv
Member Author

there are many different nans possible in the IEEE standard

@GaelVaroquaux I think there are only 2 kinds of NaN: sNaN (signaling NaN, for unexpected results from computations), which is your second example, and qNaN (quiet NaN), your first example...

If what I understand from the tests below is correct, I don't think there will be a performance issue, as np.isnan is much faster than np.isinf?

>>> nan_places = [(np.random.randint(1000), np.random.randint(5)) for i in range(100)]
>>> a = np.random.random_sample((1000, 5))
>>> a[nan_places] = np.inf
>>> %timeit np.isinf(a)
100000 loops, best of 3: 7.94 µs per loop
>>> %timeit np.isnan(a)
100000 loops, best of 3: 2.99 µs per loop
>>> a = np.random.random_sample((1000, 5))
>>> a[nan_places] = np.nan
>>> %timeit np.isinf(a)
100000 loops, best of 3: 7.93 µs per loop
>>> %timeit np.isnan(a)
100000 loops, best of 3: 3 µs per loop

@agramfort
Member

agramfort commented Nov 28, 2015 via email

@PetterS

PetterS commented Jul 18, 2017 via email

@Erotemic
Contributor

@PetterS Ah, I misunderstood your first post. I thought you were replacing all NaNs in a column with either +inf or -inf. You're actually duplicating the feature dimension (potentially doubling the dimensionality of the feature vector) and, as you said, simultaneously replacing NaN with +inf in one dimension (call it A) and -inf in the other (call it B).

If you could enforce that when dimension sampling chooses B it also chooses A and vice versa, I believe it would be equivalent to the separate-class method. At each node the test would amount to either choosing A as the feature dimension and sending all NaNs to the right, or choosing B and sending all NaNs to the left. Different choices could be made at different nodes, so I'm fairly convinced you are correct.

However, doubling the size of your feature vector is potentially a high price to pay. In addition, you mentioned the problem that A and B will not always be available to a node simultaneously. So I still think the NaN check at every node is the more appropriate implementation, but I would be interested in seeing these two implementations compared.
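
A small sketch of the feature-duplication trick being discussed (duplicate_with_inf is a made-up helper, not part of scikit-learn): each column is duplicated, with NaN replaced by +inf in copy A and by -inf in copy B, so any threshold on A sends the missing samples right while any threshold on B sends them left.

import numpy as np

def duplicate_with_inf(X):
    # Duplicate every column: NaN -> +inf in copy A, NaN -> -inf in copy B.
    A = np.where(np.isnan(X), np.inf, X)
    B = np.where(np.isnan(X), -np.inf, X)
    return np.hstack([A, B])

X = np.array([[1.0, np.nan],
              [np.nan, 3.0],
              [2.0, 4.0]])
print(duplicate_with_inf(X))
# [[  1.  inf   1. -inf]
#  [ inf   3. -inf   3.]
#  [  2.   4.   2.   4.]]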

@franktoffel

Is there any update on this open issue?

@Laksh1997

Any update on this? This is a crucial feature imo

@Erotemic
Contributor

Erotemic commented Dec 2, 2020

Yeah, I also believe this is important. I'm no longer working on the project that required this, but I'm still interested in seeing it added to sklearn for feature completeness.

@onacrame

onacrame commented Jan 10, 2021

Agree with the above. All the boosting libraries have the ability to handle nulls natively; this should be a feature in sklearn for 2021. There should at least be the option to handle them.

@thomasjpfan
Member

Currently, our boosting estimators HistGradientBoostingRegressor and HistGradientBoostingClassifier can handle missing values natively.
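
For reference, a minimal example of that native support; NaN is the value these estimators treat as missing by default (this assumes scikit-learn >= 0.24; earlier releases needed the sklearn.experimental enable import):

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([0, 0, 1, 1])

clf = HistGradientBoostingClassifier().fit(X, y)  # no imputation step needed
print(clf.predict([[np.nan, 2.5]]))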

@graham-pendragon

Agree with folks above, this is a critical feature!

@kennethleungty

Does anyone know what is the status of this?

@Erotemic
Contributor

It's abandoned. But I would love to see it picked up again.

@mik3githubber

Agree. Missing-value handling has been available in gradient boosting libraries for years. LightGBM's random forest implementation has also allowed for missing values. Missing values should go to the child that reduces impurity the most, etc.

@thomasjpfan
Member

thomasjpfan commented Jun 26, 2022

I opened a PR #23595 as the first step to getting missing value support in random forest. The PR adds missing value support for decision trees and for splitter="best" and dense data to keep the PR smaller and easier to review. Once that PR is merged, follow up PRs will add missing value support for the random splitter, sparse data and ultimately enabling it for random forest.

@Permafacture

So this would apply to AdaBoosted Trees and GradientBoosting models too? That's awesome

@simonprovost

I opened a PR #23595 as the first step to getting missing value support in random forest. The PR adds missing value support for decision trees and for splitter="best" and dense data to keep the PR smaller and easier to review. Once that PR is merged, follow up PRs will add missing value support for the random splitter, sparse data and ultimately enabling it for random forest.

First and foremost, @thomasjpfan, a huge thumbs up for initiating this PR! It's invigorating to see someone finally addressing an issue that has lingered in discussions for years. Your courage and initiative in taking the first step towards incorporating missing-value support in scikit-learn trees, starting with decision trees and splitter="best" for dense data, is highly commendable and much appreciated!

My professor and I have had extensive conversations about the implementation details, and we are both a little confused about certain aspects. We may well be in error, so clarifications are greatly appreciated:

  1. Choice of technique: Your chosen procedure appears to be consistent with option 3 above ("find the best split by sending the missing-valued samples to either side and selecting the direction that results in the greatest reduction in entropy/impurity"). Could you explain the rationale, or the empirical or theoretical foundations, that led to the selection of this method? Are there any particular papers or resources we could consult to better understand the reasoning and evidence supporting this decision?

  2. Efficiency concerns: The decision to compute the information gain by grouping the values below the splitting point together with the missing ones against the values above the split (a two-group splitting decision) raises questions about computational efficiency. Although I believe this approach is more computationally efficient than other techniques discussed previously (e.g., imputation using supervised methods), I am curious whether that was the predominant motivation. In the meantime, has your implementation shown significant increases in training time so far, or relatively little?

Your clarifications would be extremely valuable, as they would provide a clearer picture and possibly foster a wider community consensus. This feature's absence from scikit-learn has been a long-standing deficiency, as others have noted, and while your contribution is a monumental step forward (from my personal point of view), a deeper understanding of the implementation choices would help build broader support and confidence in the approach (more precisely, in the research direction) 🔬

Thank you so much for your dedication and the tremendous effort you've put into this initiative. Looking forward to your insights!

Cheers! 🚀

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 16, 2023 via email

@glemaitre
Member

Closing this issue since the feature has been implemented.
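
For readers arriving here now, a minimal sketch of what the implemented support looks like, assuming a recent scikit-learn release (missing-value support landed for decision trees around 1.3 and for random forests around 1.4, for dense input and the default splitter):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])
y = np.array([0, 0, 1, 1])

# NaNs are handled natively at both fit and predict time; no imputation step needed.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[np.nan, 4.0]]))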

@adam2392
Member

adam2392 commented Jul 8, 2024

FYI #27966 will introduce native support for missing values in the ExtraTree* models (i.e. random splitter).

One thing I noticed though as I went through the PR is that the current codebase still does not support missing values in the sparse splitter. I think this might be pretty easy to add, but should we re-open this issue technically?

Xref: #5870 (comment)

@glemaitre
Member

I think this might be pretty easy to add, but should we re-open this issue technically?

I like your optimism :)

I assume that we can open a new issue instead of reopening this old one where the discussion could be outdated.
