[RFC] Missing values in RandomForest #5870
Am I wrong, or could random forests (or trees in general) be made to
naturally deal with missing data in a sample by ignoring the missing
data when computing the split criterion?
|
Sorry for my lack of clarity... that is what I meant by the second approach... :) |
Sorry for my lack of clarity... that is what I meant by the second approach...
:)
OK. That's the one that I am interested in.
|
Indeed, @GaelVaroquaux is correct. There are built-in ways to handle missing values when building a tree. I am -1 for the proposed approaches. In my opinion, a nice way of handling missing values is to find the best child into which the missing values should be put. More specifically, a split would be composed of three pieces of information (instead of two as currently in master): the feature to split on, the threshold, and the child (left or right) to which the missing-valued samples are sent.
However, to implement this, the search for the best split has to be extended to handle missing values (by considering all splits when missing values are put left + all splits when missing values are put right), which is a non-trivial change. |
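A minimal sketch of that extended search for a single feature, in plain Python/NumPy (the function names and the exhaustive double loop are purely illustrative, not the Cython splitter):

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer label array."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split_with_missing(x, y):
    """For a single feature x (with NaNs) and labels y, try every threshold
    and both choices of where the missing-valued samples go; return
    (threshold, send_missing_left, weighted_impurity)."""
    missing = np.isnan(x)
    x_filled = np.where(missing, 0.0, x)   # placeholder values, masked out below
    best = (None, None, np.inf)
    for t in np.unique(x[~missing]):
        obs_left = ~missing & (x_filled <= t)
        obs_right = ~missing & (x_filled > t)
        for send_missing_left in (True, False):
            if send_missing_left:
                left, right = obs_left | missing, obs_right
            else:
                left, right = obs_left, obs_right | missing
            score = (left.sum() * gini(y[left]) +
                     right.sum() * gini(y[right])) / len(y)
            if score < best[2]:
                best = (t, send_missing_left, score)
    return best
```

Recording which side the missing values were sent to for the winning threshold is exactly the third piece of information the split would have to store.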
This would indeed work for finding a split, but it would still not work at prediction time: you need to decide where to go when you encounter a split defined on a feature that is missing for a sample. |
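At prediction time the direction stored with the split resolves that: a minimal sketch, assuming a hypothetical node structure that carries a missing_goes_left flag:

```python
import math

def predict_one(node, sample):
    """Walk a (hypothetical) tree whose internal nodes store
    feature, threshold, missing_goes_left, left, right,
    and whose leaves store value."""
    while node.left is not None:          # internal node
        v = sample[node.feature]
        if math.isnan(v):
            node = node.left if node.missing_goes_left else node.right
        elif v <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value
```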
I don't know if it might be helpful to compare how [...]. Slightly more up to date and still relevant is http://www.jmlr.org/papers/volume11/ding10a/ding10a.pdf . |
@lesshaste Thanks a lot for the links... I'll look into it :) |
A concern I have is how to handle missing values themselves. How do you handle missing values in a buffer of doubles? |
np.nan ?
|
How is np.nan represented at the buffer level, when the gil is released? |
I don't think that it works, as there are several ways a nan can be represented:
In [5]: a = np.zeros(1)*np.nan
In [6]: a
Out[6]: array([ nan])
In [7]: str(a.data)
Out[7]: '\x00\x00\x00\x00\x00\x00\xf8\x7f'
In [8]: a = np.zeros(1) / 0
/usr/bin/ipython:1: RuntimeWarning: invalid value encountered in divide
  #! /usr/bin/python
In [9]: a
Out[9]: array([ nan])
In [10]: str(a.data)
Out[10]: '\x00\x00\x00\x00\x00\x00\xf8\xff'
Thus testing for nans is expensive. |
So, boolean masks? |
how about |
I think a boolean mask would have to be the way to go. Testing at the python level seems simpler than trying to test at the cython level. I'm concerned this will bog down the tree implementation, though. |
What is the drawback of checking for nan at cython level? https://github.com/astropy/astropy/blob/master/astropy/convolution/boundary_extend.pyx#L13 |
I suppose I'm wary about adding an is_nan check at every comparison. I wasn't aware that you could pass along nan values in a type-cast buffer, but you should speed test that and see how it performs. |
Or, as an initial speed test, see what happens when you add that in without the rest of the tree functionality, just to see how much of a cost it incurs. |
That is a great idea... I'll raise a PR for this issue |
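Something along these lines could serve as that standalone speed test (a rough sketch only; array sizes and repetition counts are arbitrary, and timings are machine-dependent):

```python
import timeit

setup = "import numpy as np; x = np.random.rand(1_000_000)"
plain = "np.sum(x < 0.5)"                       # plain threshold comparison
guarded = "np.sum((x < 0.5) & ~np.isnan(x))"    # same comparison plus a NaN check

print("plain comparison :", timeit.timeit(plain, setup=setup, number=100))
print("with NaN check   :", timeit.timeit(guarded, setup=setup, number=100))
```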
Note that, assuming the same interface as ... |
What is the drawback of checking for nan at cython level?
It's probably slow at the CPU level.
|
ok but nan is the standard way of coding for missing values. How about
handling it internally? Possibly with a data copy: the user passes nan and
then in fit the nans are replaced with a specific code that is fast while
building the tree?
|
ok but nan is the standard way of coding for missing values. How about
handling it internally?
What do you mean by standard? Missing data are in no recognized standard (e.g.
IEEE). It would be great to get them there, but currently there is no
standard, and that's what we are fighting with. NaN encodes numerical errors,
not missing data. R has a specific nan to encode missing data (there are
many different nans possible in the IEEE standard).
Possibly with a data copy: the user passes nan and then in fit the
nans are replaced with a specific code that is fast while building the
tree?
That's OK with me. I would even say that, as in the Imputer, we should
have an argument called "missing_values", that can be None (no
imputation), or a value, or "NaN". Internally we transform it into
something more convenient.
|
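A rough sketch of what such a missing_values argument could translate to before the tree-building code runs (the helper name encode_missing and the mask-based internal representation are assumptions, not an agreed design):

```python
import numpy as np

def encode_missing(X, missing_values="NaN"):
    """Map the user-facing missing marker to one internal representation.

    missing_values=None  -> refuse missing values entirely
    missing_values="NaN" -> treat NaN entries as missing
    otherwise            -> treat entries equal to the given value as missing
    """
    X = np.asarray(X, dtype=np.float64).copy()
    if missing_values is None:
        if np.isnan(X).any():
            raise ValueError("missing values are not allowed")
        return X, None
    if missing_values == "NaN":
        mask = np.isnan(X)
    else:
        mask = X == missing_values
    # Hand the splitter a boolean mask so it never has to test for NaN itself.
    return X, mask
```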
ok with me.
how does pandas handle it?
|
http://pandas.pydata.org/pandas-docs/stable/missing_data.html discusses how pandas handles missing values from a user point of view. The page is very informative and starts with a "Note: ..." |
The most important question is: how do they do it internally? AFAIK,
testing for nan is much more costly than testing for equality. Thus for
an implementation like trees, which is very tuned for speed, we might
want to avoid that.
|
@GaelVaroquaux It seems that pandas always uses IEEE 754 NaNs. The missing value for numeric types is always the float NaN. For boolean and integer types, introducing a NaN value will promote the column to float type. I think comparisons should be very fast in that case. In C, testing if a float is NaN is done using the isnan() function. You can alternatively just test if x != x, since NaN is the only value that compares unequal to itself. |
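For illustration, the self-inequality property that such a check relies on (plain Python, outside the tree code):

```python
import math

x = float("nan")
# NaN is the only float value that compares unequal to itself,
# so `x != x` is equivalent to an isnan() test.
assert math.isnan(x)
assert x != x
assert not (0.5 != 0.5)
```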
@GaelVaroquaux I think there are only 2 representations for NaN: sNaN (signalling NaN, for unexpected results from computations), which is your second example, and qNaN (quiet NaN), your first example... If what I understand from the tests below is correct, I don't think there will be a performance issue:
>>> nan_places = [(np.random.randint(1000), np.random.randint(5)) for i in range(100)]
>>> a = np.random.random_sample((1000, 5))
>>> a[nan_places] = np.inf
>>> %timeit np.isinf(a)
100000 loops, best of 3: 7.94 µs per loop
>>> %timeit np.isnan(a)
100000 loops, best of 3: 2.99 µs per loop
>>> a = np.random.random_sample((1000, 5))
>>> a[nan_places] = np.nan
>>> %timeit np.isinf(a)
100000 loops, best of 3: 7.93 µs per loop
>>> %timeit np.isnan(a)
100000 loops, best of 3: 3 µs per loop |
have a look at:
scikit-learn-contrib/py-earth#88
|
Well, it's a way of implementing #3 that I quoted, so I believe it is
studied in their paper.
…On Tue, 18 Jul 2017, 16:25 Jon Crall, ***@***.***> wrote:
@PetterS <https://github.com/petters> From a theoretical perspective
globally picking +/- inf seems to be an ad-hoc approach to addressing the
issue of missing data. While it is not tested in Ding and Simonoff's paper,
my intuition says it would not be competitive with the strategies they
evaluate.
@raghavrv <https://github.com/raghavrv> I have some real data that
contains missing values that I'm sending to a RandomForestClassifier that
uses your implementation of the separate class method. Is there an
experiment I could do to help show this method is useful?
For more details on my data and objective: The problem is individual
animal verification. The goal is to classify a pair of images (cropped to a
single animal) in one of three categories: the same *individual* animal,
different animals, or if there is not enough information to tell. The
feature vector I'm using is constructed to represent a pair of images. Part
of this feature is derived from the image EXIF data, which is often missing.
|
@PetterS Ah, I misunderstood your first post. I thought you were replacing all nans in a column with either + or - inf. You're actually duplicating the feature dimension (potentially doubling the dimensionality of the feature vector) and, as you said, simultaneously replacing nan with +inf in one dimension (call it A) and -inf in the other (call it B). If you could enforce that when dimension sampling chooses B it also chooses A and vice versa, I believe it would be equivalent to the separate-class method. At each node the test would amount to either choosing A as the feature dimension and sending all nans to the right, or choosing B and sending all nans to the left. Different choices could be made at different nodes, so I'm fairly convinced you are correct. However, doubling the size of your feature vector is potentially a high price to pay. In addition, you mentioned the problem that A and B will not always be available to a node simultaneously. So, I still think the nan check at every node is the more appropriate implementation, but I would be interested in seeing these two implementations compared. |
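For concreteness, a small sketch of the duplication trick being discussed (the helper name duplicate_with_inf is made up for this example):

```python
import numpy as np

def duplicate_with_inf(X):
    """For every feature, add a copy; encode missing values as +inf in the
    original column and -inf in the copy, so a split on either column can
    push all missing samples to one side or the other."""
    X = np.asarray(X, dtype=np.float64)
    A, B = X.copy(), X.copy()
    nan_mask = np.isnan(X)
    A[nan_mask] = np.inf    # missing samples fall above any finite threshold
    B[nan_mask] = -np.inf   # missing samples fall below any finite threshold
    return np.hstack([A, B])  # doubles the number of features
```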
Is there any update on this open issue? |
Any update on this? This is a crucial feature imo |
Yeah, I also believe this is important. I'm no longer working on the project that required this, but I'm still interested in seeing it added to sklearn for feature completeness. |
Agree with above. All the boosting libraries have the ability to handle nulls natively. This should be a feature in sklearn for 2021. There should be at least the option to handle them. |
Currently our boosting estimators, HistGradientBoostingClassifier and HistGradientBoostingRegressor, already handle missing values natively. |
Agree with folks above, this is a critical feature! |
Does anyone know what is the status of this? |
It's abandoned. But I would love to see it picked up again. |
Agree. Missing value handling has been available in gradient boosting libraries for years. LightGBM's random forest implementation has also allowed for missing values. Missing values should go to the node that reduces impurity etc. the most. |
I opened a PR #23595 as the first step to getting missing value support in random forest. The PR adds missing value support for decision trees and for ... |
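For reference, a minimal usage sketch, assuming a scikit-learn release in which that PR has been merged so the dense splitter accepts NaN in the input:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [np.nan], [2.0], [np.nan], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fitting and predicting with NaN entries, no Imputer step required.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[np.nan], [2.5]]))
```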
So this would apply to AdaBoosted Trees and GradientBoosting models too? That's awesome |
First and foremost, @thomasjpfan, a huge thumbs up for initiating this PR. It's invigorating to see someone finally addressing an issue that has lingered in discussions for years. Your courage and initiative in taking the first step towards incorporating missing value support in scikit-learn trees, starting with decision trees, is much appreciated. My professor and I have had extensive conversations about the implementation details, and we are both a little confused about certain aspects. Therefore, we definitely may be in error; hence, clarifications are greatly appreciated:
Your clarifications would be extremely valuable, as they would provide a clearer picture and possibly foster a wider community consensus. This feature's absence from scikit-learn has been a long-standing deficiency, as others have mentioned, and while your contribution is a monumental step forward (from my personal point of view), a deeper comprehension of the implementation choices would aid in gaining broader support and confidence in the approach (more precisely in the research direction) 🔬 Thank you so much for your dedication and the tremendous effort you've put into this initiative. Looking forward to your insights! Cheers! 🚀 |
1. Choice of Technique: Your chosen procedure appears to be consistent with "Find the optimal split by sending the missing-valued samples to either side and selecting the direction that results in the greatest reduction in entropy (impurity)". Could you explain the rationale or the empirical or theoretical foundations that led to the selection of this method? Are there any particular papers or resources we could consult to better comprehend the reasoning and evidence supporting this decision?
This is the approach used by XGBoost.
In terms of theoretical analysis, you can find one in section 5 of https://arxiv.org/abs/1902.06931, and section 6 compares multiple split strategies empirically.
Another, more extensive, empirical study of missing values shows the benefit of handling them in trees: https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac013/6568998
|
Closing this issue since the feature has been implemented. |
FYI #27966 will introduce native support for missing values in the ... One thing I noticed though as I went through the PR is that the current codebase still does not support missing values in the sparse splitter. I think this might be pretty easy to add, but should we technically re-open this issue? Xref: #5870 (comment) |
I like your optimism :) I assume that we can open a new issue instead of reopening this old one where the discussion could be outdated. |
So far, we have been using our preprocessing.Imputer to take care of missing values before fitting the RandomForest. Ref: this example

Proposal: It would be beneficial to have the missing values taken care of by all the tree-based classifiers / regressors natively...

Have a missing_values variable which will be either None (to raise an error when we run into MVs) or an int/nan value that will act as the placeholder for missing values.

Different ways in which missing values can be handled (I might have naively added duplicates) -
1. (-1-1) Add an optional imputation variable, which can be either 'mean', 'median', or 'most_frequent' (or missing_value?) and let the clf construct the Imputer on the fly... or pass an Imputer object (like we do for scoring or cv).
This was the simplest approach. Variants are 5 and 6. Note that we can do this already using a pipeline of an imputer followed by the random forest.

2. (-1+1) Ignore the missing values at the time of generating the splits.

3. (+1+1) Find the best split by sending the missing-valued samples to either side and choosing the direction that brings about the maximum reduction in the entropy (impurity).
This is Gilles' suggestion. This is conceptually the same as the "separate-class method", where the missing values are considered as a separate categorical value. This is considered by Ding and Simonoff's paper to be the best method in different situations.

4. As done in rpart, we could use surrogate variables, where the strategy is to basically use the other features to decide the split if one feature goes missing...

5. Probabilistic method where the missing values are sent to both children, but are weighted with a proportion of the number of non-missing values in each split. Ref Ding and Simonoff's paper.
I think this goes something along the lines of this example (see the sketch after this list): [1, 2, 3, nan, nan, nan, nan], [0, 0, 1, 1, 1, 1, 0] -
* Split with the available values is L --> [1, 2], R --> [3]
* Weights for the last 4 missing-valued samples are L --> 2/3, R --> 1/3

6. Do imputation considering it as a supervised learning problem in itself, as done in MissForest. Build a model using the available data --> predict the missing values using this built model.

7. Impute the missing values using an inaccurate estimate (say using the median imputation strategy). Build a RF on the completed data and update the missing values of each sample by the weighted mean value using proximity-based methods. Repeat this until convergence. (Refer Gilles' PhD Section 4.4.4)

8. Similar to 6, but a one-step method, where the imputation is done using the median of the k-nearest neighbors. Refer this airbnb blog.

9. Use ternary trees instead of binary trees, with one branch dedicated to missing values? (Refer Gilles' PhD Section 4.4.4)
This, I think, is conceptually similar to 4.
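A minimal sketch of the fractional weights in option 5's toy example above (illustrative only):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, np.nan, np.nan, np.nan, np.nan])
threshold = 2.5
observed = ~np.isnan(x)

left_obs = np.sum(x[observed] <= threshold)    # 2 samples: [1, 2]
right_obs = np.sum(x[observed] > threshold)    # 1 sample:  [3]

# Each missing-valued sample is sent to both children with these weights.
w_left = left_obs / (left_obs + right_obs)     # 2/3
w_right = right_obs / (left_obs + right_obs)   # 1/3
print(w_left, w_right)
```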
NOTE: [Figure taken from Ding and Simonoff's paper, showing the performance of various missing-value methods.]
CC: @agramfort @GaelVaroquaux @glouppe @arjoly