Balanced Random Forest #5181
Conversation
The linked paper above has several examples. See Tables 3-7 for a comparison of performance including Balanced Random Forest (BRF) and Weighted Random Forest (WRF), which corresponds to class_weight='auto'. The authors (n.b. Breiman is the creator of Random Forests) conclude:
In my experience another benefit of WRF is improved precision at the top, i.e. take the top N predictions (as determined by their predicted probability), assume them to be True, and then calculate precision. This is a common metric for resource allocation problems. BRF does very well here. Unfortunately my own examples do not use public data. Would you like me to build a public example using sklearn.datasets, or is the above paper sufficient? |
It would be great to have an example using a public dataset. |
+1 for an example. It would be great to include fit times for WRF vs BRF (to emphasize the computational advantage). Could you give an example to explain the resource allocation / precision at the top case? I am not sure I understand. Maybe you could include that in your example as well. For the example it would be great to have it run fast enough (e.g. less than 60s, ideally less than 10s). |
Hi all, sorry for the delay. I've added an example in examples/ensemble/balanced_random_forest.py. The dataset is KDD Cup '99:
The original data has a dozen or so classes but I've squashed all of the minorities into one class, which makes up 1.7% of the training set. The script downloads the compressed example data (3 MB). "Precision at k" is defined as follows: given a list of examples and a class of interest, we look at the predicted probabilities, take the k highest, mark them as members of the class, and then compute the precision of this prediction. This metric is useful when one has limited resources to investigate a particular class. For example, if we are predicting a disease and we only have 1000 vaccines, precision at 1000 would tell us what proportion of the vaccines we distribute according to our model actually went to good use. Examples of this type are abundant. Anyway, my example will print something like:
In this case I have chosen k to be the number of positive cases in the test set and have printed out both precision at k as well as AUC. As you can see, the existing 'balanced subsample' as well as my feature 'balanced' (in this particular run the former did better, but generally they are similar) do much better than either the default or class_weight='auto'. Additionally, balanced random forest runs in 1/5 of the time of the others. Thoughts? If there is interest in merging this, I would suggest:
|
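For readers who want to reproduce the "precision at k" metric described above, here is a minimal sketch (the helper name and arguments are illustrative, not taken from the PR's example script):

```python
import numpy as np

def precision_at_k(y_true, y_score, k):
    """Precision among the k samples with the highest predicted score.

    Illustrative helper, not code from the PR: y_true is a binary array for
    the class of interest, y_score the predicted probability of that class.
    """
    top_k = np.argsort(y_score)[::-1][:k]  # indices of the k highest scores
    return np.mean(y_true[top_k])          # fraction of true positives among them

# Toy usage: 2 true positives among 6 samples, k chosen as the number of positives.
y_true = np.array([0, 1, 0, 0, 1, 0])
y_score = np.array([0.1, 0.9, 0.3, 0.2, 0.8, 0.4])
print(precision_at_k(y_true, y_score, k=2))  # -> 1.0
```

In the KDD Cup example above, k is set to the number of positive cases in the test set.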
I'm not sure how I feel about modifying the dataset after it's passed to a classifier, especially when this is the major upgrade. Is it possible that a better approach would be to have a balanced dataset transformer which could be fed into the existing random forest code for the same performance? |
It's not "modifying" the dataset any more than bootstrap sampling does in the random forest already. It's just a variation on bootstrap to ensure that classes are equally represented. Unfortunately it's not possible to achieve this by passing a transformed dataset into RF. |
This looks interesting. Maybe @trevorstephens might want to comment about this pull request. |
This does look very interesting. @glouppe any opinions? It would be nice to have a couple more datasets for comparison. |
Thx for the ping @arjoly. I'll try to take a deeper look soon. But some initial thoughts...

Can this method easily be modified to support multi-output? Looks like only single output is currently coded in. At first thought, you could duplicate rows that are chosen for both outputs, or use the union of any generated subsets. Doubtful there's much literature covering such cases, but we do need to do something to support it I suppose.

Can you explain the rationale behind the "precision at 28318"? It would be nice to see a threshold/precision plot with each method overlaid on it, I think.

I am very surprised by the difference between "auto" (I think we're using "balanced" now btw) and "balanced subsample". In my trials way back when, I saw very little difference between the two. How stable are these results with different random seeds?

I generally don't use the pre-written "preset" class weights in practice, but grid search for a better weighting scheme. Thus, I think it would be good to also compare this method to grid searching over a range of manually set class_weights.

Will take a look at the code and example more later. |
http://scikit-learn.org/stable/modules/tree.html#multi-output-problems Basically, y is 2D and you predict multiple targets at once. It is supported by individual trees and random forests. |
I ran your example code this evening (the difference between the auto and balanced_subsample weights really surprised me), and this particular problem appears to be extremely unstable on a mere 100 trees. Over 20 trials:
Where the reported results are mean +/- stdev and the time taken is in seconds for all 20 trials to complete. I do think there's value here, but the metric results don't appear to be as significant as your results above, at least on this problem. The time is indeed an improvement, but might potentially be worse than the existing solutions on more balanced and/or larger datasets due to recreating the dataset at each tree. As another thought on API, this could probably be incorporated into the existing class_weight parameter. Some solution for multi-output behavior is still required, as I mentioned. |
Above results are with no random_state set, so just different seeds on each trial BTW |
Coming a bit late to the party... Am I correctly understanding that this is effectively implementing class balancing (in terms of number of samples) with a bootstrap for each class? It should be very similar to class_weight='auto' combined with bootstrapping. In any case, given this was proposed by Breiman, I am not against adding such a balancing strategy. |
I would prefer something like this too! |
Hmm actually one of the reasons I believe this PR behaves differently than class_weight="auto" + bootstrap=True is because bootstrap replicates are drawn in master without taking sample weights into account. I think we should fix that, but this will not be backward compatible... :/ @arjoly @trevorstephens what do you think? |
Arf, could it be considered a bugfix? There are probably weird situations where the current behavior is sub-optimal (for instance in the presence of mostly-zero sample weights?). |
@glouppe @arjoly The 'balanced_subsample' option calculates class weights after the bootstrap is drawn, so this is already an option to users for the "presets". That does not take into account any user-supplied sample weights though. In terms of differences between under/over sampling, and this is purely arm-waving: from deep trees I've plotted with massive imbalance and heavy class weighting, the ones I've grown tend to go way off to one side, shedding negative cases on the way down until right at the bottom you start to get positive classifications. This PR's method of undersampling the majority class may create more balanced trees, and it probably has better regularization at first thought, but you're throwing away a ton of information for each tree. I do think it's a good option to have available, to try each way. It would be pretty easy to create a bootstrap from the weighted samples: just draw floats (up to the sum of weights) instead of ints. Might be more expensive on the searchsorted though, and would have to be done for each tree. For deprecating, maybe a temporary |
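For reference, the "draw floats up to the sum of weights" idea sketched above could look roughly like this (an illustrative sketch, not code from this PR or from scikit-learn's tree code; the function name is made up):

```python
import numpy as np

def weighted_bootstrap_indices(sample_weight, n_samples, random_state):
    """Bootstrap indices drawn with probability proportional to sample_weight.

    Illustrative sketch: draw uniform floats in [0, sum(weights)) and map them
    back to sample indices with searchsorted on the cumulative weights.
    """
    cum_weights = np.cumsum(sample_weight)  # monotonically increasing
    draws = random_state.uniform(0, cum_weights[-1], size=n_samples)
    return np.searchsorted(cum_weights, draws, side="right")

rng = np.random.RandomState(0)
idx = weighted_bootstrap_indices(np.array([0.1, 0.1, 0.8]), n_samples=10, random_state=rng)
# Index 2 should appear roughly 8 times out of 10 on average.
```

As noted above, the cumulative-sum/searchsorted step would have to be repeated for every tree, which is where the extra cost would come from.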
@glouppe @arjoly I agree with @trevorstephens that this should not be

@trevorstephens while it would be nice to grid search through 'auto',

@trevorstephens regarding multi-output, I do not see a great way of handling

@trevorstephens Yes, the output is unstable with just 100 trees. I chose that

I have also changed the example to, as requested, generate a precision chart.

And here is the printed output:
I also moved the example to its own repository that includes

I would love to get this merged soon because I am getting busy with other projects, which is causing my responses to get increasingly delayed. (Thanks for your patience.) I think the only remaining item is the API. What are the thoughts on that? |
I think that this is a really interesting proposition. Training time aside, in my work I do a lot of highly imbalanced classification stuff, and alternative strategies are very welcome to explore different solutions. Support for multi-output should be incorporated though, even if it is sub-optimal. Having some parameter options available and others absent based on input shape doesn't make for a great user experience. While the multi-output case is a royal pain to support, it exists, and any PR that merely errors at its existence is unlikely to merge. I would suggest two paths forward for that:
I think #2 is more appropriate in this case, though I have not put a ton of thought into it and would defer to the resident tree huggers @pprett and @glouppe for their insight. On the API side, RF is getting pretty busy on the param side. While, perhaps, this solution is not necessarily an explicit class weighting as you mention, it is very much related to it, and it makes for a cleaner parameter listing IMO if this was moved to the existing class_weight parameter. Would love to hear some core devs' opinions on the above as well. |
@trevorstephens I finally got around to implementing multi-output balanced random forest! I didn't quite understand the ideas you listed above. What I did is to just consider, for a given output i and label j, the class as the pair (i,j) and then find the minority class in that sense. Then I do a bootstrap sample of each class at the size of that minority class (i_min, j_min). The result won't be completely balanced, as there could be a correlation between classes of different outputs. So if the minority class of the first output is correlated with a majority class of the second output, that majority class will still be the majority in the balanced sample. Does this work for you? I included an example in brf_multioutput.py. If you prefer something else, could you expand on your algorithm above? |
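One possible reading of the (output, label) pairing described above, as an illustrative sketch only (this is not the PR's actual implementation, and the function name is made up):

```python
import numpy as np

def multioutput_balanced_indices(y, random_state):
    """Treat each (output, label) pair as a class, find the global minimum
    class count n_min, then bootstrap n_min indices from every label of every
    output and concatenate the results (illustrative sketch only)."""
    y = np.asarray(y)
    if y.ndim == 1:
        y = y[:, None]
    # Global minimum count over all (output, label) pairs.
    n_min = min(np.unique(col, return_counts=True)[1].min() for col in y.T)
    indices = []
    for col in y.T:                    # one output at a time
        for label in np.unique(col):   # one label of that output at a time
            members = np.flatnonzero(col == label)
            indices.append(random_state.choice(members, size=n_min, replace=True))
    return np.concatenate(indices)
```

As noted above, when labels are correlated across outputs the concatenated sample is only approximately balanced.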
I will try to find time to look it over soon @potash. In the meantime you may like to

Adding tests will be required for any eventual merge. There's some you could butcher for

I still feel that this should be incorporated into the existing class_weight parameter. |
As for the "algorithm" I mentioned, it's a pretty simple idea:
So the independent balanced samples for

I'll have to look at your example and think about potential corner-cases for your solution to the multi-output case. It sounds rational at first glance though. |
@trevorstephens Ah, so my method is similar to your second one except instead of samples of size 3 and 2 for y0, y1 respectively it will just use size 2 (the global min) for all y_i. Let me know what you think. I'll get to the documentation and tests soon. |
Force-pushed from 0c2a0ed to 8a5a645
Force-pushed from 8a5a645 to 59f7c85
Succeeded by #13227. |
I have implemented balanced random forest as described in Chen, C., Liaw, A., and Breiman, L. (2004), "Using Random Forest to Learn Imbalanced Data", Tech. Rep. 666, University of California, Berkeley. It is enabled using the balanced=True parameter to RandomForestClassifier.
This is related to the class_weight='subsample' feature already available, but instead of down-weighting the majority class(es) it undersamples them. According to the referenced paper (and personal experience), balanced random forests perform well on very imbalanced data.
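A minimal usage sketch, assuming the balanced=True flag added on this PR's branch (the parameter does not exist in released scikit-learn; the dataset here is synthetic and stands in for the KDD Cup data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 2%/98% imbalanced problem, for illustration only.
X, y = make_classification(n_samples=10000, weights=[0.98, 0.02], random_state=0)

# balanced=True is the flag proposed in this PR; it is not available in
# released scikit-learn versions.
clf = RandomForestClassifier(n_estimators=100, balanced=True, random_state=0)
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]
```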
In order to do the balanced sampling we need some class summary data (distribution of classes, etc.). For efficiency, this is precomputed in fit() by the _get_balance_class_data() function and then passed to _parallel_build_trees(), which, when balanced=True is specified, calls generate_balanced_sample_indices() instead of the default _generate_sample_indices().
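The balanced bootstrap itself amounts to drawing, with replacement, n_min indices from every class, where n_min is the size of the smallest class. A minimal single-output sketch (not the PR's actual generate_balanced_sample_indices()):

```python
import numpy as np

def balanced_sample_indices(y, random_state):
    """Balanced bootstrap for one tree (illustrative sketch, not PR code):
    every class contributes n_min indices drawn with replacement, where
    n_min is the size of the smallest class, undersampling the majority."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    per_class = [random_state.choice(np.flatnonzero(y == c), size=n_min, replace=True)
                 for c in classes]
    return np.concatenate(per_class)
```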
If there is interest in this feature, I'd be happy to write some tests for it and discuss code style, etc. Thanks!