[MRG+1] Merge PresortBestSplitter and BestSplitter #5252
Conversation
@@ -251,6 +255,11 @@ def fit(self, X, y, sample_weight=None):

        random_state = check_random_state(self.random_state)

        X_idx_sorted = None
        if self.presort == True:
`if self.presort:` is more pythonic.
Ran some time tests with and without presorting for GradientBoostingClassifier. It looks like training time tracks some function of max_depth, n_samples, and n_features. Do you think it's worth finding a good switching heuristic, or should we mention it in the docs and leave it to users to decide whether they want to presort or not?
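For reference, a minimal timing sketch along these lines might look like the following (the dataset shape and parameters are illustrative, and the `presort` keyword is the one this branch adds):

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative problem size; where presorting pays off depends on
# max_depth, n_samples and n_features.
X, y = make_classification(n_samples=20000, n_features=50, random_state=0)

for presort in (True, False):
    clf = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                     presort=presort, random_state=0)
    start = time.time()
    clf.fit(X, y)
    print("presort=%s: %.2fs" % (presort, time.time() - start))
```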
I would leave that to the user and only enable presort for boosting. Let's keep things simple :)
I have added unit tests for Gradient Boosting and Forests to ensure that presorting returns the same results as not using presorting. I have also added unit tests for Gradient Boosting with sparse inputs, ensuring that they give the same results as dense inputs. Please review when you have the time.
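A sketch of what such an equivalence test can look like (illustrative only, not the actual tests added here; it assumes the `presort` parameter from this branch and sparse support in gradient boosting):

```python
from numpy.testing import assert_array_almost_equal
from scipy.sparse import csr_matrix
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Presorting on and off should yield the same fitted model on dense data.
with_presort = GradientBoostingClassifier(presort=True, random_state=0).fit(X, y)
without_presort = GradientBoostingClassifier(presort=False, random_state=0).fit(X, y)
assert_array_almost_equal(with_presort.predict_proba(X),
                          without_presort.predict_proba(X))

# Sparse input (presorting off) should match the dense result.
sparse_fit = GradientBoostingClassifier(presort=False,
                                        random_state=0).fit(csr_matrix(X), y)
assert_array_almost_equal(without_presort.predict_proba(X),
                          sparse_fit.predict_proba(X))
```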
(branch force-pushed from 8fd9f2a to ba674b6)
First, thanks @jmschrei for giving some love to this code. Here are a few thoughts about this pull request.

I find it great that you added sparsity support to GBRT (though I would have done that in another PR). As pointed out by some people in @pprett's pull request, it is needed to store the sparse …

I understand the need for, and the benefits of, merging more of the logic between the pre-sort best splitter and the best splitter. In this PR, it is proposed to interleave the code with …

I am not a fan of adding a …
Thanks for the comments @arjoly. To be clear, I haven't added sparsity support for presorting. I added an option to turn presorting off for gradient boosting, and when you do, you get the same sparsity support that decision trees and random forests get, which is why there isn't a separate PR for it.

I can understand the concern about spaghetti code. Presorting and best splitting are so similar that merging them makes much more sense than any other merger, because the difference is not in how the split is calculated, only in how Xf is prepared for the split. It requires adding only 31 lines of code (15 of which declare and initialize variables), versus an entirely new object and roughly 250 lines of repeated code.

Inline methods may be the correct way to merge more code in the future. I was thinking about factorizing the node_split method into several methods, such as "sort_array", "sort_array_w_presorting", and "sort_sparse_array", which would return Xf. Then a "best_split" method would take Xf and find the best split, or a "random_split" method would find the best random split. Lastly, a "partition_samples" method would be shared. I haven't fully formed the idea yet, but if you have ideas I'd love to see them.

I like the addition of presorting for these other estimators, but I'm not tied to it; it seems like a "why not" case. It's easy to add in, the default behaviour is not to use it (so users won't notice a difference unless they know what they're doing), and if it becomes difficult to maintain it can easily be removed. Maybe @glouppe will have an opinion.
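To make the factorization idea above a bit more concrete, here is a rough Python-level sketch of the structure being proposed; the method names come from the comment, but the signatures and the Cython details are hypothetical:

```python
class BestSplitter:
    """Structural sketch only; the real code lives in Cython (_splitter.pyx)."""

    def sort_array(self, feature):
        # Dense case without presorting: copy the feature values for the
        # samples in the current node into Xf and sort them.
        ...

    def sort_array_w_presorting(self, feature, X_idx_sorted):
        # Dense case with presorting: walk the precomputed ordering and keep
        # only the samples that belong to the current node.
        ...

    def sort_sparse_array(self, feature):
        # Sparse (CSC) case: gather and sort the nonzero values for the node.
        ...

    def best_split(self, Xf):
        # Scan the sorted Xf and return the threshold with the best
        # impurity improvement.
        ...

    def random_split(self, Xf):
        # Draw a random threshold between the min and max of Xf.
        ...

    def partition_samples(self, threshold):
        # Reorder the samples array around the chosen threshold;
        # shared by all splitters.
        ...

    def node_split(self, impurity):
        # Prepare Xf with one of the sort_* helpers above, pick the split
        # with best_split (or random_split), then call partition_samples.
        ...
```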
It is true that presorting is not very useful outside of GBRT. +1 for not adding this parameter to forests and keeping the public API as it is.
Presorting has been removed as an option for forests. Code, unit tests, and documentation have been updated to reflect this.
(branch force-pushed from 48e826f to ec28f15)
@@ -934,6 +935,13 @@ def fit(self, X, y, sample_weight=None, monitor=None):
            computing held-out estimates, early stopping, model introspect, and
            snapshoting.

        presort : bool, optional (default=False)
remove this empty line
You should test for presorting in … to … (and the same for the regressor).
Thanks for the review @glouppe! Good catch, I forgot to change that over. I've incorporated all your comments. It looks like a bunch of TSNE unit tests are failing. :(
Great, thanks for the changes! +1 for merge on my side.
Yep, rebased this branch onto master.
I've addressed comments, added an 'auto' default option, and added some more unit tests. Since this PR is getting rather hefty now, I am going to submit another PR in the future applying the same style changes to _splitter.pyx as _criterion.pyx, instead of handling that here.
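For illustration, resolving an 'auto' default at fit time could look roughly like this (a simplified sketch; the exact check in the branch may differ, and `resolve_presort` is a hypothetical helper name):

```python
from scipy.sparse import issparse


def resolve_presort(presort, X):
    """Map the user-facing presort value onto a concrete boolean.

    Sketch only: 'auto' presorts dense input and skips presorting for
    sparse input, which presorting does not support.
    """
    if presort == 'auto':
        return not issparse(X)
    if presort and issparse(X):
        raise ValueError("Presorting is not supported for sparse data.")
    return bool(presort)
```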
To grow a tree on sparse data, you need the data to be in sparse CSC format. However, to make a prediction, you need sparse CSR data. This means that whenever you build a boosting model, at each step you: …
@@ -1318,6 +1317,12 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
        If None, the random number generator is the RandomState instance used
        by `np.random`.

    presort : bool, optional (default='auto')
It should be `presort : bool or 'auto', optional (default='auto')`.
Got it. Apologies for the confusion.
Comments have been incorporated. For sparse data, CSC matrices are passed into fitting functions, and CSR matrices are passed into prediction functions. Thanks @arjoly for the thorough review.
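A minimal sketch of that conversion strategy, i.e. converting once before the boosting loop rather than at every stage (`prepare_sparse_views` is an illustrative helper name, not code from this branch):

```python
from scipy.sparse import csc_matrix, csr_matrix, issparse


def prepare_sparse_views(X):
    """Return (X_for_fit, X_for_predict) for the boosting loop.

    Sketch only: CSC is used to grow the trees, CSR to compute the
    stage predictions, and the conversion is paid once up front.
    """
    if issparse(X):
        return csc_matrix(X), csr_matrix(X)
    # Dense input can be used as-is for both fitting and prediction.
    return X, X
```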
Any more comments, @arjoly? If not, I believe we can merge this one.
        self.estimators_ = np.empty((0, 0), dtype=np.object)

    def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,
-                  criterion, splitter, random_state):
+                  random_state, X_idx_sorted, X_csc=None, X_csr=None):
Here it would be better to have X_fit and X_predict instead of X, X_csc and X_csr.
I disagree. X_csc and X_csr are explicit about what they hold. If I saw X_fit and X_predict in the wild, I'd assume they were different datasets; here they are the same dataset.
I made a last round of review.
Comments have been taken into account, and the commits have been squashed.
LGTM
thanks @jmschrei
thanks @jmschrei
Great job once again!
🎺 |
This pull request merges PresortBestSplitter into BestSplitter, without loss of functionality. All tree-based estimators (Decision Trees, Random Forests, Extra Trees, and Gradient Boosting) get an optional parameter "presort", which defaults to False for all except Gradient Boosting.
In addition to letting the other estimators turn presorting on for smaller datasets, this allows Gradient Boosting to turn presorting off on large datasets. A good default for this switch still needs to be found; for now it is set manually. Gradient Boosting can now also work on sparse data by turning off the presorting option.
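A short usage sketch of the parameter as described above (assuming this branch: presorting as an opt-in for plain decision trees, and sparse gradient boosting with presorting turned off):

```python
from scipy.sparse import csc_matrix
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Small dense dataset: opt in to presorting on a single decision tree.
tree = DecisionTreeClassifier(presort=True, random_state=0).fit(X, y)

# Sparse data: gradient boosting with presorting turned off, which is
# what enables sparse input on this branch.
gb = GradientBoostingClassifier(presort=False, random_state=0).fit(csc_matrix(X), y)
```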
Here is a checklist of what needs to be done (much more manageable than my last one):
As a side note, I'm getting pretty high variance running this benchmark, with my branch sometimes faster and sometimes slower by up to 50s. Can someone else run the benchmarks a few times and see what they get?
@arjoly @pprett @glouppe