Speed up example on plot_gradient_boosting_categorical.py #21634
Conversation
sped up the example by reducing the number of columns and the number of iterators, along with code changes
I'm not sure why the CI fails with the following error, since I didn't modify the corresponding line:
Thanks for the PR.
Here are a few comments:
The CI error seems unrelated to this PR and is caused by an unfortunate incompatibility between new numpy and old pandas (see numpy/numpy#18355).
Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
Thanks for the PR. However, I am not 100% sure the added complexity of the column filtering is worth the speed benefit.
Also, limiting `max_iter` to 50 for the first part of the example makes the difference with the last part of the example a bit less visible, since the models for the first part are now underfitting a bit as well (the native categorical handling is now always better than OHE and Ordinal Encoding, while it was not the case before).
/cc @NicolasHug.
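(For context, a minimal sketch of the parameter being discussed; the estimator is scikit-learn's, but the snippet is illustrative and not the example's actual code.)

```python
from sklearn.ensemble import HistGradientBoostingRegressor

# Default number of boosting iterations (max_iter=100), as used before this PR.
hgbr_default = HistGradientBoostingRegressor(random_state=0)

# Lowered number of iterations proposed in the first version of this PR; faster,
# but the models in the first part of the example may underfit.
hgbr_fast = HistGradientBoostingRegressor(max_iter=50, random_state=0)
```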
mape = [np.mean(np.abs(item)) for item in items]
std_pred = [np.std(item) for item in items]
Not sure what `std_pred` stands for. Maybe the following would be more explicit:
- mape = [np.mean(np.abs(item)) for item in items]
- std_pred = [np.std(item) for item in items]
+ mape_cv_mean = [np.mean(-item) for item in items]
+ mape_cv_std = [np.std(item) for item in items]
The lines below will also need to be adjusted.
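For reference, a minimal sketch of how such cross-validated MAPE summaries are typically produced with `cross_validate` and the `neg_mean_absolute_percentage_error` scorer; the data and estimator below are placeholders, not the example's actual code:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_validate

# Placeholder data; the example uses the Ames housing dataset instead.
X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# One cross_validate result per model being compared; a single model here.
results = [
    cross_validate(
        HistGradientBoostingRegressor(random_state=0),
        X,
        y,
        scoring="neg_mean_absolute_percentage_error",
        cv=3,
    )
]
items = [res["test_score"] for res in results]

# The scorer is negated, so flip the sign before averaging; the std is unaffected.
mape_cv_mean = [np.mean(-item) for item in items]
mape_cv_std = [np.std(item) for item in items]
print(mape_cv_mean, mape_cv_std)
```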
print(f"Number of categorical features: {n_categorical_features}") | ||
print(f"Number of numerical features: {n_numerical_features}") | ||
print(f"Number of categorical features: {n_columns}") | ||
print(f"Number of numerical features: {n_columns}") |
This will become wrong if `n_columns` is adjusted to another value. Please move back `n_categorical_features` and `n_numerical_features` here instead.
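A minimal sketch of how the two counts can be derived from the dataframe's dtypes so they stay correct if the column selection changes; the small dataframe here is a stand-in for the example's data:

```python
import pandas as pd

# Stand-in for the dataframe loaded by the example (fetch_openml with as_frame=True
# returns nominal columns with the "category" dtype).
X = pd.DataFrame(
    {
        "Neighborhood": pd.Categorical(["NAmes", "CollgCr", "NAmes"]),
        "HouseStyle": pd.Categorical(["1Story", "2Story", "1Story"]),
        "GrLivArea": [856, 1262, 920],
    }
)

n_categorical_features = (X.dtypes == "category").sum()
n_numerical_features = X.shape[1] - n_categorical_features

print(f"Number of categorical features: {n_categorical_features}")
print(f"Number of numerical features: {n_numerical_features}")
```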
Thanks for your comments. I thought that reducing the number of columns would be the most promising option, since 80 columns seem unnecessary for the example to be relevant. I agree that the selection procedure adds too much complexity. Would you have any idea on how to do the column selection in a clear and simple way (I was thinking we could manually select the columns and store them in a list at the beginning of the example, but it doesn't look very clean either), or more generally on how to reduce the training time? Since the first part seems to be underfitting a bit with the new number of iterations, maybe I could increase the number of iterations for that part?
Maybe you can subselect 10 categorical columns and 5 numerical columns manually by name and only change 1 line in the example? Try to focus on columns that are the most informative by running a permutation importance analysis on this dataset (outside of the example).
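A rough sketch of the offline analysis suggested above (run outside the example). The OpenML id and the ordinal-encoding pipeline are assumptions about the example's setup, not part of this PR:

```python
import numpy as np
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.datasets import fetch_openml
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Assumed to be the Ames housing dataset used by the example.
X, y = fetch_openml(data_id=42165, as_frame=True, return_X_y=True)

# Cast categorical columns to plain strings so missing values become the "nan"
# category and the encoder does not have to deal with NaN explicitly.
cat_cols = X.select_dtypes(include=["object", "category"]).columns
X[cat_cols] = X[cat_cols].astype(str)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ordinal-encode the categorical columns, pass the numerical ones through unchanged.
preprocessor = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        make_column_selector(dtype_include="object"),
    ),
    remainder="passthrough",
)
model = make_pipeline(preprocessor, HistGradientBoostingRegressor(random_state=0))
model.fit(X_train, y_train)

# Permutation importance on the held-out split, then rank the columns.
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
ranking = X.columns[np.argsort(result.importances_mean)[::-1]]
print(ranking[:20].tolist())
```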
@ogrisel I ran the permutation importance analysis you describe and selected 10 categorical variables and 10 numerical variables. The results don't exactly match the previous ones, in particular because the version dropping categorical features now performs significantly worse than the other versions (since there are fewer numerical variables, it may be harder for the 'dropped' version to learn enough). However, the benefit of using the native handling of categorical features is still shown. The new version now takes 4 seconds to run (instead of ~15). What do you think?
It makes sense. I think this is not a problem.
And, more importantly in my opinion, the good predictive performance of the OHE and Ordinal Encoding strategies when the model does not underfit is still there. Can you confirm that you do not limit `max_iter` anymore?
This looks good, please feel free to update your PR accordingly.
Ok, I just updated the PR. I can confirm that I restored `max_iter`.
LGTM
The running time only goes down from 12 sec to 9 sec, but I am not sure how far we can go while keeping all the insights.
This PR also cleans up a few things in the example. Thanks!
This is a nice improvement, but we have this warning:
It'd be nice if you could investigate the issue @pedugnat. Merging this one as is, since the warning is not caused by this PR.
Speed up example on plot_gradient_boosting_categorical.py (scikit-learn#21634)

* sped up the example by reducing the number of columns and the number of iterators, along with code changes
* removed warning filter as it breaks the CI
* change in a comment
* fixed typo
* made code more robust as per PR comment
* applied black
* applied black

Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
Reference Issues/PRs
Contributes to #21598.
What does this implement/fix? Explain your changes.
In order to speed up the example, I did two main things:

- reduced the number of columns used in the example
- lowered `max_iter` of `HistGradientBoostingRegressor` from 100 to 50

Please see the before/after for the plots, and the cProfile of the old and new versions. On my computer, the new version takes 4.5s vs 15.5s, and the ranking of fit times and errors (in the plots) is the same for both versions:
cProfile of the new version:
3894443 function calls (3808982 primitive calls) in 4.671 seconds
cProfile of the old version:
6640438 function calls (6468083 primitive calls) in 15.580 seconds
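For reference, a hedged sketch of how such cProfile summaries can be obtained; the script path assumes the example's location in the scikit-learn repository:

```python
import cProfile
import pstats

# Path assumed from the scikit-learn repository layout.
# Equivalent CLI: python -m cProfile -s cumulative <script>
script = "examples/ensemble/plot_gradient_boosting_categorical.py"

# Run the example under the profiler and dump the stats to a file.
code = compile(open(script).read(), script, "exec")
cProfile.run(code, "example.prof")

# Print the overall call counts / total time and the most expensive calls.
stats = pstats.Stats("example.prof")
stats.sort_stats("cumulative").print_stats(10)
```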
Plots for the new version: [figures: fit times and test scores]
Plots for the old version: [figures: fit times and test scores]