
Speed up example on plot_gradient_boosting_categorical.py #21634


Conversation

@pedugnat (Contributor) commented Nov 11, 2021

Reference Issues/PRs

Contributes to #21598.

What does this implement/fix? Explain your changes.

In order to speed up the example, I did two main things (sketched below):

  1. Decreased the number of columns of the chosen dataset from 80 to 20, keeping both numerical and categorical columns
  2. Decreased the number of iterations of the HistGradientBoostingRegressor from 100 to 50
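A minimal sketch of those two changes, fetching ames-housing via its OpenML data_id as in the example (the column subset below is a placeholder, not the exact one kept in the PR):

    from sklearn.datasets import fetch_openml
    from sklearn.ensemble import HistGradientBoostingRegressor

    X, y = fetch_openml(data_id=41211, as_frame=True, return_X_y=True)

    # 1. keep only 20 of the 80 columns (placeholder subset; the PR keeps a
    #    mix of categorical and numerical columns)
    X = X[X.columns[:20]]

    # 2. halve the number of boosting iterations (the default max_iter is 100)
    model = HistGradientBoostingRegressor(max_iter=50, random_state=42)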

Please see the before/after plots and the cProfile output of the old and new versions: on my computer, the new version takes 4.5 s vs. 15.5 s, and the ranking of the fit times and errors (plots) is the same in both figures:

  • cProfile, new version: 3894443 function calls (3808982 primitive calls) in 4.671 seconds

  • cProfile, old version: 6640438 function calls (6468083 primitive calls) in 15.580 seconds

  • plots, new version: [Figure_1, Figure_2]

  • plots, old version: [sphx_glr_plot_gradient_boosting_categorical_001, sphx_glr_plot_gradient_boosting_categorical_002]

@pedugnat (Contributor, Author) commented:

I'm not sure why the CI fails with the following error, since I didn't modify the corresponding line:

File "/home/circleci/project/examples/ensemble/plot_gradient_boosting_categorical.py", line 33, in <module>
X, y = fetch_openml(data_id=41211, as_frame=True, return_X_y=True)

@TomDLT (Member) left a comment:

Thanks for the PR.
Here are a few comments:

@TomDLT (Member) commented Nov 11, 2021

The CI error seems unrelated to this PR and caused by an unfortunate incompatibility between new numpy and old pandas (see numpy/numpy#18355).

@TomDLT changed the title from "Speed up example gradient boosting categorical plot_gradient_boosting_early_stopping.py #21598" to "Speed up example on categorical gradient boosting" Nov 11, 2021
@TomDLT mentioned this pull request Nov 11, 2021
pedugnat and others added 3 commits November 11, 2021 22:55 (Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>)
@ogrisel (Member) left a comment:

Thanks for the PR. However, I am not 100% sure the added complexity of the column filtering is worth the speed benefit.

Also, limiting max_iter to 50 in the first part of the example makes the difference with the last part a bit less visible, since the models in the first part are now underfitting a bit as well (native categorical splits are now always better than OHE and Ordinal Encoding, while this was not the case before).

/cc @NicolasHug.
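For context, the strategies being compared look roughly like this; a sketch under assumed settings, not the example's exact code (the example also includes a variant that simply drops the categorical columns). X is the ames-housing dataframe from above:

    import numpy as np
    from sklearn.compose import make_column_selector, make_column_transformer
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    categorical_columns = make_column_selector(dtype_include="category")

    # OHE: one-hot encode categorical columns, pass numerical ones through
    # (dense output; newer scikit-learn spells this sparse_output=False)
    hist_one_hot = make_pipeline(
        make_column_transformer(
            (OneHotEncoder(sparse=False, handle_unknown="ignore"),
             categorical_columns),
            remainder="passthrough",
        ),
        HistGradientBoostingRegressor(random_state=42),
    )

    # Ordinal Encoding: map each category to an integer
    ordinal_encoder = make_column_transformer(
        (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
         categorical_columns),
        remainder="passthrough",
    )
    hist_ordinal = make_pipeline(
        ordinal_encoder, HistGradientBoostingRegressor(random_state=42)
    )

    # Native categorical splits: tell the booster which columns are categorical,
    # via a boolean mask (the column transformer puts the encoded categorical
    # columns first)
    n_categorical_features = X.select_dtypes(include="category").shape[1]
    n_numerical_features = X.shape[1] - n_categorical_features
    categorical_mask = (
        [True] * n_categorical_features + [False] * n_numerical_features
    )
    hist_native = make_pipeline(
        ordinal_encoder,
        HistGradientBoostingRegressor(
            categorical_features=categorical_mask, random_state=42
        ),
    )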

Comment on lines 176 to 177
mape = [np.mean(np.abs(item)) for item in items]
std_pred = [np.std(item) for item in items]

Not sure what std_pred stands for. Maybe the following would be more explicit:

Suggested change:
-mape = [np.mean(np.abs(item)) for item in items]
-std_pred = [np.std(item) for item in items]
+mape_cv_mean = [np.mean(-item) for item in items]
+mape_cv_std = [np.std(item) for item in items]

the lines below will also need to be adjusted.
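For context, `items` presumably holds the per-model cross-validation score arrays; with the negated-MAPE scorer those values are negative, hence the `-item` above. A sketch of where such arrays would come from (model names assumed from the example):

    from sklearn.model_selection import cross_validate

    scoring = "neg_mean_absolute_percentage_error"  # higher (closer to 0) is better
    results = [
        cross_validate(model, X, y, cv=3, scoring=scoring)
        for model in (hist_dropped, hist_one_hot, hist_ordinal, hist_native)
    ]
    items = [result["test_score"] for result in results]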

print(f"Number of categorical features: {n_categorical_features}")
print(f"Number of numerical features: {n_numerical_features}")
print(f"Number of categorical features: {n_columns}")
print(f"Number of numerical features: {n_columns}")

This will become wrong if n_columns is adjusted to another value. Please move back n_categorical_features and n_numerical_features here instead.
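One way to keep those counts correct regardless of how the column subset changes is to derive them from the filtered dataframe itself (a sketch; the example may compute them differently):

    n_categorical_features = X.select_dtypes(include="category").shape[1]
    n_numerical_features = X.select_dtypes(include="number").shape[1]
    print(f"Number of categorical features: {n_categorical_features}")
    print(f"Number of numerical features: {n_numerical_features}")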

@pedugnat (Contributor, Author) commented:

Thanks for your comments.

I thought that reducing the number of columns would be the most promising option, since 80 columns are unnecessary for the example to be relevant. I agree that the selection procedure adds too much complexity.

Would you have any idea either on how to do the column selection in a clear and simple way (I was thinking we could manually select the columns and store them in a list at the beginning of the example, but that doesn't look very clean either) or, more generally, on how to reduce the training time?

Since the first part seems to be underfitting a bit with the new number of iterations, maybe I could increase the learning_rate instead of increasing the number of iterations? That way the time gain would be kept.
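For illustration, that trade-off would look something like this (values made up; the library defaults are learning_rate=0.1 and max_iter=100):

    # fewer, larger boosting steps: roughly similar fit quality at half the cost
    model = HistGradientBoostingRegressor(learning_rate=0.2, max_iter=50)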

@ogrisel (Member) commented Nov 12, 2021

Maybe you can subselect 10 categorical columns and 5 numerical columns manually by name and only change 1 line in the example? Try to focus on columns that are the most informative by running a permutation importance analysis on this dataset (outside of the example).
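A sketch of that one-off analysis (run outside the example; hist_ordinal is one of the encoding pipelines sketched earlier, and the split parameters are illustrative):

    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    hist_ordinal.fit(X_train, y_train)

    result = permutation_importance(
        hist_ordinal, X_test, y_test, n_repeats=10, random_state=42
    )
    # rank columns from most to least informative
    ranked = X_test.columns[result.importances_mean.argsort()[::-1]]
    top_categorical = [c for c in ranked if str(X_test[c].dtype) == "category"][:10]
    top_numerical = [c for c in ranked if str(X_test[c].dtype) != "category"][:5]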


@adrinjalali changed the title from "Speed up example on categorical gradient boosting" to "Speed up example on plot_gradient_boosting_categorical.py" Nov 12, 2021
@pedugnat (Contributor, Author) commented:

@ogrisel I ran the permutation importance analysis you describe and selected 10 categorical variables and 10 numerical variables.
I filtered the dataset to keep only these columns, and here are the plots:

They don't exactly match the previous ones, particularly because the version dropping categorical features now performs significantly worse than the other versions (since there are fewer numerical variables, it may be harder for the 'dropped' version to learn enough). However, the benefit of using the native handling of categorical features is still shown.

The new version now takes 4 seconds to run (instead of ~15).

What do you think?

[plots: fit times and test errors with the filtered dataset]

@ogrisel (Member) commented Nov 12, 2021

> They don't exactly match the previous ones, particularly because the version dropping categorical features now performs significantly worse than the other versions (since there are fewer numerical variables, it may be harder for the 'dropped' version to learn enough).

It makes sense. I think this is not a problem.

> However, the benefit of using the native handling of categorical features is still shown.

And, more importantly in my opinion, the good predictive performance of the OHE and Ordinal Encoder strategies when the model does not underfit is still there.

Can you confirm that you do not limit max_iter anymore in the first part of the example?

> The new version now takes 4 seconds to run (instead of ~15).

> What do you think?

This looks good, please feel free to update your PR accordingly.

@pedugnat (Contributor, Author) commented:

Ok, I just updated the PR.

I can confirm that I restored max_iter to its default value.

@pedugnat (Contributor, Author) commented:

@ogrisel @TomDLT any other comments on your side?

@TomDLT (Member) left a comment:

LGTM

The running time only goes from 12 sec down to 9 sec, but I am not sure how far we can go while keeping all the insights.
This PR also cleans up a few things in the example. Thanks!

@adrinjalali (Member) commented:

This is a nice improvement, but we have this warning:

    /home/circleci/project/sklearn/datasets/_openml.py:876: UserWarning: Version 1 of dataset ames-housing is inactive, meaning that issues have been found in the dataset. Try using a newer version from this URL: https://www.openml.org/data/v1/download/20649135/ames-housing.arff

It'd be nice if you could investigate the issue, @pedugnat. Merging this one as is, since the warning is not caused by this PR.
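One possible fix, assuming the active OpenML version is compatible with the example, would be to fetch the dataset by name rather than pinning the inactive data_id (untested sketch):

    from sklearn.datasets import fetch_openml

    # version="active" (the default) resolves to the newest non-deprecated
    # version of the dataset instead of the inactive version 1
    X, y = fetch_openml(
        name="ames-housing", version="active", as_frame=True, return_X_y=True
    )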

@adrinjalali adrinjalali merged commit 6a693ac into scikit-learn:main Nov 24, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021

Speed up example on plot_gradient_boosting_categorical.py (scikit-learn#21634)

* sped up the example by reducing the number of columns and the number of iterations, along with code changes

* removed warning filter as it breaks the CI

* change in a comment

Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>

* fixed typo

* made code more robust as per PR comment

* applied black

* applied black

Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021
glemaitre pushed a commit that referenced this pull request Dec 25, 2021