
[WIP] Loss Module: replace in logistic regression and hgbt #19089


Conversation

lorentzenchr
Member

Reference Issues/PRs

Solves #15123. Follow-up on #19088.

What does this implement/fix? Explain your changes.

The losses in sklearn/ensemble/_hist_gradient_boosting and in sklearn.linear_model._logistic are replaced with the new losses from #19088.

Any other comments?

Needs benchmarking. First rudimentary results indicate no regression.

@lorentzenchr lorentzenchr force-pushed the loss_module_logistic_and_hgbt branch from d924b4f to 4e00738 (January 2, 2021 15:38)
@lorentzenchr
Member Author

lorentzenchr commented Jan 18, 2021

Benchmarks of Fit Time

Hardware: Intel Core i7-8559U, 8th generation, 16 GB RAM
Software: Python 3.7.9, numpy 1.19.5, scipy 1.5.2
See: https://github.com/lorentzenchr/notebooks/blob/master/bench_loss_module_logistic_and_hgbt.ipynb
This PR (based on 6b4f824) and master (5a63f90) both compiled the same way.

HistGradientBoostingClassifier

N=n_samples, n_features=20
Error bars from 20 runs each.

OMP_NUM_THREADS=6 (hardware maximum is 8)

[Figure: fit time vs. N for HistGradientBoostingClassifier, OMP_NUM_THREADS=6]

OMP_NUM_THREADS=1

[Figure: fit time vs. N for HistGradientBoostingClassifier, OMP_NUM_THREADS=1]

Summary HGBT

Binary Classification
  • One sees a slight performance degradation in the multi-threaded setting, but an improvement in the single-threaded setting. As the new implementation is numerically more stable at a slight computational cost, some degradation was anticipated. The surprising part is the improvement in single-threaded mode.
  • Visible in the multi-threaded setting: With early stopping enabled, the parallel computation of the loss gives overall a slight speedup.
Multiclass Classification
  • A slight performance improvement is indicated. This is expected as this PR uses one loop over n_classes less than master.
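The "one loop over n_classes less" point can be illustrated with a minimal NumPy sketch (illustrative only; the PR itself uses Cython, and `multiclass_grad` is a hypothetical name): the softmax probabilities and the gradient are produced in one vectorized pass, with no separate per-class loop and no materialized one-hot target.

```python
import numpy as np

def multiclass_grad(raw_predictions, y_true):
    """Gradient of the multinomial log loss w.r.t. raw predictions.

    raw_predictions: float array of shape (n_samples, n_classes)
    y_true: int array of shape (n_samples,), label-encoded classes
    """
    # Numerically stable softmax: subtract the row-wise max first.
    p = raw_predictions - raw_predictions.max(axis=1, keepdims=True)
    np.exp(p, out=p)
    p /= p.sum(axis=1, keepdims=True)
    # gradient = softmax - one_hot(y_true), written without building
    # the one-hot matrix: subtract 1 only at the true-class positions.
    grad = p
    grad[np.arange(y_true.shape[0]), y_true] -= 1.0
    return grad
```

Because each gradient row sums to zero by construction, this is easy to sanity-check.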

LogisticRegression - lbfgs and newton-cg

n_features=50
Error bars from 10 runs each.
[Figure: fit time vs. n_samples for LogisticRegression with lbfgs and newton-cg solvers]

Summary Logistic Regression

  • Binary Logistic Regression: Overall improvement in single threaded mode. Multi-threaded mode seems to become interesting for large sample sizes.
  • Multiclass Logistic Regression: Same as binary. Note that memory consumption should be lower as this PR uses label encoded target (aka ordinal encoding) while master uses binarized label encoding (aka one-hot encoding).
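The memory claim in the multiclass bullet can be made concrete with a small sketch (illustrative numbers, not the PR's code): an ordinal-encoded target stores one integer per sample, while a label-binarized (one-hot) target stores n_classes values per sample.

```python
import numpy as np

n_samples, n_classes = 100_000, 10
rng = np.random.default_rng(0)

# Ordinal (label) encoding: one int32 entry per sample.
y_ordinal = rng.integers(0, n_classes, size=n_samples).astype(np.int32)

# One-hot (label-binarized) encoding: n_classes float64 entries per sample.
y_onehot = np.zeros((n_samples, n_classes), dtype=np.float64)
y_onehot[np.arange(n_samples), y_ordinal] = 1.0

print(y_ordinal.nbytes)  # 400000 bytes
print(y_onehot.nbytes)   # 8000000 bytes, 20x larger here
```

The exact ratio depends on dtypes and n_classes, but the one-hot target always scales with n_samples * n_classes.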

Last edit: 14 March 2021

Base automatically changed from master to main January 22, 2021 10:53
@NicolasHug
Member

Thanks for the benchmarks!

I only checked the HistGradientBoosting ones. Are these the times spent in fit, or the times spent computing the losses only? (Both would be interesting, although for the latter we probably don't need estimators; we can just call the losses directly.)

We probably don't need to worry about N < 10k as times are pretty short there (mostly < 1s, so the improvement ratio is less significant)

Also, what's the reason for benchmarking with and without early stopping? Is it so that we get more loss computations, or is it because the loss results might change significantly enough to affect the early-stopping decision?

It might be worth benchmarking OMP_NUM_THREADS=1 as well to get an idea of the intrinsic benefits of the new implem, without multi-threading
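Pinning the thread count for such a run is a one-liner; the environment variable only needs to be set for the child process (generic example, not the actual benchmark script):

```shell
# Restrict OpenMP to a single thread; the child process sees the setting.
# In practice, replace the inline command with the benchmark script.
OMP_NUM_THREADS=1 python3 -c 'import os; print(os.environ["OMP_NUM_THREADS"])'
```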

@GaelVaroquaux
Member

Nice benchmarks! Overall improvements, and the losses are minor.

@lorentzenchr
Member Author

@NicolasHug I updated the benchmarks to include a single-threaded case for HGBT. All times are fit times as now written in the header.

@rth
Member

rth commented Nov 27, 2021

It might make sense to only have logistic regression in this PR and have HGBT separately in #20811?

@lorentzenchr
Member Author

@rth I'll open a new PR for LogisticRegression and another one for the other GLMs, that's cleaner.

@lorentzenchr lorentzenchr deleted the loss_module_logistic_and_hgbt branch November 28, 2021 14:19
@lorentzenchr
Member Author

Loss replacement for HGBT is in #20811.
Loss replacement for LogisticRegression is in #21808.
