[WIP] Loss Module: replace in logistic regression and hgbt #19089
Conversation
Force-pushed from d924b4f to 4e00738.
Benchmarks of fit time
Hardware: Intel Core i7-8559U (8th generation), 16 GB RAM
HistGradientBoostingClassifier
Thanks for the benchmarks! I only checked the HistGradientBoosting ones. Are these the times spent in `fit`? We probably don't need to worry about …

Also, what's the reason for benchmarking with and without early stopping? Is it so that we get more loss computations, or is it because the loss results might change significantly enough to affect the early-stopping decision?

It might be worth benchmarking with OMP_NUM_THREADS=1 as well, to get an idea of the intrinsic benefits of the new implementation without multi-threading.
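As an illustration of that suggestion, a minimal sketch of what a single-threaded fit-time benchmark could look like; the dataset, sizes, and use of `make_classification` are assumptions for illustration, not the script behind the numbers reported above:

```python
# Hypothetical benchmark sketch (not the script used for the numbers above).
# Run with OMP_NUM_THREADS=1 to measure single-threaded fit time, e.g.:
#     OMP_NUM_THREADS=1 python bench_hgbt.py
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

for early_stopping in (False, True):
    clf = HistGradientBoostingClassifier(
        early_stopping=early_stopping, random_state=0
    )
    tic = perf_counter()
    clf.fit(X, y)
    toc = perf_counter()
    print(f"early_stopping={early_stopping}: fit in {toc - tic:.2f} s")
```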
Nice benchmarks! Overall improvements, and the slowdowns are minor.
@NicolasHug I updated the benchmarks to include a single-threaded case for HGBT. All times are fit times, as now stated in the header.
It might make sense to only have logistic regression in this PR and have HGBT separately in #20811? |
@rth I'll open a new PR for …
Reference Issues/PRs
Solves #15123. Follow-up on #19088.
What does this implement/fix? Explain your changes.
The losses in `sklearn/ensemble/_hist_gradient_boosting` and in `sklearn.linear_model._logistic` are replaced with the new losses from #19088.
Any other comments?
Needs benchmarking. First rudimentary results indicate no regression.
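For illustration only, a minimal sketch of how the common loss objects introduced in #19088 can be used. This is based on the private `sklearn._loss` module as it later shipped in scikit-learn, so class and method names may differ from the state of this PR:

```python
import numpy as np

# Private module from #19088; not part of the public API.
from sklearn._loss.loss import HalfBinomialLoss

loss = HalfBinomialLoss()

# raw_prediction lives in link space (log-odds for the binomial loss).
y_true = np.array([0.0, 1.0, 1.0])
raw_prediction = np.array([-1.0, 0.5, 2.0])

# Compute per-sample loss values and gradients in a single pass.
loss_values, gradient = loss.loss_gradient(
    y_true=y_true, raw_prediction=raw_prediction
)
print(loss_values.mean(), gradient)
```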