
[WIP] PERF compute float32 grad and hess in HGBT #18659


Closed
wants to merge 2 commits into from

Conversation

lorentzenchr
Member

What does this implement/fix? Explain your changes.

BaseHistGradientBoosting uses float32 for storing gradients and hessians. However, the computations are done in full float64 and only converted to float32 at the end.
One can gain some performance, i.e. less fit time, by doing the expensive exp calculations in float32, too.

This PR basically replaces some Cython calls to `double exp(double)` with `float expf(float)`.

Any other comments?

Some test tolerances must be loosened, too.
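
For illustration only, a minimal NumPy sketch (not the Cython code changed by this PR) of the exp vs. expf trade-off: computing the sigmoid in float32 keeps roughly 7 significant digits instead of ~16.

import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=10)                               # raw predictions in float64

sig64 = 1.0 / (1.0 + np.exp(-raw))                      # like double exp(double)
sig32 = 1.0 / (1.0 + np.exp(-raw.astype(np.float32)))   # like float expf(float)

print(sig32.dtype)                                      # float32
print(np.abs(sig32.astype(np.float64) - sig64).max())   # on the order of float32 eps (~1e-7)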

@lorentzenchr
Member Author

lorentzenchr commented Oct 20, 2020

Benchmark

Edit: For better benchmarks, see comments/posts below.

Running `python scikit-learn/benchmarks/bench_hist_gradient_boosting.py`

Scikit-Learn Version 0.23

Data size: 1000 samples train, 1000 samples test.
Fitting a sklearn model...
score: 0.9730
fit duration: 0.154s,
score duration: 0.002s,
Data size: 10000 samples train, 10000 samples test.
Fitting a sklearn model...
score: 0.9840
fit duration: 0.246s,
score duration: 0.004s,
Data size: 100000 samples train, 100000 samples test.
Fitting a sklearn model...
score: 0.9860
fit duration: 0.532s,
score duration: 0.022s,
Data size: 500000 samples train, 500000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.072s,
score duration: 0.113s,
Data size: 1000000 samples train, 1000000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.467s,
score duration: 0.246s,

THIS PR

Data size: 1000 samples train, 1000 samples test.
Fitting a sklearn model...
score: 0.9730
fit duration: 0.069s,
score duration: 0.002s,
Data size: 10000 samples train, 10000 samples test.
Fitting a sklearn model...
score: 0.9840
fit duration: 0.070s,
score duration: 0.002s,
Data size: 100000 samples train, 100000 samples test.
Fitting a sklearn model...
score: 0.9860
fit duration: 0.316s,
score duration: 0.018s,
Data size: 500000 samples train, 500000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 0.815s,
score duration: 0.105s,
Data size: 1000000 samples train, 1000000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.158s,
score duration: 0.256s,

@@ -153,11 +156,12 @@ def _update_gradients_hessians_categorical_crossentropy(
        int n_samples = raw_predictions.shape[1]
        int k  # class index
        int i  # sample index
-       Y_DTYPE_C sw
+       G_H_DTYPE_C sw
Member

why change sw?

Member Author

Otherwise, in lines 187-188, gradients[k, i] = XXX * sw, the multiplication on the right-hand side would be done in double, which is then down-cast to float32 for assignment to the left-hand side.
Why do the multiplication in double precision when it's down-cast to float anyway?
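
As a NumPy analogue of that promotion rule (the actual code is Cython/C, where float * double is likewise computed in double):

import numpy as np

grad = np.float32(0.25)       # gradient stored as float32 (G_H_DTYPE_C)
sw_f64 = np.float64(2.0)      # sample weight kept as float64 (Y_DTYPE_C)
sw_f32 = np.float32(2.0)      # sample weight narrowed to float32 once

print((grad * sw_f64).dtype)  # float64: product must be down-cast again on assignment
print((grad * sw_f32).dtype)  # float32: no promotion, no extra down-cast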

Member

Why do the multiplication in double precision when it's down-cast to float anyway?

To avoid too many casts since casting isn't free. Though I don't know enough of the details to understand how costly they are, and how/when exactly they happen.

Note that according to this logic, y_true in _update_gradients_hessians_categorical_crossentropy should be converted to float too since we do gradients[i] = p_i - y_true[i] (I don't think we should, just noting a discrepancy.)

Member Author

y_true is a memoryview, i.e. double[:]. I guess no implicit type conversion is possible there. According to this answer, the underlying type is a struct containing pointers. Providing a float32 array as an argument throws an error instead of casting.

            sum_exps += p[i, k]

        for k in range(prediction_dim):
            p[i, k] /= sum_exps


-cdef inline Y_DTYPE_C _cexpit(const Y_DTYPE_C x) nogil:
+cdef inline G_H_DTYPE_C _cexpit(const G_H_DTYPE_C x) nogil:
Member

this is now expecting a float but what is passed to _cexpit is still a double

Member Author

If I understand C correctly, in particular type casting, the argument of _cexpit will be implicitly converted to the type specified in the function definition, which is now float32.
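
A hypothetical NumPy analogue of _cexpit after this change (the real function is a cdef inline in Cython; cexpit_f32 is an illustrative name, not code from the PR):

import numpy as np

def cexpit_f32(x):
    x32 = np.float32(x)  # C narrows the double argument implicitly at the call; done explicitly here
    return np.float32(1.0) / (np.float32(1.0) + np.exp(-x32))

print(cexpit_f32(0.5), cexpit_f32(0.5).dtype)  # float32 result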

Comment on lines 99 to 101
assert_allclose(loss.inverse_link_function(optimum), y_true, atol=2e-5)
assert_allclose(func(optimum), 0, atol=2e-11)
assert_allclose(get_gradients(y_true, optimum), 0, atol=2e-5)
Member

these are quite big differences!

Member Author

Yes, they are. This PR loses some precision in gradients and hessians, which stems from calling expf instead of exp. This is somehow amplified by the Newton solver; I had to reduce its tol, and the adjusted tolerances in the assertions are a consequence.

The big question is if this has an impact on fitting (convergence and attained training loss).

@NicolasHug
Member

@lorentzenchr could you provide benchmarks over many iterations (ideally in a table or a graph, with stddev ;) )

The Hist-GBDT code is pretty unstable w.r.t. computation time and reporting just one run isn't accurate enough.

Could you also check what lightGBM does?

@lorentzenchr
Member Author

@NicolasHug Thanks for your fast review. I'll provide better benchmarks as soon as I find time.
I'm not yet entirely convinced that we want to have this. But I wanted to show some untapped performance potential.

@ogrisel
Member

ogrisel commented Oct 23, 2020

I doubt that loss gradient/hessian computations are performance critical. This would need to be checked via some profiling (e.g. using https://github.com/emeryberger/scalene or https://github.com/benfred/py-spy), but I wouldn't mind making the code more consistent. However, the loss of precision induced by expf is quite significant...

Could you also check what lightGBM does?

I agree this is a good idea :)

If we decide to keep the gradient values in float32 but upcast locally to float64 for some intermediate computations that are numerically sensitive we should add inline comments to explain that this is intentional.

@lorentzenchr
Member Author

Trying to understand the LightGBM source code, I think they do as we currently do, see here:

Calculate in double precision, i.e. exp(double), then cast to score_t, which is either double or float, see here.

@@ -91,14 +91,14 @@ def fprime2(x: np.ndarray) -> np.ndarray:
        return get_hessians(y_true, x)

    optimum = newton(func, x0=x0, fprime=fprime, fprime2=fprime2,
-                     maxiter=70, tol=5e-6)
+                     maxiter=70, tol=5e-7)
Member

Interesting. I am not sure what we should conclude from the fact that more Newton iterations converge to the true optimum even with a lower-precision gradient computation...

@ogrisel
Member

ogrisel commented Oct 23, 2020

Calculate in double precision, i.e. exp(double), then cast to score_t which is either double or float, see here.

You could use a C/C++ debugger to check what the concrete score_t type is when running LightGBM from the usual Python bindings.

@lorentzenchr
Member Author

lorentzenchr commented Oct 23, 2020

Benchmark results are unclear. Up to 100,000 samples, this PR seems beneficial. For more samples, it seems worse.
[benchmark figure: fit wall time vs. n_samples]
Note that every point is the average over 3 runs. X-axis is in units of 1000.

So far, I have not found a difference in training log loss worth mentioning.

from time import time

import matplotlib.pyplot as plt
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import log_loss

from neurtu import delayed, timeit, Benchmark


def make_benchmark(
    n_features=50,
    ax=None,
    n_samples_range=np.linspace(100, 10_000, num=10, dtype=int),
    repeat=1,
    metric_name="wall_time",
    **kwargs,
):
    """Benchmark HGBT for criterion 'mse' and 'poisson'"""

    dataset_kwargs = dict(
        n_samples=n_samples_range.max(),
        n_features=n_features,
        n_informative=n_features * 2 // 3,
        random_state=42,
    )
    X, y = make_classification(
        n_clusters_per_class=1, **dataset_kwargs
        )

    bench = []
    
    # nicer numbers
    if n_samples_range.max() > 10_000:
        n_samples_scale = 1000
    else:
        n_samples_scale = 1
    
    for n_samples in n_samples_range:
        model = HistGradientBoostingClassifier(**kwargs)
        bench.append(
            delayed(
                model,
                tags={"n_samples": n_samples / n_samples_scale},
            ).fit(X[:n_samples, :n_features], y[:n_samples])
        )
    res = Benchmark(repeat=repeat, **{metric_name: True})(bench)
    if repeat > 1:
        res = res.loc[:, (metric_name, "mean")]
    else:
        res = res.loc[:, metric_name]
    ax = res.plot(marker="o", ax=ax)
    ax.set(
        title=f"n_features={n_features}",
        xlim=(0, n_samples_range.max() / n_samples_scale),
    )
    if metric_name == "wall_time":
        ax.set(ylabel="wall time (s)")
    elif metric_name == "peak_memory":
        ax.set(ylabel="peak memory (MB)")
    if n_samples_scale > 1:
        ax.set(xlabel="n_samples (10³)")
    else:
        ax.set(xlabel="n_samples")
        
    return res, ax


fig, ax = plt.subplots(1, 2, figsize=(12, 4))

n_samples_range = np.r_[np.linspace(1000, 50_000, num=10, dtype=int),
                        np.linspace(100_000, 1000_000, num=5, dtype=int)]

res_f10, _ = make_benchmark(n_features=10, n_samples_range=n_samples_range,
                            ax=ax[0], repeat=3)
res_f100, _ = make_benchmark(n_features=100, n_samples_range=n_samples_range,
                             ax=ax[1], repeat=3)

fig.suptitle("HGBT Binary Classifier"
    + " benchmark on make_classification() data"
)

Results

# columns: n_samples (10³), fit time float32 (this PR) / fit time master
ratio_f10 = np.array([
    [1.000 , 0.373504 / 0.555648],
    [6.444 , 0.424278 / 0.588441],
    [11.888, 0.505906 / 0.777491],
    [17.333, 0.594497 / 0.704212],
    [22.777, 0.625491 / 1.090951],
    [28.222, 0.793513 / 0.981356],
    [33.666, 0.742936 / 0.899285],
    [39.111, 0.903978 / 0.963037],
    [44.555, 0.946626 / 0.987644],
    [50.000, 0.940014 / 1.017219],
    [100.00, 1.410459 / 1.337301],
    [325.00, 3.347485 / 2.849119],
    [550.00, 5.130298 / 4.385706],
    [775.00, 6.545066 / 6.343583],
    [1000.0, 8.782194 / 7.953538],
])

ratio_f100 = np.array([
    [1.000 ,  0.547719 /  1.365256],
    [6.444 ,  1.248298 /  1.743352],
    [11.888,  1.453784 /  1.932066],
    [17.333,  1.730546 /  2.061890],
    [22.777,  1.803699 /  2.202086],
    [28.222,  1.822677 /  2.298813],
    [33.666,  2.045215 /  2.445927],
    [39.111,  2.257485 /  2.536569],
    [44.555,  2.342646 /  2.662033],
    [50.000,  2.556265 /  2.765742],
    [100.00,  3.880260 /  3.999123],
    [325.00, 10.163691 /  9.469649],
    [550.00, 15.635915 / 13.668904],
    [775.00, 20.412055 / 17.654048],
    [1000.0, 24.707967 / 21.811183],
])

fig, ax = plt.subplots(1, 2, figsize=(12, 4))

ax[0].plot(ratio_f10[:, 0], ratio_f10[:, 1], marker="o")
ax[1].plot(ratio_f100[:, 0], ratio_f100[:, 1], marker="o")
for i, n_features in enumerate([10, 100]):
    ax[i].set(
        title=f"n_features={n_features}",
        ylabel="time float32 / time master",
        xlabel="n_samples (10³)",
        xscale="log",
    )
fig.suptitle("Time Ratio float32 / master")

@NicolasHug
Member

Thanks for the benchmarks @lorentzenchr

I think we should explicitly deactivate early stopping for such benchmarks because it may impact the actual number of iterations, especially if there are differences in how the losses are computed: early_stopping=False
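
For example, a sketch of how the benchmark model could be constructed with early stopping disabled (only early_stopping is the relevant parameter here; the rest is arbitrary):

from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(early_stopping=False, random_state=42)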

Also absolute times are worth mentioning: a ratio of 1.2 is probably more significant when fitting time takes hours vs a few ms.

For now I can't say I'm convinced that such a change is needed because it's quite subtle; there's a solid chance we may break/revert it in the future without having a way of catching it, and the benefits are unclear so far.


BTW, if the plan is to improve the hist-GBDTs, one (not-so-low) hanging fruit is to make early stopping on scorers faster: right now, to compute the score at iteration i, we need to re-compute all the predictions from iteration 0 to iteration i. It's really slow but that's what the current design of the scorers forces us to do. It'd be great if we could make that faster by caching the predictions, just like we do when early stopping is checked on the loss.

@lorentzenchr
Member Author

The same with early_stopping=False:
[benchmark figure with early_stopping=False]

Absolute timing of this PR (see also detail section of the benchmark above):
[figure: absolute fit times of this PR]

Comparing training log loss for n_features=20 gives:

  • n_samples = 1e+04: logloss PR/logloss master = 9.99999999e-01
  • n_samples = 1e+05: logloss PR/logloss master = 9.99999997e-01
  • n_samples = 1e+06: logloss PR/logloss master = 9.98733125e-01

How can this PR get better training error?

Conclusion: Benchmarking is tricky. Maybe I'll rerun it.
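
For reference, a hypothetical sketch of how such a log-loss ratio can be computed (model_pr, model_master, X_train and y_train are illustrative names, not objects from the script above):

from sklearn.metrics import log_loss

loss_pr = log_loss(y_train, model_pr.predict_proba(X_train))
loss_master = log_loss(y_train, model_master.predict_proba(X_train))
print(f"logloss PR/logloss master = {loss_pr / loss_master:.8e}")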

@lorentzenchr
Member Author

@NicolasHug Thanks for the tip. If time and climbing expertise permit, I'll have a look at the other not-so-low-hanging fruits 🍎

@ogrisel
Member

ogrisel commented Oct 26, 2020

How can this PR get better training error?

Maybe just luck in the rounding errors. The difference does not seem that big.

What I do not understand is why you observe a speed-up with n_samples > 1e5 and early_stopping=False but a slowdown with early_stopping=True, while the training loss is either the same or even slightly better with your code change.

@ogrisel
Member

ogrisel commented Oct 26, 2020

I would love to see the same benchmark repeated 5 times or more with the mean or trimmed mean time + std error bars to be confident that this change is really an improvement.
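
A minimal repeated-timing sketch, independent of the neurtu-based script above (model, X and y are assumed to come from the benchmark setup):

import numpy as np
from time import perf_counter
from sklearn.base import clone

times = []
for _ in range(5):
    est = clone(model)                # fresh, unfitted estimator for each run
    tic = perf_counter()
    est.fit(X, y)
    times.append(perf_counter() - tic)

print(f"fit time: {np.mean(times):.3f}s ± {np.std(times):.3f}s over {len(times)} runs")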

Base automatically changed from master to main January 22, 2021 10:53
@jjerphan
Member

@lorentzenchr: what is the current status of this PR? Should it be revisited with the recent changes for loss functions?

@lorentzenchr
Member Author

@lorentzenchr: what is the current status of this PR? Should it be revisited with the recent changes for loss functions?

I'm not sure; I think we can close it.
In the new loss function module, the single-data-point functions all work in double precision (as before in HGBT).

  • Therefore, I guess the timings wouldn't change.
  • On top of that, I'm not sure adding 32-bit float support would be that easy without code bloat. (I would be interested to know if I'm mistaken.)

Given what @NicolasHug said:

BTW, if the plan is to improve the hist-GBDTs, one (not-so-low) hanging fruit is to make early stopping on scorers faster: right now, to compute the score at iteration i, we need to re-compute all the predictions from iteration 0 to iteration i. It's really slow but that's what the current design of the scorers forces us to do. It'd be great if we could make that faster by caching the predictions, just like we do when early stopping is checked on the loss.

my conclusion is to close this and focus on the larger bottlenecks instead.
