
[WIP] PERF compute float32 grad and hess in HGBT #18659


Closed
wants to merge 2 commits into from

Conversation

lorentzenchr
Member

What does this implement/fix? Explain your changes.

BaseHistGradientBoosting uses float32 for storing gradients and hessians. However, the computations are done in full float64 and only converted to float32 at the end.
One can gain some performance, i.e. less fit time, by doing the expensive exp calculations in float32, too.

This PR basically replaces some Cython calls to `double exp(double)` with `float expf(float)`.

Any other comments?

Some test tolerances must be loosened, too.
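
For illustration only, a minimal NumPy sketch (not the Cython code changed by this PR) of the exp vs. expf trade-off: computing the sigmoid in float32 keeps roughly 7 significant digits instead of ~16.

import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=10)                               # raw predictions in float64

sig64 = 1.0 / (1.0 + np.exp(-raw))                      # like double exp(double)
sig32 = 1.0 / (1.0 + np.exp(-raw.astype(np.float32)))   # like float expf(float)

print(sig32.dtype)                                      # float32
print(np.abs(sig32.astype(np.float64) - sig64).max())   # on the order of float32 eps (~1e-7)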

@lorentzenchr
Member Author

lorentzenchr commented Oct 20, 2020

Benchmark

Edit: For better benchmarks, see comments/posts below.

Running `python scikit-learn/benchmarks/bench_hist_gradient_boosting.py`

Scikit-Learn Version 0.23

Data size: 1000 samples train, 1000 samples test.
Fitting a sklearn model...
score: 0.9730
fit duration: 0.154s,
score duration: 0.002s,
Data size: 10000 samples train, 10000 samples test.
Fitting a sklearn model...
score: 0.9840
fit duration: 0.246s,
score duration: 0.004s,
Data size: 100000 samples train, 100000 samples test.
Fitting a sklearn model...
score: 0.9860
fit duration: 0.532s,
score duration: 0.022s,
Data size: 500000 samples train, 500000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.072s,
score duration: 0.113s,
Data size: 1000000 samples train, 1000000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.467s,
score duration: 0.246s,

THIS PR

Data size: 1000 samples train, 1000 samples test.
Fitting a sklearn model...
score: 0.9730
fit duration: 0.069s,
score duration: 0.002s,
Data size: 10000 samples train, 10000 samples test.
Fitting a sklearn model...
score: 0.9840
fit duration: 0.070s,
score duration: 0.002s,
Data size: 100000 samples train, 100000 samples test.
Fitting a sklearn model...
score: 0.9860
fit duration: 0.316s,
score duration: 0.018s,
Data size: 500000 samples train, 500000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 0.815s,
score duration: 0.105s,
Data size: 1000000 samples train, 1000000 samples test.
Fitting a sklearn model...
score: 0.9869
fit duration: 1.158s,
score duration: 0.256s,

@@ -153,11 +156,12 @@ def _update_gradients_hessians_categorical_crossentropy(
        int n_samples = raw_predictions.shape[1]
        int k  # class index
        int i  # sample index
-       Y_DTYPE_C sw
+       G_H_DTYPE_C sw
Member

why change sw?

Member Author

Otherwise, in lines 187-188, gradients[k, i] = XXX * sw, the multiplication on the right-hand side would be done in double, which is then down-cast to float32 for assignment to the left-hand side.
Why do the multiplication in double precision when it's down-cast to float anyway?
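
As a NumPy analogue of that promotion rule (the actual code is Cython/C, where float * double is likewise computed in double):

import numpy as np

grad = np.float32(0.25)       # gradient stored as float32 (G_H_DTYPE_C)
sw_f64 = np.float64(2.0)      # sample weight kept as float64 (Y_DTYPE_C)
sw_f32 = np.float32(2.0)      # sample weight narrowed to float32 once

print((grad * sw_f64).dtype)  # float64: product must be down-cast again on assignment
print((grad * sw_f32).dtype)  # float32: no promotion, no extra down-cast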

Member

Why do the multiplication in double precision when it's down-cast to float anyway?

To avoid too many casts since casting isn't free. Though I don't know enough of the details to understand how costly they are, and how/when exactly they happen.

Note that according to this logic, y_true in _update_gradients_hessians_categorical_crossentropy should be converted to float too since we do gradients[i] = p_i - y_true[i] (I don't think we should, just noting a discrepancy.)

Member Author

y_true is a memoryview, i.e. double[:]. I guess no implicit type conversion is possible there. According to this answer, the underlying type is a struct containing pointers. Providing a float32 array as an argument throws an error instead of casting.

            sum_exps += p[i, k]

        for k in range(prediction_dim):
            p[i, k] /= sum_exps


-cdef inline Y_DTYPE_C _cexpit(const Y_DTYPE_C x) nogil:
+cdef inline G_H_DTYPE_C _cexpit(const G_H_DTYPE_C x) nogil:
Member

this is now expecting a float but what is passed to _cexpit is still a double

Member Author

If I understand C correctly, in particular type casting, the argument of _cexpit will be implicitly converted to the type specified in the function definition, which is now float32.
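
A hypothetical NumPy analogue of _cexpit after this change (the real function is a cdef inline in Cython; cexpit_f32 is an illustrative name, not code from the PR):

import numpy as np

def cexpit_f32(x):
    x32 = np.float32(x)  # C narrows the double argument implicitly at the call; done explicitly here
    return np.float32(1.0) / (np.float32(1.0) + np.exp(-x32))

print(cexpit_f32(0.5), cexpit_f32(0.5).dtype)  # float32 result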

Comment on lines 99 to 101
assert_allclose(loss.inverse_link_function(optimum), y_true, atol=2e-5)
assert_allclose(func(optimum), 0, atol=2e-11)
assert_allclose(get_gradients(y_true, optimum), 0, atol=2e-5)
Member

these are quite big differences!

Member Author

Yes, they are. This PR loses some precision in gradients and hessians, which stems from calling expf instead of exp. This is somehow amplified by the Newton solver; I had to reduce its tol, and the adjusted tolerances in the assertions are a consequence.

The big question is if this has an impact on fitting (convergence and attained training loss).

@NicolasHug
Member

@lorentzenchr could you provide benchmarks over many iterations (ideally in a table or a graph, with stddev ;) )

The Hist-GBDT code is pretty unstable w.r.t. computation time and reporting just one run isn't accurate enough.

Could you also check what lightGBM does?

@lorentzenchr
Member Author

@NicolasHug Thanks for your fast review. I'll provide better benchmarks as soon as I find time.
I'm not yet entirely convinced that we want to have this. But I wanted to show some untapped performance potential.

@ogrisel
Member

ogrisel commented Oct 23, 2020

I doubt that loss gradient/hessian computations are performance critical. This would need to be checked via some profiling (e.g. using https://github.com/emeryberger/scalene or https://github.com/benfred/py-spy), but I wouldn't mind making the code more consistent. However, the loss of precision induced by expf is quite significant...

Could you also check what lightGBM does?

I agree this is a good idea :)

If we decide to keep the gradient values in float32 but upcast locally to float64 for some intermediate computations that are numerically sensitive we should add inline comments to explain that this is intentional.

@lorentzenchr
Member Author

Trying to understand the LightGBM source code, I think they do as we currently do, see here:

Calculate in double precision, i.e. exp(double), then cast to score_t, which is either double or float, see here.

@@ -91,14 +91,14 @@ def fprime2(x: np.ndarray) -> np.ndarray:
        return get_hessians(y_true, x)

    optimum = newton(func, x0=x0, fprime=fprime, fprime2=fprime2,
-                     maxiter=70, tol=5e-6)
+                     maxiter=70, tol=5e-7)
Member

Interesting. I am not sure what we should conclude from the fact that more Newton iterations converge to the true optimum even with a lower-precision gradient computation...

@ogrisel
Member

ogrisel commented Oct 23, 2020

Calculate in double precision, i.e. exp(double), then cast to score_t which is either double or float, see here.

You could use a C/C++ debugger to check what the concrete score_t type is when running LightGBM from the usual Python bindings.

@lorentzenchr
Member Author

lorentzenchr commented Oct 23, 2020

Benchmark results are unclear. Up to 100,000 samples, this PR seems beneficial. For more samples, it seems worse.
[benchmark figure: fit wall time vs. n_samples]
Note that every point is the average over 3 runs. X-axis is in units of 1000.

So far, I have not found a difference in training log loss worth mentioning.

from time import time

import matplotlib.pyplot as plt
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import log_loss

from neurtu import delayed, timeit, Benchmark


def make_benchmark(
    n_features=50,
    ax=None,
    n_samples_range=np.linspace(100, 10_000, num=10, dtype=int),
    repeat=1,
    metric_name="wall_time",
    **kwargs,
):
    """Benchmark HGBT for criterion 'mse' and 'poisson'"""

    dataset_kwargs = dict(
        n_samples=n_samples_range.max(),
        n_features=n_features,
        n_informative=n_features * 2 // 3,
        random_state=42,
    )
    X, y = make_classification(
        n_clusters_per_class=1, **dataset_kwargs
        )

    bench = []
    
    # nicer numbers
    if n_samples_range.max() > 10_000:
        n_samples_scale = 1000
    else:
        n_samples_scale = 1
    
    for n_samples in n_samples_range:
        model = HistGradientBoostingClassifier(**kwargs)
        bench.append(
            delayed(
                model,
                tags={"n_samples": n_samples / n_samples_scale},
            ).fit(X[:n_samples, :n_features], y[:n_samples])
        )
    res = Benchmark(repeat=repeat, **{metric_name: True})(bench)
    if repeat > 1:
        res = res.loc[:, (metric_name, "mean")]
    else:
        res = res.loc[:, metric_name]
    ax = res.plot(marker="o", ax=ax)
    ax.set(
        title=f"n_features={n_features}",
        xlim=(0, n_samples_range.max() / n_samples_scale),
    )
    if metric_name == "wall_time":
        ax.set(ylabel="wall time (s)")
    elif metric_name == "peak_memory":
        ax.set(ylabel="peak memory (MB)")
    if n_samples_scale > 1:
        ax.set(xlabel="n_samples (10³)")
    else:
        ax.set(xlabel="n_samples")
        
    return res, ax


fig, ax = plt.subplots(1, 2, figsize=(12, 4))

n_samples_range = np.r_[np.linspace(1000, 50_000, num=10, dtype=int),
                        np.linspace(100_000, 1000_000, num=5, dtype=int)]

res_f10, _ = make_benchmark(n_features=10, n_samples_range=n_samples_range,
                            ax=ax[0], repeat=3)
res_f100, _ = make_benchmark(n_features=100, n_samples_range=n_samples_range,
                             ax=ax[1], repeat=3)

fig.suptitle("HGBT Binary Classifier"
    + " benchmark on make_classification() data"
)

Results

# columns: n_samples (10³), fit time float32 (this PR) / fit time master
ratio_f10 = np.array([
    [1.000 , 0.373504 / 0.555648],
    [6.444 , 0.424278 / 0.588441],
    [11.888, 0.505906 / 0.777491],
    [17.333, 0.594497 / 0.704212],
    [22.777, 0.625491 / 1.090951],
    [28.222, 0.793513 / 0.981356],
    [33.666, 0.742936 / 0.899285],
    [39.111, 0.903978 / 0.963037],
    [44.555, 0.946626 / 0.987644],
    [50.000, 0.940014 / 1.017219],
    [100.00, 1.410459 / 1.337301],
    [325.00, 3.347485 / 2.849119],
    [550.00, 5.130298 / 4.385706],
    [775.00, 6.545066 / 6.343583],
    [1000.0, 8.782194 / 7.953538],
])

ratio_f100 = np.array([
    [1.000 ,  0.547719 /  1.365256],
    [6.444 ,  1.248298 /  1.743352],
    [11.888,  1.453784 /  1.932066],
    [17.333,  1.730546 /  2.061890],
    [22.777,  1.803699 /  2.202086],
    [28.222,  1.822677 /  2.298813],
    [33.666,  2.045215 /  2.445927],
    [39.111,  2.257485 /  2.536569],
    [44.555,  2.342646 /  2.662033],
    [50.000,  2.556265 /  2.765742],
    [100.00,  3.880260 /  3.999123],
    [325.00, 10.163691 /  9.469649],
    [550.00, 15.635915 / 13.668904],
    [775.00, 20.412055 / 17.654048],
    [1000.0, 24.707967 / 21.811183],
])

fig, ax = plt.subplots(1, 2, figsize=(12, 4))

ax[0].plot(ratio_f10[:, 0], ratio_f10[:, 1], marker="o")
ax[1].plot(ratio_f100[:, 0], ratio_f100[:, 1], marker="o")
for i, n_features in enumerate([10, 100]):
    ax[i].set(
        title=f"n_features={n_features}",
        ylabel="time float32 / time master",
        xlabel="n_samples (10³)",
        xscale="log",
    )
fig.suptitle("Time Ratio float32 / master")

@NicolasHug
Member

Thanks for the benchmarks @lorentzenchr

I think we should explicitly deactivate early stopping for such benchmarks because it may impact the actual number of iterations, especially if there are differences in how the losses are computed: early_stopping=False
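
For example, a sketch of how the benchmark model could be constructed with early stopping disabled (only early_stopping is the relevant parameter here; the rest is arbitrary):

from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(early_stopping=False, random_state=42)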

Also absolute times are worth mentioning: a ratio of 1.2 is probably more significant when fitting time takes hours vs a few ms.

For now I can't say I'm convinced that such a change is needed because it's quite subtle; there's a solid chance we may break/revert it in the future without having a way of catching it, and the benefits are unclear so far.


BTW, if the plan is to improve the hist-GBDTs, one (not-so-low) hanging fruit is to make early stopping on scorers faster: right now, to compute the score at iteration i, we need to re-compute all the predictions from iteration 0 to iteration i. It's really slow but that's what the current design of the scorers forces us to do. It'd be great if we could make that faster by caching the predictions, just like we do when early stopping is checked on the loss.

@lorentzenchr
Member Author

The same with early_stopping=False:
[benchmark figure with early_stopping=False]

Absolute timing of this PR (see also detail section of the benchmark above):
[figure: absolute fit times of this PR]

Comparing training log loss for n_features=20 gives:

  • n_samples = 1e+04: logloss PR/logloss master = 9.99999999e-01
  • n_samples = 1e+05: logloss PR/logloss master = 9.99999997e-01
  • n_samples = 1e+06: logloss PR/logloss master = 9.98733125e-01

How can this PR get better training error?

Conclusion: Benchmarking is tricky. Maybe I'll rerun it.
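
For reference, a hypothetical sketch of how such a log-loss ratio can be computed (model_pr, model_master, X_train and y_train are illustrative names, not objects from the script above):

from sklearn.metrics import log_loss

loss_pr = log_loss(y_train, model_pr.predict_proba(X_train))
loss_master = log_loss(y_train, model_master.predict_proba(X_train))
print(f"logloss PR/logloss master = {loss_pr / loss_master:.8e}")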

@lorentzenchr
Member Author

@NicolasHug Thanks for the tip. If time and climbing expertise permit, I'll have a look at the other not-so-low-hanging fruits 🍎

@ogrisel
Member

ogrisel commented Oct 26, 2020

How can this PR get better training error?

Maybe just luck in the rounding errors. The difference does not seem that big.

What I do not understand is why you observe a speed-up with n_samples > 1e5 and early_stopping=False but a slowdown with early_stopping=True, while the training loss is either the same or even slightly better with your code change.

@ogrisel
Member

ogrisel commented Oct 26, 2020

I would love to see the same benchmark repeated 5 times or more with the mean or trimmed mean time + std error bars to be confident that this change is really an improvement.
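
A minimal repeated-timing sketch, independent of the neurtu-based script above (model, X and y are assumed to come from the benchmark setup):

import numpy as np
from time import perf_counter
from sklearn.base import clone

times = []
for _ in range(5):
    est = clone(model)                # fresh, unfitted estimator for each run
    tic = perf_counter()
    est.fit(X, y)
    times.append(perf_counter() - tic)

print(f"fit time: {np.mean(times):.3f}s ± {np.std(times):.3f}s over {len(times)} runs")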

Base automatically changed from master to main January 22, 2021 10:53
@jjerphan
Member

@lorentzenchr: what is the current status of this PR? Should it be revisited with the recent changes for loss functions?

@lorentzenchr
Member Author

@lorentzenchr: what is the current status of this PR? Should it be revisited with the recent changes for loss functions?

I'm not sure; I think we can close it.
In the new loss function module, the single-data-point functions all work in double precision (as before in HGBT).

  • Therefore, I guess the timings wouldn't change.
  • On top of that, I'm not sure adding 32-bit float support would be that easy without code bloat. (I would be interested to know if I'm mistaken.)

Given what @NicolasHug said:

BTW, if the plan is to improve the hist-GBDTs, one (not-so-low) hanging fruit is to make early stopping on scorers faster: right now, to compute the score at iteration i, we need to re-compute all the predictions from iteration 0 to iteration i. It's really slow but that's what the current design of the scorers forces us to do. It'd be great if we could make that faster by caching the predictions, just like we do when early stopping is checked on the loss.

my conclusion is to close this and focus on the larger bottlenecks instead.
