[WIP] PERF compute float32 grad and hess in HGBT #18659
Conversation
Benchmark

Edit: For better benchmarks, see the comments/posts below.
Running `python scikit-learn/benchmarks/bench_hist_gradient_boosting.py`
| Scikit-Learn Version 0.23 | THIS PR |
| --- | --- |
| Data size: 1000 samples train, 1000 samples test. | Data size: 1000 samples train, 1000 samples test. |
```diff
@@ -153,11 +156,12 @@ def _update_gradients_hessians_categorical_crossentropy(
         int n_samples = raw_predictions.shape[1]
         int k  # class index
         int i  # sample index
-        Y_DTYPE_C sw
+        G_H_DTYPE_C sw
```
why change `sw`?
Otherwise, in lines 187-188, `gradients[k, i] = XXX * sw`, the multiplication on the right-hand side would be a double, which is then downcast to float32 for assignment to the left-hand side. Why do the multiplication in double precision when it is downcast to float anyway?
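Not the Cython code itself, but a small NumPy sketch of the promotion behavior being described (NumPy's scalar promotion mirrors C's usual arithmetic conversions here): a float32 value multiplied by a float64 value is computed in float64 and only truncated back to float32 on assignment.

```python
import numpy as np

gradients = np.zeros(3, dtype=np.float32)   # float32 storage, as in HGBT
sw = np.float64(0.3)                        # a sample weight held as a double

prod = np.float32(1.7) * sw                 # promoted: the product is float64
print(prod.dtype)                           # float64

gradients[0] = prod                         # downcast to float32 only on assignment
print(gradients.dtype, gradients[0])
```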
> Why do the multiplication in double precision when it is downcast to float anyway?

To avoid too many casts, since casting isn't free. Though I don't know enough of the details to understand how costly the casts are, or how and when exactly they happen.

Note that according to this logic, `y_true` in `_update_gradients_hessians_categorical_crossentropy` should be converted to float too, since we do `gradients[i] = p_i - y_true[i]`. (I don't think we should; I'm just noting a discrepancy.)
`y_true` is a memoryview, i.e. `double[:]`. I guess no implicit type conversion is possible there. According to this answer, the underlying type is a struct containing pointers, so providing a float32 array as an argument throws an error instead of casting.
```diff
                 sum_exps += p[i, k]

             for k in range(prediction_dim):
                 p[i, k] /= sum_exps


-cdef inline Y_DTYPE_C _cexpit(const Y_DTYPE_C x) nogil:
+cdef inline G_H_DTYPE_C _cexpit(const G_H_DTYPE_C x) nogil:
```
this is now expecting a float, but what is passed to `_cexpit` is still a double
If I understand C correctly, in particular type casting, the argument of `_cexpit` will be implicitly converted to the type specified in the function definition, which is now float32.
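As a rough illustration (not the Cython code path), ctypes can mimic what happens at the C call boundary: when the declared parameter type is `float`, the double argument is narrowed on the way in, which is the implicit conversion described above. The libm lookup below is an assumption about the platform (a Unix-like system with a loadable math library).

```python
import ctypes
import ctypes.util
import math

# Load the C math library; this lookup assumes a Unix-like platform.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare expf as taking/returning float: the Python double argument is
# implicitly narrowed to float32 at the call, just as C would do.
libm.expf.argtypes = [ctypes.c_float]
libm.expf.restype = ctypes.c_float

libm.exp.argtypes = [ctypes.c_double]
libm.exp.restype = ctypes.c_double

x = -1.2345
print(libm.expf(x))   # float32 result (argument narrowed on the way in)
print(libm.exp(x))    # float64 result
print(math.exp(x))    # Python reference (double precision)
```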
```python
assert_allclose(loss.inverse_link_function(optimum), y_true, atol=2e-5)
assert_allclose(func(optimum), 0, atol=2e-11)
assert_allclose(get_gradients(y_true, optimum), 0, atol=2e-5)
```
these are quite big differences!
Yes, they are. This PR loses some precision in the gradients and hessians. This stems from calling `expf` instead of `exp`. The effect is somehow amplified by the `newton` solver; I had to reduce its tolerance, and the lower tol in the assertions is a consequence. The big question is whether this has an impact on fitting (convergence and attained training loss).
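For a feel of the magnitude involved, a small sketch using NumPy's float32 `exp` as a stand-in for `expf` (not the PR's Cython code): the relative error of a float32 sigmoid is on the order of float32 machine epsilon, i.e. roughly 1e-7, before any amplification by the solver.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=100_000)

# Sigmoid (the inverse link of the binary cross-entropy loss) in both precisions.
sig64 = 1.0 / (1.0 + np.exp(-raw))
sig32 = (1.0 / (1.0 + np.exp(-raw.astype(np.float32)))).astype(np.float64)

rel_err = np.max(np.abs(sig32 - sig64) / np.abs(sig64))
print(rel_err)   # on the order of float32 eps, i.e. ~1e-7
```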
@lorentzenchr could you provide benchmarks over many iterations (ideally in a table or a graph, with stddev ;) )? The Hist-GBDT code is pretty unstable w.r.t. computation time, and reporting just one run isn't accurate enough. Could you also check what LightGBM does?
@NicolasHug Thanks for your fast review. I'll provide better benchmarks as soon as I find time.
I doubt that the loss gradient/hessian computations are performance critical. This would need to be checked via some profiling (e.g. using https://github.com/emeryberger/scalene or https://github.com/benfred/py-spy), but I wouldn't mind making the code more consistent. However, the loss of precision induced by `expf` is quite important...
I agree this is a good idea :) If we decide to keep the gradient values in float32 but upcast locally to float64 for some intermediate computations that are numerically sensitive, we should add inline comments to explain that this is intentional.
```diff
@@ -91,14 +91,14 @@ def fprime2(x: np.ndarray) -> np.ndarray:
         return get_hessians(y_true, x)

     optimum = newton(func, x0=x0, fprime=fprime, fprime2=fprime2,
-                     maxiter=70, tol=5e-6)
+                     maxiter=70, tol=5e-7)
```
Interesting. I am not sure what we should conclude from the fact that more newton iterations converge to the true optimum even with a lower-precision gradient computation...
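One hypothetical way to look at the iteration count, sketched with SciPy (a simplified single-sample setup, not the test's exact code): ask `newton` for `full_output` and compare the number of iterations at the two tolerances.

```python
import numpy as np
from scipy.optimize import newton

# Simplified stand-in for the test: for a single sample with y_true = 0.3, the
# optimal raw prediction of the binary cross-entropy loss is the root of its
# gradient sigmoid(raw) - y_true.
y_true = 0.3


def grad(raw):
    """First derivative of the loss: sigmoid(raw) - y_true."""
    return 1.0 / (1.0 + np.exp(-raw)) - y_true


def hess(raw):
    """Second derivative of the loss: sigmoid(raw) * (1 - sigmoid(raw))."""
    p = 1.0 / (1.0 + np.exp(-raw))
    return p * (1.0 - p)


for tol in (5e-6, 5e-7):
    root, res = newton(grad, x0=0.0, fprime=hess, tol=tol,
                       maxiter=70, full_output=True)
    print(f"tol={tol:g}: root={root:.8f}, iterations={res.iterations}")
```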
You could use a C/C++ debugger to check what is the concrete …
Thanks for the benchmarks @lorentzenchr

I think we should explicitly deactivate early stopping for such benchmarks, because it may impact the actual number of iterations, especially if there are differences in how the losses are computed.

Also, absolute times are worth mentioning: a ratio of 1.2 is probably more significant when fitting takes hours vs a few ms.

For now I can't say I'm convinced that such a change is needed because it's quite subtle; there's a solid chance we may break/revert it in the future without having a way of catching it, and the benefits are unclear so far.

BTW, if the plan is to improve the hist-GBDTs, one (not-so-low) hanging fruit is to make early stopping on scorers faster: right now, to compute the score at iteration i, we need to re-compute all the predictions from iteration 0 to iteration i. It's really slow, but that's what the current design of the scorers forces us to do. It'd be great if we could make that faster by caching the predictions, just like we do when early stopping is checked on the loss.
@NicolasHug Thanks for the tip. If time and climbing expertise permit, I'll have a look at the other not-so-low-hanging fruits 🍎
Maybe just luck in the rounding errors. The difference does not seem that big. What I do not understand is why you observe a speed-up with n_samples > 1e5 and …
I would love to see the same benchmark repeated 5 times or more, with the mean or trimmed mean time + std error bars, to be confident that this change is really an improvement.
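A sketch of the kind of benchmark being asked for, under assumed settings (dataset size, `max_iter`, and the number of repetitions are arbitrary choices): disable early stopping so every run performs the same number of boosting iterations, repeat the fit several times, and report mean ± std.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
# Required at the time of this PR; a no-op in recent scikit-learn versions.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

timings = []
for seed in range(5):  # repeat to get mean +/- std, as requested above
    clf = HistGradientBoostingClassifier(
        max_iter=100,
        early_stopping=False,   # fixed number of boosting iterations
        random_state=seed,
    )
    tic = time.perf_counter()
    clf.fit(X, y)
    timings.append(time.perf_counter() - tic)

print(f"fit time: {np.mean(timings):.3f} +/- {np.std(timings):.3f} s")
```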
@lorentzenchr: what is the current status of this PR? Should it be revisited with the recent changes for loss functions?
I'm not sure; I think we can close it. Given what @NicolasHug said above, my conclusion is to close this and focus on the larger bottlenecks instead.
What does this implement/fix? Explain your changes.
`BaseHistGradientBoosting` uses float32 for storing gradients and hessians. Nevertheless, the computations are done in full float64, and only at the end are the results converted to float32. One can gain some performance, i.e. less fit time, by doing the expensive exp calculations in float32, too.

This PR basically replaces some Cython calls to `double exp(double)` by `float expf(float)`.

Any other comments?

Some test tolerances had to be loosened, too.
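For intuition only, a quick NumPy timing sketch (not the Cython code path) of why float32 exp can be cheaper than float64 exp; the actual gain depends on the libm/SIMD implementation and on memory bandwidth.

```python
import timeit

import numpy as np

x64 = np.random.default_rng(0).normal(size=1_000_000)
x32 = x64.astype(np.float32)

# Time vectorized exp in both precisions over the same values.
t64 = timeit.timeit(lambda: np.exp(x64), number=20)
t32 = timeit.timeit(lambda: np.exp(x32), number=20)
print(f"float64 exp: {t64:.3f}s, float32 exp: {t32:.3f}s")
```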