ENH HGBT histograms in blocks of features #27168
Conversation
Thank you, @lorentzenchr.
I am wondering whether we can improve those already optimized memory access patterns.
Here are a few comments.
for bin_idx in range(n_bins):
    out[feature_idx, bin_idx].sum_gradients = 0.
    out[feature_idx, bin_idx].sum_hessians = 0.
    out[feature_idx, bin_idx].count = 0
Since integral and positive floating-point zeros are encoded with null bits, do you think we could use memset(3)?
Suggested change:
-for bin_idx in range(n_bins):
-    out[feature_idx, bin_idx].sum_gradients = 0.
-    out[feature_idx, bin_idx].sum_hessians = 0.
-    out[feature_idx, bin_idx].count = 0
+# Fill all the fields in `out[feature_idx, :]` with zeros.
+memset(&out[feature_idx, 0], 0, sizeof(hist_struct) * n_bins)
This suggestion also applies to other similar initializations.
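For illustration, here is a minimal, self-contained sketch of that memset idea; the `hist_struct` layout and the function name are assumptions made for this example, not the actual scikit-learn declarations:

```cython
from libc.string cimport memset

# Hypothetical struct layout, mirroring the three fields zeroed in the loop above.
cdef packed struct hist_struct:
    double sum_gradients
    double sum_hessians
    unsigned int count

cdef void zero_feature_histogram(
        hist_struct [:, ::1] out,   # OUT, shape (n_features, n_bins)
        int feature_idx,
        unsigned int n_bins,
) noexcept nogil:
    # Every field is zero when all its bits are zero, so a single memset over
    # the contiguous row replaces the per-bin assignment loop.
    memset(&out[feature_idx, 0], 0, sizeof(hist_struct) * n_bins)
```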
        hist_struct [:, ::1] histograms,  # OUT
        double [:] time_vec,
) noexcept nogil:
    """Compute the histogram for a given feature."""
"""Compute the histogram for a given feature.""" | |
"""Compute the histogram for four features at a time.""" |
# Compute histogram for each features
# Do it for 4 features at once
Suggested change:
-# Compute histogram for each features
-# Do it for 4 features at once
+# Compute the histogram of each feature efficiently (i.e. by groups
+# of 4 features) when it is possible and worth it, that is when no
+# interaction constraints are used and the number of features is
+# at least 4 times as large as the number of threads.
@@ -142,11 +146,23 @@ cdef class HistogramBuilder:
        )
        bint has_interaction_cst = allowed_features is not None
        int n_threads = self.n_threads
        unsigned int n_feature_groups = self.n_features // 4
        double [:] time_vec = np.zeros(shape=(2,), dtype=np.float64)
Why do we use heap allocation here? Can't we allocate on the stack and pass the address using a pointer?
That's more for benchmarking here. I guess in the unlikely case this PR gets merged, we would remove the timing part.
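Still, for illustration, a tiny sketch of the stack-allocation alternative mentioned above (all names here are made up for the example): a fixed-size C array lives on the caller's stack and its address is passed down, with no NumPy allocation involved.

```cython
cdef void _timed_kernel(double *time_vec) noexcept nogil:
    # Hypothetical callee that writes two timings into the caller's buffer.
    time_vec[0] = 0.0
    time_vec[1] = 0.0

cdef void compute_with_stack_timings() noexcept nogil:
    # Stack allocation instead of np.zeros(shape=(2,), dtype=np.float64).
    cdef double time_vec[2]
    _timed_kernel(&time_vec[0])
```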
if n_feature_groups < n_threads:
    n_feature_groups = 0
Suggested change:
 if n_feature_groups < n_threads:
+    # It is not worth parallelizing the computation of histograms on
+    # 4 features at a time in this case.
     n_feature_groups = 0
if has_interaction_cst:
    feature_idx = allowed_features[f_idx]
else:
    feature_idx = f_idx
# Compute histogram of each feature
Suggested change:
-# Compute histogram of each feature
+# Compute the histogram of each feature, one feature at a time
        const G_H_DTYPE_C [::1] ordered_gradients,  # IN
        const G_H_DTYPE_C [::1] ordered_hessians,   # IN
        hist_struct [:, ::1] out,                   # OUT
        unsigned int n_bins=256,
Nit: do we want to strictly require the number of bins, i.e. drop the default?
Suggested change:
-        unsigned int n_bins=256,
+        unsigned int n_bins,
# operations and perform them all at once. As 2 points may both update the same
# bin, SIMD is not possible here.
That's right. Moreover, we use Arrays of Structures (namely arrays of `hist_struct`) for histograms, so SIMD is not possible even if they were to update different values.
Suggested change:
 # operations and perform them all at once. As 2 points may both update the same
-# bin, SIMD is not possible here.
+# bin and updated values aren't contiguous in memory, SIMD is impossible here.
Naive question: could using Structures of Arrays for histograms help increase parallelization (both at the instruction level and with SIMD) and reduce false sharing?
Were there any design decisions in this regard, @NicolasHug? 🙂
If SIMD isn't possible, what is that "instruction-level parallelism" that's happening then? (I thought that was SIMD.)
> Were there any design decisions in this regard, @NicolasHug? 🙂

IIRC the "design decision" was that we had one implementation for histogram building, saw that LightGBM did it this way (the way it's currently done on `main`), and we observed a massive speed-up when we replicated that. As for array of structs vs struct of arrays: I didn't consider the alternative (I was struggling enough juggling with a bunch of undocumented Cython features).
Basically, anything you can think of is probably worth exploring.
[ILP](https://en.wikipedia.org/wiki/Instruction-level_parallelism) vs SIMD: The access pattern of the bins prohibits SIMD because we could write to the same bin within the SIMD minibatch (or however it is called). Histograms and SIMD are just not best friends.
Struct of Arrays (SoA) vs Array of Structs (AoS): I played with that without noticing much difference. We would need SoA for SIMD, but you've already read the point above...
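To make the AoS/SoA distinction concrete, here is a rough sketch of the two layouts (declarations only; the field names follow the snippets earlier in this thread and are otherwise assumptions):

```cython
# Array of Structs (current layout): the three fields of one bin are adjacent,
# but sum_gradients of consecutive bins are strided by sizeof(hist_struct),
# i.e. the non-contiguity mentioned above. A histogram is then e.g. a
# `hist_struct [:, ::1]` memoryview of shape (n_features, n_bins).
cdef packed struct hist_struct:
    double sum_gradients
    double sum_hessians
    unsigned int count

# Struct of Arrays (the alternative discussed): one contiguous buffer per
# field, so consecutive bins of the same field are adjacent in memory.
cdef struct hist_soa:
    double *sum_gradients     # n_bins contiguous doubles
    double *sum_hessians      # n_bins contiguous doubles
    unsigned int *count       # n_bins contiguous counts
```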
We could try data-structure profiling, using those recent changes made to perf(1), to understand access patterns.
from libc.time cimport time as libc_time
from libc.time cimport time_t, difftime
👍
Reference Issues/PRs
None
What does this implement/fix? Explain your changes.
This PR adds Cython histogram functions that calculate the histograms of 4 features at once. This way, each read of the gradients and hessians is reused to update 4 histograms instead of one. This improves performance in particular for single-threaded environments, see below.
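As a much-reduced illustration of the idea (the names, dtypes, and signature below are assumptions for this sketch, not the PR's actual kernels): the sample loop reads each gradient once and updates the histograms of 4 binned feature columns, so the 4 accumulations are independent of each other and can overlap at the instruction level.

```cython
cimport cython

ctypedef unsigned char X_BINNED_DTYPE_C   # assumed binned-data dtype
ctypedef float G_H_DTYPE_C                # assumed gradient/hessian dtype

cdef packed struct hist_struct:
    double sum_gradients
    double sum_hessians
    unsigned int count

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void _build_histograms_4_features(
        const X_BINNED_DTYPE_C [::1] binned_feature_0,  # IN
        const X_BINNED_DTYPE_C [::1] binned_feature_1,  # IN
        const X_BINNED_DTYPE_C [::1] binned_feature_2,  # IN
        const X_BINNED_DTYPE_C [::1] binned_feature_3,  # IN
        const G_H_DTYPE_C [::1] ordered_gradients,      # IN
        hist_struct [:, ::1] out,                       # OUT, 4 rows of n_bins
) noexcept nogil:
    cdef:
        int i
        int n_samples = binned_feature_0.shape[0]
        G_H_DTYPE_C g

    for i in range(n_samples):
        g = ordered_gradients[i]
        # One read of `g` feeds 4 histogram updates; the 4 target rows differ,
        # so the additions do not depend on each other.
        out[0, binned_feature_0[i]].sum_gradients += g
        out[1, binned_feature_1[i]].sum_gradients += g
        out[2, binned_feature_2[i]].sum_gradients += g
        out[3, binned_feature_3[i]].sum_gradients += g
        out[0, binned_feature_0[i]].count += 1
        out[1, binned_feature_1[i]].count += 1
        out[2, binned_feature_2[i]].count += 1
        out[3, binned_feature_3[i]].count += 1
```

A single-feature kernel has the same structure but only one update per gradient read, which is why the grouped variant mainly helps the single-threaded runs reported below.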
Any other comments?
It is unclear to me why the multi-threaded runs do not improve noticeably. This may be related to #14306.
Benchmark:

    python bench_hist_gradient_boosting_higgsboson.py --n-trees 100

Note that this benchmark usually shows quite some fluctuation in the reported timings.

    export OMP_NUM_THREADS=1