ENH HGBT histograms in blocks of features #27168

Open · wants to merge 12 commits into base: main
Conversation

lorentzenchr (Member)

Reference Issues/PRs

None

What does this implement/fix? Explain your changes.

This PR adds Cython histogram functions that compute the histograms of 4 features at once. This way, each gradient and hessian value loaded from memory can be reused across the block of 4 features. This improves performance in particular in single-threaded environments, see the benchmark below.
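To make the blocking idea concrete, here is a pure-Python/NumPy sketch of the blocked kernel, not the PR's actual Cython code. The function name and signature are illustrative; only the structure (one gradient/hessian load reused for 4 feature histograms) reflects the PR's idea.

```python
import numpy as np

def histograms_4_features(binned, gradients, hessians, n_bins=256):
    """Sketch: for each sample, load the gradient/hessian pair once and
    reuse it to update the histograms of 4 features."""
    n_samples, n_features = binned.shape
    assert n_features == 4
    sum_g = np.zeros((4, n_bins))
    sum_h = np.zeros((4, n_bins))
    count = np.zeros((4, n_bins), dtype=np.int64)
    for i in range(n_samples):
        g, h = gradients[i], hessians[i]  # loaded once per sample
        for f in range(4):                # reused for all 4 features
            b = binned[i, f]
            sum_g[f, b] += g
            sum_h[f, b] += h
            count[f, b] += 1
    return sum_g, sum_h, count
```

In the one-feature-at-a-time variant, the same gradient/hessian values are re-read once per feature; the blocked loop amortizes those loads over 4 histogram updates.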

Any other comments?

It is unclear to me why the multi-threaded runs do not improve noticeably. This may be related to #14306.

benchmark

python bench_hist_gradient_boosting_higgsboson.py --n-trees 100
Note that this benchmark usually shows quite a bit of fluctuation in the reported timings.

  • Single-threaded (export OMP_NUM_THREADS=1)
    • This PR
      Fit 100 trees in 153.223 s, (3100 total leaves)
      Time spent computing histograms: 84.157s
      Time spent finding best splits:  0.382s
      Time spent applying splits:      25.699s
      Time spent predicting:           3.624s
      fitted in 153.408s
      predicted in 24.846s, ROC AUC: 0.8228, ACC: 0.7415
      
    • MAIN
      Fit 100 trees in 172.712 s, (3100 total leaves)
      Time spent computing histograms: 105.948s
      Time spent finding best splits:  0.396s
      Time spent applying splits:      25.401s
      Time spent predicting:           3.687s
      fitted in 172.898s
      predicted in 25.716s, ROC AUC: 0.8228, ACC: 0.7415
      
  • Multi-threaded (4 threads)
    • This PR
      Time spent computing histograms: 38.097s
      Time spent finding best splits:  0.233s
      Time spent applying splits:      8.211s
      Time spent predicting:           1.948s
      fitted in 66.434s
      
    • MAIN
      Time spent computing histograms: 38.640s
      Time spent finding best splits:  0.242s
      Time spent applying splits:      8.316s
      Time spent predicting:           1.986s
      fitted in 66.474s
      


github-actions bot commented Aug 25, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e873abf.

@jjerphan (Member) left a comment:
Thank you, @lorentzenchr.

I am wondering whether we can improve those already optimized memory access patterns.

Here are a few comments.

Comment on lines +461 to +464
for bin_idx in range(n_bins):
out[feature_idx, bin_idx].sum_gradients = 0.
out[feature_idx, bin_idx].sum_hessians = 0.
out[feature_idx, bin_idx].count = 0
Member:

Since integer zeros and positive floating-point zeros are both encoded as all-zero bits, do you think we could use memset(3)?

Suggested change
for bin_idx in range(n_bins):
out[feature_idx, bin_idx].sum_gradients = 0.
out[feature_idx, bin_idx].sum_hessians = 0.
out[feature_idx, bin_idx].count = 0
# Fill all the fields in `out[features_idx, :]` with zeros.
memset(&out[feature_idx, 0], 0, sizeof(hist_struct) * n_bins)

This suggestion also applies to similar other initializations.
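The bit-level argument behind this suggestion can be checked with a small NumPy sketch (a Python analogy of the C memset, not the PR's code; the field names and types mirror scikit-learn's histogram struct but are written here as an assumption for illustration):

```python
import numpy as np

# Hypothetical dtype mirroring the layout of hist_struct
# (sum_gradients, sum_hessians, count) -- an assumption for illustration.
hist_dtype = np.dtype([
    ("sum_gradients", np.float64),
    ("sum_hessians", np.float64),
    ("count", np.uint32),
])

out = np.empty((2, 8), dtype=hist_dtype)
# Zeroing the raw bytes (what memset(3) does in C) is equivalent to the
# field-by-field loop, because integer 0 and +0.0 are both all-zero bits.
out.view(np.uint8).fill(0)

assert int(out["count"].sum()) == 0
assert float(np.abs(out["sum_gradients"]).sum()) == 0.0
```

This equivalence would not hold for types whose zero is not all-zero bits (e.g. some exotic float formats), which is exactly the caveat the comment raises.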

hist_struct [:, ::1] histograms, # OUT
double [:] time_vec,
) noexcept nogil:
"""Compute the histogram for a given feature."""
Member:

Suggested change
"""Compute the histogram for a given feature."""
"""Compute the histogram for four features at a time."""

Comment on lines +185 to +186
# Compute histogram for each features
# Do it for 4 features at once
Member:

Suggested change
# Compute histogram for each features
# Do it for 4 features at once
# Compute the histogram of each feature efficiently (i.e. by groups
# of 4 features) when it is possible and worth it, that is when no
# interaction constraints are used and the number of features is
# at least 4 times as large as the number of threads.

@@ -142,11 +146,23 @@ cdef class HistogramBuilder:
)
bint has_interaction_cst = allowed_features is not None
int n_threads = self.n_threads
unsigned int n_feature_groups = self.n_features // 4
double [:] time_vec = np.zeros(shape=(2,), dtype=np.float64)
Member:

Why do we use heap allocation here? Can't we allocate on the stack and pass the address via a pointer?

Member Author:

That's more for benchmarking here. I guess in the unlikely case this PR gets merged, we would remove the timing part.

Comment on lines +160 to +161
if n_feature_groups < n_threads:
n_feature_groups = 0
Member:

Suggested change
if n_feature_groups < n_threads:
n_feature_groups = 0
if n_feature_groups < n_threads:
# It is not worth parallelizing the computation of histograms on
# 4 features at a time in this case.
n_feature_groups = 0
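The grouping-and-fallback logic discussed above can be sketched in Python (a simplified sketch; the function name is hypothetical and interaction constraints are ignored here):

```python
def split_feature_groups(n_features, n_threads):
    """Sketch of the dispatch logic: process features in blocks of 4
    when there are enough groups to keep all threads busy; remaining
    features are handled one at a time."""
    n_feature_groups = n_features // 4
    if n_feature_groups < n_threads:
        # Not worth parallelizing the blocked kernel: fall back to
        # one-feature-at-a-time for all features.
        n_feature_groups = 0
    grouped = [range(4 * g, 4 * g + 4) for g in range(n_feature_groups)]
    singles = list(range(4 * n_feature_groups, n_features))
    return grouped, singles
```

For example, with 10 features and 1 thread this yields two blocks of 4 plus two leftover features; with 10 features and 4 threads the blocked path is skipped entirely.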

if has_interaction_cst:
feature_idx = allowed_features[f_idx]
else:
feature_idx = f_idx
# Compute histogram of each feature
Member:

Suggested change
# Compute histogram of each feature
# Compute the histogram of each feature, one feature at a time

const G_H_DTYPE_C [::1] ordered_gradients, # IN
const G_H_DTYPE_C [::1] ordered_hessians, # IN
hist_struct [:, ::1] out, # OUT
unsigned int n_bins=256,
Member:

Nit: do we want a default here, or should the number of bins always be passed explicitly?

Suggested change
unsigned int n_bins=256,
unsigned int n_bins,

Comment on lines +40 to +41
# operations and perform them all at once. As 2 points may both update the same
# bin, SIMD is not possible here.
Member:

That's right. Moreover, we use Arrays of Structures (namely arrays of hist_struct) for histograms, so SIMD is not possible even if the points were to update different bins.

Suggested change
# operations and perform them all at once. As 2 points may both update the same
# bin, SIMD is not possible here.
# operations and perform them all at once. As 2 points may both update the same
# bin and updated values aren't contiguous in memory, SIMD is impossible here.

Naive question: could using Structures of Arrays for histograms help increase parallelism (both at the instruction level and with SIMD) and reduce false sharing?

Were there any design decisions in this regard, @NicolasHug? 🙂
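The scatter conflict that rules out SIMD here can be reproduced in NumPy (a Python analogy of the Cython kernel, not the actual code):

```python
import numpy as np

bins = np.array([3, 3, 5])     # two samples fall into the same bin 3
grads = np.array([1.0, 2.0, 4.0])

hist = np.zeros(8)
hist[bins] += grads            # "vectorized" update: the two writes to
                               # bin 3 collide and one of them is lost
assert hist[3] == 2.0          # not 3.0!

hist = np.zeros(8)
np.add.at(hist, bins, grads)   # unbuffered: conflicting updates are
                               # serialized, like the scalar loop
assert hist[3] == 3.0
```

A SIMD gather/update/scatter over a batch of samples has the same hazard as the buffered fancy-indexing line: two lanes targeting the same bin would lose one update.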

Member:

If SIMD isn't possible, what is that "instruction-level parallelism" that's happening then? (I thought that was SIMD.)

Were there any design decisions in this regard, @NicolasHug? 🙂

IIRC the "design decision" was: we had one implementation for histogram building, saw that LightGBM did it this way (the way it's currently done on main), and observed a massive speed-up when we replicated it. As for array of structs vs struct of arrays: I didn't consider the alternative (I was struggling enough juggling a bunch of undocumented Cython features).

Basically, anything you can think of is probably worth exploring.

Member Author:

[ILP](https://en.wikipedia.org/wiki/Instruction-level_parallelism) vs SIMD: the access pattern of the bins prohibits SIMD because we could write to the same bin within a SIMD batch. Histograms and SIMD are just not best friends.

Struct of Arrays (SoA) vs Array of Structs (AoS): I played with that without noticing much difference. We would need SoA for SIMD, but you've already read the point above...
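The AoS vs SoA memory layouts under discussion can be contrasted with a small NumPy sketch (the dtype fields mirror the histogram struct but the exact, unpadded layout here is an assumption for illustration):

```python
import numpy as np

n_bins = 256

# Array of Structs (the current layout): one record per bin, fields
# interleaved in memory, so consecutive bins' sum_gradients values are
# one struct size (20 bytes in this unpadded sketch) apart.
aos = np.zeros(n_bins, dtype=[("sum_gradients", np.float64),
                              ("sum_hessians", np.float64),
                              ("count", np.uint32)])

# Struct of Arrays: each field contiguous -- a prerequisite for SIMD
# loads/stores, though the scatter pattern above still blocks SIMD.
soa_g = np.zeros(n_bins)
soa_h = np.zeros(n_bins)
soa_c = np.zeros(n_bins, dtype=np.uint32)

assert aos.dtype.itemsize == 20
assert aos["sum_gradients"].strides == (20,)  # strided, interleaved
assert soa_g.strides == (8,)                  # dense, contiguous
```

This makes the trade-off visible: SoA gives contiguous per-field access, while AoS keeps all three updates for one bin on (at most) two cache lines, which suits the scalar scatter pattern.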

Member:

We could try data-structure profiling, using the recent changes made to perf(1), to understand the access patterns.

Comment on lines +15 to +16
from libc.time cimport time as libc_time
from libc.time cimport time_t, difftime
Member:

👍

@lorentzenchr lorentzenchr changed the title ENH HGBT histogram feature wise ENH HGBT histograms in blocks of features Mar 20, 2024
3 participants