ENH HGBT histograms in blocks of features #27168

Open · wants to merge 12 commits into base: main
Conversation

lorentzenchr (Member)

Reference Issues/PRs

None

What does this implement/fix? Explain your changes.

This PR adds Cython histogram functions that compute the histograms of 4 features at once. This way, each gradient and hessian value loaded from memory can be reused across the block of 4 features. This improves performance in particular in single-threaded environments, see the benchmark below.
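To make the blocking idea concrete, here is a pure-Python/NumPy sketch of the blocked kernel, not the PR's actual Cython code. The function name and signature are illustrative; only the structure (one gradient/hessian load reused for 4 feature histograms) reflects the PR's idea.

```python
import numpy as np

def histograms_4_features(binned, gradients, hessians, n_bins=256):
    """Sketch: for each sample, load the gradient/hessian pair once and
    reuse it to update the histograms of 4 features."""
    n_samples, n_features = binned.shape
    assert n_features == 4
    sum_g = np.zeros((4, n_bins))
    sum_h = np.zeros((4, n_bins))
    count = np.zeros((4, n_bins), dtype=np.int64)
    for i in range(n_samples):
        g, h = gradients[i], hessians[i]  # loaded once per sample
        for f in range(4):                # reused for all 4 features
            b = binned[i, f]
            sum_g[f, b] += g
            sum_h[f, b] += h
            count[f, b] += 1
    return sum_g, sum_h, count
```

In the one-feature-at-a-time variant, the same gradient/hessian values are re-read once per feature; the blocked loop amortizes those loads over 4 histogram updates.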

Any other comments?

It is unclear to me why the multi-threaded runs do not improve noticeably. This may be related to #14306.

benchmark

python bench_hist_gradient_boosting_higgsboson.py --n-trees 100
Note that this benchmark usually shows quite a bit of fluctuation in the reported timings.

  • Single-threaded (export OMP_NUM_THREADS=1)
    • This PR
      Fit 100 trees in 153.223 s, (3100 total leaves)
      Time spent computing histograms: 84.157s
      Time spent finding best splits:  0.382s
      Time spent applying splits:      25.699s
      Time spent predicting:           3.624s
      fitted in 153.408s
      predicted in 24.846s, ROC AUC: 0.8228, ACC: 0.7415
      
    • MAIN
      Fit 100 trees in 172.712 s, (3100 total leaves)
      Time spent computing histograms: 105.948s
      Time spent finding best splits:  0.396s
      Time spent applying splits:      25.401s
      Time spent predicting:           3.687s
      fitted in 172.898s
      predicted in 25.716s, ROC AUC: 0.8228, ACC: 0.7415
      
  • Multi-threaded (4 threads)
    • This PR
      Time spent computing histograms: 38.097s
      Time spent finding best splits:  0.233s
      Time spent applying splits:      8.211s
      Time spent predicting:           1.948s
      fitted in 66.434s
      
    • MAIN
      Time spent computing histograms: 38.640s
      Time spent finding best splits:  0.242s
      Time spent applying splits:      8.316s
      Time spent predicting:           1.986s
      fitted in 66.474s
      


github-actions bot commented Aug 25, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e873abf.

@jjerphan (Member) left a comment:
Thank you, @lorentzenchr.

I am wondering whether we can improve those already optimized memory access patterns.

Here are a few comments.

Comment on lines +461 to +464
for bin_idx in range(n_bins):
out[feature_idx, bin_idx].sum_gradients = 0.
out[feature_idx, bin_idx].sum_hessians = 0.
out[feature_idx, bin_idx].count = 0
Member:

Since integer zeros and positive floating-point zeros are both encoded as all-zero bits, do you think we could use memset(3)?

Suggested change
for bin_idx in range(n_bins):
out[feature_idx, bin_idx].sum_gradients = 0.
out[feature_idx, bin_idx].sum_hessians = 0.
out[feature_idx, bin_idx].count = 0
# Fill all the fields in `out[features_idx, :]` with zeros.
memset(&out[feature_idx, 0], 0, sizeof(hist_struct) * n_bins)

This suggestion also applies to similar other initializations.
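The bit-level argument behind this suggestion can be checked with a small NumPy sketch (a Python analogy of the C memset, not the PR's code; the field names and types mirror scikit-learn's histogram struct but are written here as an assumption for illustration):

```python
import numpy as np

# Hypothetical dtype mirroring the layout of hist_struct
# (sum_gradients, sum_hessians, count) -- an assumption for illustration.
hist_dtype = np.dtype([
    ("sum_gradients", np.float64),
    ("sum_hessians", np.float64),
    ("count", np.uint32),
])

out = np.empty((2, 8), dtype=hist_dtype)
# Zeroing the raw bytes (what memset(3) does in C) is equivalent to the
# field-by-field loop, because integer 0 and +0.0 are both all-zero bits.
out.view(np.uint8).fill(0)

assert int(out["count"].sum()) == 0
assert float(np.abs(out["sum_gradients"]).sum()) == 0.0
```

This equivalence would not hold for types whose zero is not all-zero bits (e.g. some exotic float formats), which is exactly the caveat the comment raises.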

hist_struct [:, ::1] histograms, # OUT
double [:] time_vec,
) noexcept nogil:
"""Compute the histogram for a given feature."""
Member:

Suggested change
"""Compute the histogram for a given feature."""
"""Compute the histogram for four features at a time."""

Comment on lines +185 to +186
# Compute histogram for each features
# Do it for 4 features at once
Member:

Suggested change
# Compute histogram for each features
# Do it for 4 features at once
# Compute the histogram of each feature efficiently (i.e. by groups
# of 4 features) when it is possible and worth it, that is when no
# interaction constraints are used and the number of features is
# at least 4 times as large as the number of threads.

@@ -142,11 +146,23 @@ cdef class HistogramBuilder:
)
bint has_interaction_cst = allowed_features is not None
int n_threads = self.n_threads
unsigned int n_feature_groups = self.n_features // 4
double [:] time_vec = np.zeros(shape=(2,), dtype=np.float64)
Member:

Why do we use heap allocation here? Can't we allocate on the stack and pass the address via a pointer?

Member Author:

That's more for benchmarking here. I guess in the unlikely case this PR gets merged, we would remove the timing part.

Comment on lines +160 to +161
if n_feature_groups < n_threads:
n_feature_groups = 0
Member:

Suggested change
if n_feature_groups < n_threads:
n_feature_groups = 0
if n_feature_groups < n_threads:
# It is not worth parallelizing the computation of histograms on
# 4 features at a time in this case.
n_feature_groups = 0
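The grouping-and-fallback logic discussed above can be sketched in Python (a simplified sketch; the function name is hypothetical and interaction constraints are ignored here):

```python
def split_feature_groups(n_features, n_threads):
    """Sketch of the dispatch logic: process features in blocks of 4
    when there are enough groups to keep all threads busy; remaining
    features are handled one at a time."""
    n_feature_groups = n_features // 4
    if n_feature_groups < n_threads:
        # Not worth parallelizing the blocked kernel: fall back to
        # one-feature-at-a-time for all features.
        n_feature_groups = 0
    grouped = [range(4 * g, 4 * g + 4) for g in range(n_feature_groups)]
    singles = list(range(4 * n_feature_groups, n_features))
    return grouped, singles
```

For example, with 10 features and 1 thread this yields two blocks of 4 plus two leftover features; with 10 features and 4 threads the blocked path is skipped entirely.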

if has_interaction_cst:
feature_idx = allowed_features[f_idx]
else:
feature_idx = f_idx
# Compute histogram of each feature
Member:

Suggested change
# Compute histogram of each feature
# Compute the histogram of each feature, one feature at a time

const G_H_DTYPE_C [::1] ordered_gradients, # IN
const G_H_DTYPE_C [::1] ordered_hessians, # IN
hist_struct [:, ::1] out, # OUT
unsigned int n_bins=256,
Member:

Nit: do we want a default here, or should the number of bins always be passed explicitly?

Suggested change
unsigned int n_bins=256,
unsigned int n_bins,

Comment on lines +40 to +41
# operations and perform them all at once. As 2 points may both update the same
# bin, SIMD is not possible here.
Member:

That's right. Moreover, we use Arrays of Structures (namely arrays of hist_struct) for histograms, so SIMD is not possible even if the points were to update different bins.

Suggested change
# operations and perform them all at once. As 2 points may both update the same
# bin, SIMD is not possible here.
# operations and perform them all at once. As 2 points may both update the same
# bin and updated values aren't contiguous in memory, SIMD is impossible here.

Naive question: could using Structures of Arrays for histograms help increase parallelism (both at the instruction level and with SIMD) and reduce false sharing?

Were there any design decisions in this regard, @NicolasHug? 🙂
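The scatter conflict that rules out SIMD here can be reproduced in NumPy (a Python analogy of the Cython kernel, not the actual code):

```python
import numpy as np

bins = np.array([3, 3, 5])     # two samples fall into the same bin 3
grads = np.array([1.0, 2.0, 4.0])

hist = np.zeros(8)
hist[bins] += grads            # "vectorized" update: the two writes to
                               # bin 3 collide and one of them is lost
assert hist[3] == 2.0          # not 3.0!

hist = np.zeros(8)
np.add.at(hist, bins, grads)   # unbuffered: conflicting updates are
                               # serialized, like the scalar loop
assert hist[3] == 3.0
```

A SIMD gather/update/scatter over a batch of samples has the same hazard as the buffered fancy-indexing line: two lanes targeting the same bin would lose one update.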

Member:

If SIMD isn't possible, what is that "instruction-level parallelism" that's happening then? (I thought that was SIMD.)

Were there any design decisions in this regard, @NicolasHug? 🙂

IIRC the "design decision" was: we had one implementation for histogram building, saw that LightGBM did it this way (the way it's currently done on main), and observed a massive speed-up when we replicated it. As for array of structs vs struct of arrays: I didn't consider the alternative (I was struggling enough juggling a bunch of undocumented Cython features).

Basically, anything you can think of is probably worth exploring.

Member Author:

[ILP](https://en.wikipedia.org/wiki/Instruction-level_parallelism) vs SIMD: the access pattern of the bins prohibits SIMD because we could write to the same bin within a SIMD batch. Histograms and SIMD are just not best friends.

Struct of Arrays (SoA) vs Array of Structs (AoS): I played with that without noticing much difference. We would need SoA for SIMD, but you've already read the point above...
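The AoS vs SoA memory layouts under discussion can be contrasted with a small NumPy sketch (the dtype fields mirror the histogram struct but the exact, unpadded layout here is an assumption for illustration):

```python
import numpy as np

n_bins = 256

# Array of Structs (the current layout): one record per bin, fields
# interleaved in memory, so consecutive bins' sum_gradients values are
# one struct size (20 bytes in this unpadded sketch) apart.
aos = np.zeros(n_bins, dtype=[("sum_gradients", np.float64),
                              ("sum_hessians", np.float64),
                              ("count", np.uint32)])

# Struct of Arrays: each field contiguous -- a prerequisite for SIMD
# loads/stores, though the scatter pattern above still blocks SIMD.
soa_g = np.zeros(n_bins)
soa_h = np.zeros(n_bins)
soa_c = np.zeros(n_bins, dtype=np.uint32)

assert aos.dtype.itemsize == 20
assert aos["sum_gradients"].strides == (20,)  # strided, interleaved
assert soa_g.strides == (8,)                  # dense, contiguous
```

This makes the trade-off visible: SoA gives contiguous per-field access, while AoS keeps all three updates for one bin on (at most) two cache lines, which suits the scalar scatter pattern.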

Member:

We could try data-structure profiling, using the recent changes made to perf(1), to understand the access patterns.

Comment on lines +15 to +16
from libc.time cimport time as libc_time
from libc.time cimport time_t, difftime
Member:

👍

@lorentzenchr lorentzenchr changed the title ENH HGBT histogram feature wise ENH HGBT histograms in blocks of features Mar 20, 2024
3 participants