Skip to content

Conversation

lorentzenchr
Copy link
Member

Reference Issues/PRs

None

What does this implement/fix? Explain your changes.

This PR improves the (compute/time) performance of HGBT a little bit by reducing a sum over n_samples in each new root node to a sum over n_bins.

Any other comments?

This also makes the Cython function sum_parralel obsolete.

Copy link

github-actions bot commented Feb 21, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 39f93b4. Link to the linter CI: here

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codewise, I think this looks better.

Do you have a quick benchmark to show the slight improvement?

@lorentzenchr
Copy link
Member Author

Do you have a quick benchmark to show the slight improvement?

Some very rough benchmark numbers running python benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-trees 100

main branch

Training set with 8800000 records with 28 features.
Fitting a sklearn model...
Binning 1.971 GB of training data: 4.917 s
Fitting gradient boosted rounds:
Fit 100 trees in 62.501 s, (3100 total leaves)
Time spent computing histograms: 36.928s
Time spent finding best splits:  0.207s
Time spent applying splits:      8.364s
Time spent predicting:           1.833s
fitted in 62.696s
predicted in 7.975s, ROC AUC: 0.8228, ACC: 0.7415

PR branch

Training set with 8800000 records with 28 features.
Fitting a sklearn model...
Binning 1.971 GB of training data: 4.751 s
Fitting gradient boosted rounds:
Fit 100 trees in 60.776 s, (3100 total leaves)
Time spent computing histograms: 36.435s
Time spent finding best splits:  0.204s
Time spent applying splits:      8.070s
Time spent predicting:           1.888s
fitted in 60.970s
predicted in 8.478s, ROC AUC: 0.8228, ACC: 0.7415

While the difference is within the variation, running several times confirms the slight difference.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thomasjpfan thomasjpfan added the Waiting for Second Reviewer First reviewer is done, need a second one! label Feb 23, 2025
Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @lorentzenchr

I just have one question, otherwise looks good.

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you @lorentzenchr

@OmarManzoor OmarManzoor merged commit ada9947 into scikit-learn:main Feb 25, 2025
37 checks passed
@lorentzenchr lorentzenchr deleted the hgbt_init_root_node branch February 25, 2025 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants