ENH make initial binning in HGBT parallel #28064
Conversation
```diff
@@ -6,6 +6,7 @@
 approximately the same number of samples.
 """
 # Author: Nicolas Hug
+import concurrent.futures
```
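For context, here is a hedged sketch of the pattern this import enables in `_BinMapper.fit`; `find_thresholds` and `fit_thresholds` are illustrative stand-ins, not the PR's actual code:

```python
# Sketch of the discussed pattern: map per-column threshold finding over a
# thread pool. NumPy releases the GIL inside sort/percentile, so the
# threads can genuinely overlap.
import concurrent.futures

import numpy as np


def find_thresholds(col_data, max_bins=256):
    # Stand-in for the per-feature work: interior percentile midpoints.
    percentiles = np.linspace(0, 100, num=max_bins + 1)[1:-1]
    return np.percentile(col_data, percentiles, method="midpoint")


def fit_thresholds(X, n_threads=4):
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
        # executor.map over X.T applies the function to each column.
        return list(executor.map(find_thresholds, X.T))


rng = np.random.default_rng(0)
thresholds = fit_thresholds(rng.normal(size=(100_000, 20)))
```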
> Is there a reason we never use a ThreadPoolExecutor?

I think loky is preferred and used as a back-end via joblib instead.
But this here uses `ThreadPoolExecutor`, not `ProcessPoolExecutor`.
I see. I think @ogrisel is much more knowledgeable than me and better placed to make an educated decision.
This pattern is interesting; I wonder whether we could abstract it and reuse it in other places.
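One possible shape for such an abstraction; `threaded_map` is a hypothetical helper, not an existing scikit-learn utility:

```python
# Hypothetical helper abstracting the ThreadPoolExecutor pattern so that
# other estimators could reuse it. A sketch only.
import concurrent.futures


def threaded_map(func, items, n_threads):
    """Apply func to each item on up to n_threads Python threads."""
    if n_threads == 1:
        # Avoid any pool overhead in the sequential case.
        return [func(item) for item in items]
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
        return list(executor.map(func, items))
```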
I confirm that loky is only useful for process-based parallelism. For Python-level thread-based parallelism, `ThreadPoolExecutor` is perfectly fine.
We traditionally used joblib with the "threading" backend for this kind of thread-based parallelism. Internally, joblib's "threading" backend runs the tasks on plain Python threads as well. In cases where we want to hard-code the use of threads, I think calling `ThreadPoolExecutor` directly is fine.
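For reference, the two spellings side by side; both run the per-column work on Python threads, so they behave similarly for GIL-releasing NumPy operations (a sketch, not a benchmark):

```python
import concurrent.futures

import numpy as np
from joblib import Parallel, delayed

X = np.random.default_rng(0).normal(size=(10_000, 8))

# joblib spelling: threads via the "threading" backend.
res_joblib = Parallel(n_jobs=4, backend="threading")(
    delayed(np.median)(col) for col in X.T
)

# Direct spelling with concurrent.futures.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    res_executor = list(executor.map(np.median, X.T))

assert res_joblib == res_executor
```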
Could you please post the results of a quick ad-hoc timeit for that step alone, to quantify the speed-up, out of curiosity?
Note: prior to making the code more complex for parallelization, maybe we should investigate optimizing the single-threaded variant: at the moment we sort each (subsampled) column's data twice, once in `np.unique` and once in `np.percentile`.
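To illustrate the single-sort idea: sort once, then derive both the distinct values and the quantiles from the already-sorted array. This is a hypothetical rewrite under the assumption that matching `np.percentile`'s "midpoint" method is enough, not the current implementation:

```python
# Hypothetical single-sort variant of per-column threshold finding.
import numpy as np


def thresholds_single_sort(col_data, max_bins=256):
    sorted_data = np.sort(col_data)  # the only sort
    # Unique values without re-sorting: keep entries that differ from
    # their left neighbor.
    keep = np.concatenate(([True], sorted_data[1:] != sorted_data[:-1]))
    distinct = sorted_data[keep]
    if len(distinct) <= max_bins:
        return (distinct[:-1] + distinct[1:]) * 0.5
    # Quantiles by direct indexing into the sorted data, mimicking
    # np.percentile(..., method="midpoint").
    idx = np.linspace(0, len(sorted_data) - 1, num=max_bins + 1)[1:-1]
    lo = sorted_data[np.floor(idx).astype(np.intp)]
    hi = sorted_data[np.ceil(idx).astype(np.intp)]
    return (lo + hi) * 0.5
```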
Edit: update, long after the merge of #28102 (commit 9c5e16d):
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper

n_samples, n_features = 200_000, 20
n_bins = 256
X, y = make_classification(n_samples=n_samples, n_features=n_features)
categorical_remapped = np.zeros(n_features, dtype=bool)

bin_mapper = _BinMapper(
    n_bins=n_bins,
    is_categorical=categorical_remapped,
    known_categories=None,
    random_state=1,
    n_threads=1,
)
%timeit bin_mapper.fit(X)            # 378 ms ± 2.57 ms (old 465 ms ± 3.08 ms)
%timeit bin_mapper.fit_transform(X)  # 602 ms ± 8.5 ms  (old 682 ms ± 4.33 ms)

bin_mapper = _BinMapper(
    n_bins=n_bins,
    is_categorical=categorical_remapped,
    known_categories=None,
    random_state=1,
    n_threads=4,
)
%timeit bin_mapper.fit(X)            # 103 ms ± 1.06 ms (old 137 ms ± 3.04 ms)
%timeit bin_mapper.fit_transform(X)  # 188 ms ± 9.76 ms (old 227 ms ± 3.85 ms)
```
LGTM. Thanks @lorentzenchr
So this broke Pyodide, very likely because you cannot start a thread in Pyodide; see the stack trace in the build log.
The work-around is to check whether we are inside Pyodide and not use multi-threading in that case? I guess Pyodide is the only case where you cannot create threads. I don't have a strong opinion on using joblib for multi-threading, but using …
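One possible shape for that work-around, assuming detecting Pyodide via `sys.platform == "emscripten"` is sufficient (illustrative, not the fix that actually landed):

```python
# Sketch: fall back to a sequential loop on platforms that cannot spawn
# threads (Pyodide reports sys.platform == "emscripten").
import concurrent.futures
import sys


def map_maybe_threaded(func, items, n_threads):
    if n_threads == 1 or sys.platform in ("emscripten", "wasi"):
        return [func(item) for item in items]
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
        return list(executor.map(func, items))
```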
Reference Issues/PRs
None
What does this implement/fix? Explain your changes.
This PR makes the finding of thresholds/quantiles in `_BinMapper.fit` parallel with `concurrent.futures.ThreadPoolExecutor`, see https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example. This is indeed one of the recommended ways to make numpy code parallel in pure Python, see https://numpy.org/doc/stable/reference/random/multithreading.html.

Any other comments?
Is there a reason we never use a `ThreadPoolExecutor`?
The gain in execution speed is low, as only a fraction of the fit time is spent finding the thresholds.
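To put that in perspective, a rough timing sketch comparing the binning step alone against a full `HistGradientBoostingClassifier` fit; absolute numbers depend on the machine and version:

```python
from time import perf_counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper

X, y = make_classification(n_samples=200_000, n_features=20, random_state=0)

tic = perf_counter()
_BinMapper(
    n_bins=256,
    is_categorical=np.zeros(20, dtype=bool),
    known_categories=None,
    random_state=1,
    n_threads=1,
).fit(X)
print(f"binning alone: {perf_counter() - tic:.2f} s")

tic = perf_counter()
HistGradientBoostingClassifier(random_state=0).fit(X, y)
print(f"full fit:      {perf_counter() - tic:.2f} s")
```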