ENH make initial binning in HGBT parallel #28064


Merged

Conversation

lorentzenchr
Member

Reference Issues/PRs

None

What does this implement/fix? Explain your changes.

This PR makes the finding of thresholds/quantiles in _BinMapper.fit parallel with concurrent.futures.ThreadPoolExecutor, see https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example. This is indeed one of the recommended ways to parallelize numpy in pure Python, see https://numpy.org/doc/stable/reference/random/multithreading.html.

Any other comments?

Is there a reason we never use a ThreadPoolExecutor?

The gain in execution speed is low, as only a fraction of the fit time is spent finding the thresholds.
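For illustration, the pattern described above can be sketched roughly as follows. This is a minimal, hypothetical example (the helper names are illustrative, not the actual _BinMapper code): a NumPy-heavy per-feature function computes the bin thresholds, and a ThreadPoolExecutor maps it over the columns. NumPy releases the GIL inside np.percentile and np.unique, so the threads can actually run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def _find_thresholds(col):
    # NumPy-heavy per-feature work: evenly spaced quantiles, deduplicated.
    # NumPy releases the GIL here, so several columns can run concurrently.
    percentiles = np.linspace(0, 100, num=257)[1:-1]
    return np.unique(np.percentile(col, percentiles))


def find_all_thresholds(X, n_threads=4):
    # Map the per-column work over a thread pool; executor.map preserves
    # the column order of the results.
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        return list(executor.map(_find_thresholds, X.T))


rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
thresholds = find_all_thresholds(X)
assert len(thresholds) == X.shape[1]
```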


github-actions bot commented Jan 4, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5b171d0.

@@ -6,6 +6,7 @@
approximately the same number of samples.
"""
# Author: Nicolas Hug
import concurrent.futures
Member

@jjerphan jjerphan Jan 5, 2024


Is there a reason we never use a ThreadPoolExecutor?

I think loky is preferred and used as a back-end via joblib instead.

lorentzenchr (Member Author)


But this here uses ThreadPoolExecutor, not ProcessPoolExecutor.

Member


I see.

I think @ogrisel is much more knowledgeable than me to make an educated decision.

Member


This pattern is interesting; I wonder whether we could abstract it and reuse it in other places.

Member

@ogrisel ogrisel Jan 8, 2024


I confirm that loky is only useful for process-based parallelism. For Python-level thread-based parallelism, ThreadPoolExecutor is perfectly fine.

Member

ogrisel commented Jan 8, 2024

Is there a reason we never use a ThreadPoolExecutor?

We traditionally used joblib.Parallel(prefer="threads") (overridable selection of the threading backend) or joblib.Parallel(backend="threading") (non-overridable selection of the threading backend) for Python-level thread-based parallelism in scikit-learn.

Internally, joblib with the "threading" backend uses multiprocessing.ThreadPool, which is the ancestor of concurrent.futures.ThreadPoolExecutor; the two are very similar. I find the API of concurrent.futures.ThreadPoolExecutor cleaner than multiprocessing.ThreadPool's, but joblib users should not see any difference.

In cases where we want to hard-code the use of threads, I think that ThreadPoolExecutor and joblib.Parallel(backend="threading") are both valid options. ThreadPoolExecutor might have slightly lower overhead (no joblib backend negotiation abstractions) and offers a richer API than joblib that makes it possible to consume the results on the fly (with as_completed), while this is only possible with the dev version of joblib (via the new return_as="unordered_generator" parameter).
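As a small, self-contained illustration of the as_completed pattern mentioned above (not scikit-learn code): results are consumed in completion order rather than submission order.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(delay):
    time.sleep(delay)
    return delay


with ThreadPoolExecutor(max_workers=3) as executor:
    # Submit tasks in one order...
    futures = [executor.submit(work, d) for d in (0.3, 0.1, 0.2)]
    # ...and consume them on the fly, in the order they finish.
    finished = [f.result() for f in as_completed(futures)]

print(finished)  # [0.1, 0.2, 0.3]
```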

The gain in execution speed is low, as only a fraction of the fit time is spent finding the thresholds.

Could you please post the results of a quick ad-hoc timeit for that step alone to quantify the speed-up out of curiosity?

Note that _BinMapper.fit finds the thresholds on a bounded random subsample of the training set, while _BinMapper.transform transforms the full training set but already uses OpenMP threads in its Cython code.

So when n_samples is large (in the millions or more), I expect _BinMapper.fit to be negligible compared to _BinMapper.transform and the latter should already be near-optimal in terms of parallelism.

Member

ogrisel commented Jan 8, 2024

Note: prior to making the code more complex for parallelization, maybe we should investigate optimizing the single-threaded variant: at the moment we sort each (subsampled) column's data twice, once in np.unique and once in np.percentile. Maybe there is a way to sort only once instead.
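To make the idea concrete, here is a hedged sketch of what "sort only once" could look like (illustrative only, not a proposed patch): sort the column once, then derive both the unique values and the quantiles from the already sorted array.

```python
import numpy as np

col = np.array([3.0, 1.0, 2.0, 2.0, 1.0, 3.0])

sorted_col = np.sort(col)  # the single O(n log n) sort

# Unique values of a sorted array via a diff-based scan (this is
# essentially what np.unique does internally after its own sort):
mask = np.concatenate(([True], np.diff(sorted_col) > 0))
uniques = sorted_col[mask]

# Quantiles by direct indexing into the sorted array, using the simple
# "lower" interpolation rule for illustration:
q = np.array([0.25, 0.5, 0.75])
idx = (q * (len(sorted_col) - 1)).astype(int)
quantiles = sorted_col[idx]
```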

Member Author

lorentzenchr commented Jan 11, 2024

Edit: Update (long) after merge of #28102, commit 9c5e16d

n_threads  fit time  fit_transform time
1          378 ms    602 ms
4          103 ms    188 ms
import numpy as np

from sklearn.datasets import make_classification
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper


n_samples, n_features = 200_000, 20
n_bins = 256

X, y = make_classification(n_samples=n_samples, n_features=n_features)
categorical_remapped = np.zeros(n_features, dtype=bool)

bin_mapper = _BinMapper(
    n_bins=n_bins,
    is_categorical=categorical_remapped,
    known_categories=None,
    random_state=1,
    n_threads=1,
)
%timeit bin_mapper.fit(X)  # 378 ms ± 2.57 ms (old 465 ms ± 3.08 ms)
%timeit bin_mapper.fit_transform(X)  # 602 ms ± 8.5 ms (old 682 ms ± 4.33 ms)

bin_mapper = _BinMapper(
    n_bins=n_bins,
    is_categorical=categorical_remapped,
    known_categories=None,
    random_state=1,
    n_threads=4,
)
%timeit bin_mapper.fit(X)  # 103 ms ± 1.06 ms (old 137 ms ± 3.04 ms)
%timeit bin_mapper.fit_transform(X)  #  188 ms ± 9.76 ms (old 227 ms ± 3.85 ms)

@lorentzenchr lorentzenchr added Quick Review For PRs that are quick to review Performance labels May 23, 2024
@lorentzenchr lorentzenchr added this to the 1.6 milestone May 23, 2024
Contributor

@OmarManzoor OmarManzoor left a comment


LGTM. Thanks @lorentzenchr

@OmarManzoor OmarManzoor added the Waiting for Second Reviewer First reviewer is done, need a second one! label Jun 28, 2024
@OmarManzoor OmarManzoor merged commit 2107404 into scikit-learn:main Jun 28, 2024
31 checks passed
@lorentzenchr lorentzenchr deleted the hgbt_parallel_quantile_binning branch June 28, 2024 15:21
Member

lesteve commented Jul 1, 2024

So this broke Pyodide, very likely because you cannot start a thread in Pyodide; see the build log.

Stack-trace
_____________________ test_bin_mapper_n_features_transform _____________________

    def test_bin_mapper_n_features_transform():
>       mapper = _BinMapper(n_bins=42, random_state=42).fit(DATA)
        w          = <concurrent.futures.thread._WorkItem object at 0x111cf120>
/lib/python312.zip/concurrent/futures/thread.py:202: in _adjust_thread_count
    t.start()
        num_threads = 0
        self       = <concurrent.futures.thread.ThreadPoolExecutor object at 0x113cbb50>
        t          = <Thread(ThreadPoolExecutor-0_0, initial)>
        thread_name = 'ThreadPoolExecutor-0_0'
        weakref_cb = <function ThreadPoolExecutor._adjust_thread_count.<locals>.weakref_cb at 0x11454080>
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Thread(ThreadPoolExecutor-0_0, initial)>

    def start(self):
        """Start the thread's activity.
    
        It must be called at most once per thread object. It arranges for the
        object's run() method to be invoked in a separate thread of control.
    
        This method will raise a RuntimeError if called more than once on the
        same thread object.
    
        """
        if not self._initialized:
            raise RuntimeError("thread.__init__() not called")
    
        if self._started.is_set():
            raise RuntimeError("threads can only be started once")
    
        with _active_limbo_lock:
            _limbo[self] = self
        try:
>           _start_new_thread(self._bootstrap, ())
E           RuntimeError: can't start new thread

self       = <Thread(ThreadPoolExecutor-0_0, initial)>

/lib/python312.zip/threading.py:992: RuntimeError

A work-around would be to check whether we are running inside Pyodide and not use multi-threading in that case. I guess Pyodide is the only case where you cannot create threads.

I don't have a strong opinion on using joblib for multi-threading, but using joblib.Parallel would have switched to n_jobs=1 inside Pyodide. Side comment: I think return_as="unordered_generator" is in joblib 1.4 and not only in the joblib dev version, see the changelog. Having said that, I'm not sure we want to require joblib>=1.4, released April 2024, to use this feature.
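That work-around could be sketched as follows, assuming Pyodide identifies itself via sys.platform == "emscripten" (how CPython builds for WebAssembly report the platform); the helper name is hypothetical, not scikit-learn API.

```python
import sys


def effective_n_threads(requested):
    # Pyodide/Emscripten (and WASI) cannot start OS threads, so fall
    # back to sequential execution on those platforms.
    if sys.platform in ("emscripten", "wasi"):
        return 1
    return requested
```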

@jeremiedbb jeremiedbb mentioned this pull request Jul 2, 2024
snath-xoc pushed a commit to snath-xoc/scikit-learn that referenced this pull request Jul 5, 2024