Fix for making the initial binning in HGBT parallel #29386

OmarManzoor · 2024-07-02T13:24:22Z

Reference Issues/PRs

Follow up of #28064

What does this implement/fix? Explain your changes.

Attempts to fix an error within Pyodide that occurred due to the introduction of parallelisation in the initial binning in HGBT.
Uses Joblib with threading backend instead of ThreadPoolExecutor.
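
For readers unfamiliar with the pattern, here is a minimal sketch of per-feature work dispatched through joblib with the threading backend. It is not the code from this PR: `find_thresholds` and `fit_bin_thresholds` are hypothetical stand-ins for the real binning step, and plain Python lists are used instead of arrays to keep the example small.

```python
from joblib import Parallel, delayed


def find_thresholds(column, max_bins=4):
    """Hypothetical stand-in for the per-feature binning step."""
    distinct = sorted(set(column))
    if len(distinct) <= max_bins:
        # Few distinct values: use the midpoints between neighbours.
        return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    # Otherwise take evenly spaced cut points over the sorted values.
    step = len(distinct) / max_bins
    return [distinct[int(i * step)] for i in range(1, max_bins)]


def fit_bin_thresholds(columns, n_threads):
    # backend="threading" keeps the work in threads of the current process.
    # With n_jobs=1, joblib runs the tasks sequentially in the calling thread.
    return Parallel(n_jobs=n_threads, backend="threading")(
        delayed(find_thresholds)(col) for col in columns
    )


columns = [[1, 2, 3], [5.0, 5.0, 7.0], list(range(100))]
thresholds = fit_bin_thresholds(columns, n_threads=2)
```

`Parallel` returns the results as a list in the order of the inputs, so `thresholds[i]` belongs to `columns[i]` regardless of which thread finished first.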

Any other comments?

CC: @lesteve will this work?

github-actions · 2024-07-02T13:25:39Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 84eedc3. Link to the linter CI: here

OmarManzoor · 2024-07-02T13:31:01Z

Current timings

| `n_threads` | `fit` time | `fit_transform` time |
|---|---|---|
| 1 | 316 ms ± 405 μs | 417 ms ± 11.8 ms |
| 4 | 97.9 ms ± 586 μs | 125 ms ± 743 μs |

Timings from the previous PR

| `n_threads` | `fit` time | `fit_transform` time |
|---|---|---|
| 1 | 378 ms ± 2.57 ms (old 465 ms ± 3.08 ms) | 602 ms ± 8.5 ms (old 682 ms ± 4.33 ms) |
| 4 | 103 ms ± 1.06 ms (old 137 ms ± 3.04 ms) | 188 ms ± 9.76 ms (old 227 ms ± 3.85 ms) |

lesteve · 2024-07-02T13:52:47Z

> CC: @lesteve will this work?

I think it should. I pushed a commit with [pyodide] to trigger the Pyodide build and see what happens.

FYI we were discussing this with other maintainers and here is a quick summary:

- Here is Olivier's comment about why it may make sense to use `concurrent.futures` directly: #28064 (comment).
- Using joblib would be more consistent with the rest of the codebase, but there may be some overhead. To be honest, benchmarking is hard, and it is not like #28064 made a strong effort at it. Also, micro-benchmarking may not be a very useful use of our time ...
- #28064 used `as_completed` to act on the results as soon as they come back, whereas `joblib.Parallel` waits for all the tasks to finish before doing anything with the results. joblib 1.4 has `return_as='generator_unordered'` (see the changelog), which would do something similar. This feature is fairly new (and as such has probably not been battle-tested yet), and we probably don't want to require joblib>=1.4, released April 2024, just for this particular use case.
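
To make the last point concrete, here is a minimal standard-library sketch of the `as_completed` pattern, where each result is handled as soon as its task finishes. The `slow_square` function and its simulated delays are made up for illustration and are not taken from #28064.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(x, delay):
    # Simulate a task whose duration differs from one input to the next.
    time.sleep(delay)
    return x * x


tasks = [(3, 0.2), (4, 0.0)]  # (value, simulated duration in seconds)

arrival_order = []
with ThreadPoolExecutor(max_workers=2) as executor:
    # Map each future back to its input so results can be matched up later.
    futures = {executor.submit(slow_square, x, d): x for x, d in tasks}
    for future in as_completed(futures):
        # Each result is handled as soon as its task finishes.
        arrival_order.append((futures[future], future.result()))

# All results are present, but not necessarily in submission order.
results = dict(arrival_order)
```

Because completion order is not submission order, the caller has to keep track of which future belongs to which input, here through the `futures` dictionary.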

lesteve · 2024-07-02T15:39:22Z

So as expected Pyodide CI build is green.

I guess now comes the more complicated decision: what do we do about joblib with `backend='threading'` vs `concurrent.futures`?

Personally I would slightly lean towards sticking with joblib. It does feel like we are going to recreate a nano-joblib inside scikit-learn if we go down the `concurrent.futures` route (but this may still be manageable, not sure). For example:

- Should `n_threads=1` mean serial, as in joblib, or one worker thread, as was implemented in #28064 (which is what is causing the issue with Pyodide)?
- Should we support `n_threads=-1` like joblib does? I think this will not happen here because there is an `n_threads = _openmp_effective_n_threads(self.n_threads)` before reaching the `_BinMapper.fit` code, but you could imagine having the issue in another place where you need this additional logic, rather than passing `n_jobs` to `joblib.Parallel` and not thinking about it too much.
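
As a rough illustration of the extra logic these two questions imply on the `concurrent.futures` route, here is a sketch under stated assumptions. `resolve_n_threads` and `run_tasks` are hypothetical helpers, not scikit-learn or joblib functions; the handling of negative values follows joblib's documented convention for `n_jobs` (-1 means all CPUs, -2 all but one).

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def resolve_n_threads(n_threads):
    """Hypothetical helper mirroring joblib's convention for negative values."""
    if n_threads < 0:
        # os.cpu_count() may return None, so fall back to a single CPU.
        n_cpus = os.cpu_count() or 1
        return max(1, n_cpus + 1 + n_threads)
    return n_threads


def run_tasks(func, items, n_threads):
    """Hypothetical helper: serial when a single thread is requested."""
    n_threads = resolve_n_threads(n_threads)
    if n_threads == 1:
        # No worker thread is created; everything runs in the caller's thread.
        return [func(item) for item in items]
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        # executor.map preserves the order of the inputs.
        return list(executor.map(func, items))


def current_thread_name(_):
    return threading.current_thread().name


main_name = threading.current_thread().name
serial_names = run_tasks(current_thread_name, range(3), n_threads=1)
threaded_names = run_tasks(current_thread_name, range(3), n_threads=2)
```

With `n_threads=1` every task reports the caller's thread name, while with `n_threads=2` the names come from the pool's worker threads. That is the distinction that matters in environments where worker threads cannot be started.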

Let's ping in no particular order @ogrisel and @jeremiedbb (who were part of the discussion I tried to make a summary of in #29386 (comment)) as well as @betatim @lorentzenchr and @jjerphan (who were involved in the original PR #28064).

The alternative to using joblib is not doing things in parallel if n_threads == 1. I don't think there is a good reason to create a worker thread when n_threads=1 outside of Pyodide. If we insist we want a "we are inside Pyodide" check, see their doc.
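
For reference, a "we are inside Pyodide" check could look roughly like the sketch below. It relies on the runtime checks suggested in the Pyodide documentation as I understand them (platform reported as `emscripten`, `pyodide` module loaded); `effective_n_workers` is a hypothetical guard, not something from this PR.

```python
import sys


def running_in_pyodide():
    # Assumption based on the Pyodide docs: the platform is reported as
    # "emscripten" and the "pyodide" module is already loaded.
    return sys.platform == "emscripten" and "pyodide" in sys.modules


def effective_n_workers(n_threads):
    """Hypothetical guard: never ask for worker threads inside Pyodide."""
    return 1 if running_in_pyodide() else n_threads
```

A guard like this only helps if `n_threads == 1` is then treated as "run serially", which is the first option described above.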

jeremiedbb · 2024-07-03T08:19:24Z

So according to #29386 (comment), using joblib doesn't come with an overhead; if anything, it's the opposite. So I'm even more in favor of sticking with joblib and not introducing a new unnecessary pattern.

Maybe we could add a comment to use the return generator feature from joblib when 1.4 is our min version, but I'm not sure it would have a visible impact given that this task is already a small part of HGBT in terms of time spent.

lesteve · 2024-07-03T15:31:23Z

OK let's merge this, sticking to joblib seems more conservative/less controversial for now and avoids the Pyodide issue.

We can potentially revisit using concurrent.futures in a separate PR with maybe more detailed benchmarks and discussion to motivate it.

lesteve · 2024-07-03T15:32:32Z

Thanks @OmarManzoor for the fix!

lorentzenchr · 2024-07-03T19:29:28Z

> not introduce a new unnecessary pattern

While I recognize that this PR fixes a bug, this new pattern was a 1-1 copy-paste from the Python standard library (see https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example), and as a developer I find it quite strange to disfavor that.

The impact on timing is negligible anyway.

lesteve · 2024-07-04T05:04:24Z

I don't strongly disagree, but I think this should be discussed in a separate, dedicated issue, so that the conversation happens in a single place rather than being spread across multiple PRs ...

jeremiedbb · 2024-07-04T09:00:47Z

> this new pattern was a 1-1 copy-paste from python standard library

@lorentzenchr, I was not saying that it's not a valid pattern or that it's new in the general Python ecosystem. It's new in scikit-learn, because so far joblib has been used for both multi-processing and multi-threading situations. I don't know the historical reason, maybe to have a common syntax. So what I'm saying is that I'd rather stick with one pattern unless the new one brings a net benefit.

…#29386) Co-authored-by: Loïc Estève <loic.esteve@ymail.com>

Fix for making the initial binning in HGBT parallel

6a35375

github-actions bot added the module:ensemble label Jul 2, 2024

[azure parallel] [pyodide]

84eedc3

jeremiedbb approved these changes Jul 3, 2024


lesteve merged commit 82404ba into scikit-learn:main Jul 3, 2024
31 checks passed

OmarManzoor deleted the fix_for_parallel_binning branch July 3, 2024 15:35

snath-xoc pushed a commit to snath-xoc/scikit-learn that referenced this pull request Jul 5, 2024

FIX Use joblib Parallel for the initial binning in HGBT (scikit-learn…

378b34f

…#29386) Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
