
Better documentation for n_jobs #14228

Open
NicolasHug opened this issue Jul 1, 2019 · 24 comments
Labels
Documentation, help wanted, Moderate (anything that requires some knowledge of conventions and best practices)

Comments

@NicolasHug
Member

Sort of a follow-up to #10415.

The great majority of the docstrings for n_jobs are something like "number of jobs to run in parallel".

It would be much more helpful for users to understand where and how parallelization is actually applied.

For example for random forest: "Each tree is built in parallel" would be a simple yet significant improvement.
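To make this concrete, here is a sketch of what such a docstring could look like for the forests (illustrative wording only, to be adapted per estimator):

    n_jobs : int or None, optional (default=None)
        The number of jobs to run in parallel. ``fit``, ``predict``,
        ``decision_path`` and ``apply`` are all parallelized over the
        trees. ``None`` means 1 unless in a :obj:`joblib.parallel_backend`
        context. ``-1`` means using all processors. See :term:`Glossary
        <n_jobs>` for more details.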

@NicolasHug added the Documentation, Moderate, and help wanted labels on Jul 1, 2019
@arpanchowdhry
Contributor

Hi Nicolas, can I work on this one? It will be my first issue. Since there are lots of unique occurrences of n_jobs, it might take time, but I am motivated to get it done.

@NicolasHug
Member Author

@arpanchowdhry note that it is labeled as moderate (not "easy") since it requires some knowledge of the implementations. I wouldn't particularly recommend this as a first issue, but if you're confident please feel free to try!

@arpanchowdhry
Contributor

@NicolasHug I agree that this is not easy, as it involves a lot of knowledge of the inner workings. I have been making progress by reading a lot of code, but if you or someone else gets to it first, that is fine; it has still been good learning for me.

@vinidixit

@NicolasHug Hi! Are you still looking for help? I'm interested in contributing to this issue. Please let me know if there's room.

@NicolasHug
Member Author

Yes, feel free to submit a PR for some estimators.

@arpanchowdhry
Contributor

@vinidixit Please go ahead.

@kwinata
Contributor

kwinata commented Sep 25, 2019

Based on the script below (run from the repository root, so that Path('sklearn') resolves):

import re
from collections import defaultdict
from pathlib import Path

# Map each file to the line numbers where an `n_jobs` parameter is documented.
matches = defaultdict(list)
for filename in Path('sklearn').glob('**/*.py'):
    with open(filename) as f:
        for i, line in enumerate(f.readlines()):
            if re.search(r"n_jobs ?:", line):
                matches[str(filename)].append(i + 1)

# Print a markdown checklist linking to each file and line on GitHub.
for file, lines in matches.items():
    print('- [ ] [{FN}](https://github.com/scikit-learn/scikit-learn/blob/master/{FN})'.format(FN=file), end=" - ")
    line_number_format = '[{LN}](https://github.com/scikit-learn/scikit-learn/blob/master/{FN}#L{LN})'
    line_number_links = ', '.join(line_number_format.format(FN=file, LN=line) for line in lines)
    print(line_number_links)

I got this list:

I can help check them to track progress.

[Edit 13 Nov: script rerun after some files were renamed]

@kwinata
Contributor

kwinata commented Sep 25, 2019

For example, this doc was added (chronologically) after the discussion in #10415, in this commit.

But IMO the explanation of the parameter still seems a bit generic, especially if we assess it against @NicolasHug's idea:

It would be much more helpful for users to understand where and how parallelization is actually applied.
For example for random forest: "Each tree is built in parallel" would be a simple yet significant improvement.

    n_jobs : int or None, optional (default=None)
        The number of jobs to use for the computation.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

Is this kind of documentation considered good enough, or do you think we need more context-specific documentation? @NicolasHug

@NicolasHug
Member Author

Thanks for the script @kwinata!

Is this kind of documentation considered good enough, or do you think we need more context-specific documentation? @NicolasHug

That could definitely be improved: "The number of jobs to use for the computation" is very vague and doesn't say what is parallelized. Users should get a sense of how much faster their code may be when they change n_jobs.
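For illustration, a minimal timing sketch (dataset size, estimator settings, and timings are arbitrary, not from this thread) showing what "how much faster" means for a forest, where each tree is built in parallel:

from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Build the same forest serially and with all processors; compare wall time.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
for n_jobs in (1, -1):
    clf = RandomForestClassifier(n_estimators=200, n_jobs=n_jobs, random_state=0)
    tic = perf_counter()
    clf.fit(X, y)
    print(f"n_jobs={n_jobs}: fit took {perf_counter() - tic:.2f}s")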

@kwinata
Contributor

kwinata commented Sep 25, 2019

You're welcome, @NicolasHug.

Sure then, I would be happy to work on improving those cases!

@PyExtreme
Contributor

PyExtreme commented Oct 14, 2019

@NicolasHug, I would like to work on sklearn/linear_model/stochastic_gradient.py, sklearn/linear_model/coordinate_descent.py, and sklearn/ensemble/forest.py just for a start.
Later, I would like to pick up more in batches.

@kwinata, please do let me know if the above-mentioned cases have already been worked on.

Thank you

@NicolasHug
Member Author

@PyExtreme thanks! Feel free to open new PRs (please reference this issue when you do).

Tip: you can filter PRs by author and you'll see that @kwinata does not seem to have opened anything yet.

@kwinata
Contributor

kwinata commented Oct 15, 2019

Yes @PyExtreme, feel free to work on them. I have been busy these past few weeks. For now I will work on the cluster and neighbors algorithms.

@cmarmo
Contributor

cmarmo commented Jan 9, 2020

@NicolasHug, I have updated the list of files containing n_jobs in the last release (see this gist). Checked files there are the ones containing n_jobs with a reference to the Glossary: they are the majority.

I'm wondering if this could still be a good issue for a sprint (it was used at the SciPy sprint 2019, if I understand correctly). It is not an obvious task to identify whether an explanation needs to be updated (that's why you labelled it 'Moderate'), and unlike the random_state issue (#10548) there is not even a simple addition to provide. May I suggest providing a list of the explanations that you find really unsatisfactory? No deadline... :) I'm not adding this one to Sprints, if you are ok with that.

As a note, there is still an open PR (#15613) referred to this issue.

@NicolasHug
Member Author

Checked files there are the ones containing n_jobs with a reference to the Glossary: they are the majority

A link to the glossary is the bare minimum, but that's not what this issue is about. The issue is about:

The great majority of the docstrings for n_jobs are something like "number of jobs to run in parallel".

Any docstring that does not explain where and how parallelization happens should be updated.

@emdupre
Contributor

emdupre commented Jun 6, 2020

Hi! @annejeevan and I will work on documenting n_jobs for forest models and multiclass as part of the Data Umbrella sprint.

@CeeThinwa
Contributor

CeeThinwa commented Jun 6, 2020

take sklearn/_graph.py with @krumeto

@ghost

ghost commented Jun 25, 2020

Hi, take sklearn/gaussian_process/_gpc.py

@hs-nazuna
Contributor

hs-nazuna commented Jun 25, 2020

Hi, I'll work on sklearn/model_selection/_validation.py

@hs-nazuna
Contributor

I opened the PR for _permutation_importance.py too.

@tnwei
Contributor

tnwei commented Oct 17, 2020

Submitted #18633 and #18634 for sklearn/calibration.py and sklearn/multioutput.py

@rikturr

rikturr commented Feb 9, 2021

I would be interested in a docs page that lists all the algorithms that support parallelism with n_jobs, specifically because it's helpful to get a high-level glance at which algorithms might be able to utilize Dask for parallelism across a cluster. Would the parallelism page be a good place for this, or perhaps a new page? I would be happy to contribute this; just want to get an idea of what the maintainers think about it.

@NicolasHug
Member Author

@rikturr I'm not sure such a list would be in scope for our docs: it would be easy to forget to update it, and the benefits aren't clear to me.

But you can get what you want by programmatically checking the signature of all estimators: https://scikit-learn.org/stable/modules/generated/sklearn.utils.all_estimators.html

@rikturr

rikturr commented Feb 9, 2021

@NicolasHug thanks for the pointer. For anyone else that wants to do it, here's a quick snippet:

import inspect

from sklearn.utils import all_estimators

# all_estimators() yields (name, class) pairs; inspecting the class
# signature looks at its __init__ parameters.
has_n_jobs = []
for name, Estimator in all_estimators():
    sig = inspect.signature(Estimator)
    if 'n_jobs' in sig.parameters:
        has_n_jobs.append((name, Estimator))
print(has_n_jobs)
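If you only need the names, the collected (name, class) pairs make that a one-liner:

print(sorted(name for name, _ in has_n_jobs))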
