
Better documentation for n_jobs #14228

Open
NicolasHug opened this issue Jul 1, 2019 · 24 comments
Labels
Documentation, help wanted, Moderate (anything that requires some knowledge of conventions and best practices)

Comments

@NicolasHug
Member

Sort of a follow-up to #10415.

The great majority of the docstrings for n_jobs are something like "number of jobs to run in parallel".

It would be much more helpful for users to understand where and how parallelization is actually applied.

For example for random forest: "Each tree is built in parallel" would be a simple yet significant improvement.
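To make this concrete, here is a sketch of what such a docstring could look like for the forests (illustrative wording only, to be adapted per estimator):

    n_jobs : int or None, optional (default=None)
        The number of jobs to run in parallel. ``fit``, ``predict``,
        ``decision_path`` and ``apply`` are all parallelized over the
        trees. ``None`` means 1 unless in a :obj:`joblib.parallel_backend`
        context. ``-1`` means using all processors. See :term:`Glossary
        <n_jobs>` for more details.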

@NicolasHug added the Documentation, Moderate, and help wanted labels on Jul 1, 2019
@arpanchowdhry
Contributor

Hi Nicolas, can I work on this one? It will be my first issue. Since there are lots of unique occurrences of n_jobs, it might take time, but I am motivated to get it done.

@NicolasHug
Member Author

@arpanchowdhry note that it is labeled as moderate (not "easy") since it requires some knowledge of the implementations. I wouldn't particularly recommend this as a first issue, but if you're confident please feel free to try!

@arpanchowdhry
Contributor

@NicolasHug I agree that this is not easy, as it involves a lot of knowledge of the inner workings. I have been making progress by reading a lot of code, but if you or someone else gets to it first, that is fine; it has still been good learning for me.

@vinidixit

@NicolasHug Hi! Are you still looking for help? I'm interested in contributing to this issue. Please let me know if there's room.

@NicolasHug
Member Author

Yes, feel free to submit a PR for some estimators.

@arpanchowdhry
Contributor

@vinidixit Please go ahead.

@kwinata
Contributor

kwinata commented Sep 25, 2019

Based on the script below (run from the repository root, so that Path('sklearn') resolves):

import re
from collections import defaultdict
from pathlib import Path

# Map each file to the line numbers where an `n_jobs` parameter is documented.
matches = defaultdict(list)
for filename in Path('sklearn').glob('**/*.py'):
    with open(filename) as f:
        for i, line in enumerate(f.readlines()):
            if re.search(r"n_jobs ?:", line):
                matches[str(filename)].append(i + 1)

# Print a markdown checklist linking to each file and line on GitHub.
for file, lines in matches.items():
    print('- [ ] [{FN}](https://github.com/scikit-learn/scikit-learn/blob/master/{FN})'.format(FN=file), end=" - ")
    line_number_format = '[{LN}](https://github.com/scikit-learn/scikit-learn/blob/master/{FN}#L{LN})'
    line_number_links = ', '.join(line_number_format.format(FN=file, LN=line) for line in lines)
    print(line_number_links)

I got this list:

I can help check them to track progress.

[Edit 13 Nov: script rerun after some files were renamed]

@kwinata
Contributor

kwinata commented Sep 25, 2019

For example, this doc was added (chronologically) after the discussion in #10415, in this commit.

But IMO the explanation of the parameter still seems a bit generic, especially if we assess it against @NicolasHug's idea:

It would be much more helpful for users to understand where and how parallelization is actually applied.
For example for random forest: "Each tree is built in parallel" would be a simple yet significant improvement.

    n_jobs : int or None, optional (default=None)
        The number of jobs to use for the computation.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

Is this kind of documentation considered good enough, or do you think we need more context-specific documentation? @NicolasHug

@NicolasHug
Member Author

Thanks for the script @kwinata!

Is this kind of documentation considered good enough, or do you think we need more context-specific documentation? @NicolasHug

That could definitely be improved: "The number of jobs to use for the computation" is very vague and doesn't say what is parallelized. Users should get a sense of how much faster their code may be when they change n_jobs.
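For illustration, a minimal timing sketch (dataset size, estimator settings, and timings are arbitrary, not from this thread) showing what "how much faster" means for a forest, where each tree is built in parallel:

from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Build the same forest serially and with all processors; compare wall time.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
for n_jobs in (1, -1):
    clf = RandomForestClassifier(n_estimators=200, n_jobs=n_jobs, random_state=0)
    tic = perf_counter()
    clf.fit(X, y)
    print(f"n_jobs={n_jobs}: fit took {perf_counter() - tic:.2f}s")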

@kwinata
Contributor

kwinata commented Sep 25, 2019

You're welcome, @NicolasHug.

Sure then, I would be happy to work on improving those cases!

@PyExtreme
Contributor

PyExtreme commented Oct 14, 2019

@NicolasHug, I would like to work on sklearn/linear_model/stochastic_gradient.py, sklearn/linear_model/coordinate_descent.py, and sklearn/ensemble/forest.py just for a start.
Later, I would like to pick up more in batches.

@kwinata, please do let me know if the above-mentioned cases have already been worked on.

Thank you

@NicolasHug
Member Author

@PyExtreme thanks! Feel free to open new PRs (please reference this issue when you do).

Tip: you can filter PRs by author and you'll see that @kwinata does not seem to have opened anything yet.

@kwinata
Contributor

kwinata commented Oct 15, 2019

Yes @PyExtreme, feel free to work on them. I have been busy these past few weeks. For now I will work on the cluster and neighbors algorithms.

@cmarmo
Contributor

cmarmo commented Jan 9, 2020

@NicolasHug, I have updated the list of files containing n_jobs in the last release (see this gist). Checked files there are the ones containing n_jobs with a reference to the Glossary: they are the majority.

I'm wondering if this could still be a good issue for a sprint (it was used at the SciPy sprint 2019, if I understand correctly). It is not an obvious task to identify whether an explanation needs to be updated (that's why you labelled it 'Moderate'), and unlike the random_state issue (#10548) there is not even a simple addition to provide. May I suggest providing a list of the explanations that you find really unsatisfactory? No deadline... :) I'm not adding this one to Sprints, if you are ok with that.

As a note, there is still an open PR (#15613) referred to this issue.

@NicolasHug
Member Author

Checked files there are the ones containing n_jobs with a reference to the Glossary: they are the majority

A link to the glossary is the bare minimum, but that's not what this issue is about. The issue is about:

The great majority of the docstrings for n_jobs are something like "number of jobs to run in parallel".

Any docstring that does not explain where and how parallelization happens should be updated.

@emdupre
Contributor

emdupre commented Jun 6, 2020

Hi! @annejeevan and I will work on documenting n_jobs for forest models and multiclass as part of the Data Umbrella sprint.

@CeeThinwa
Contributor

CeeThinwa commented Jun 6, 2020

take sklearn/_graph.py with @krumeto

@ghost

ghost commented Jun 25, 2020

Hi, take sklearn/gaussian_process/_gpc.py

@hs-nazuna
Contributor

hs-nazuna commented Jun 25, 2020

Hi, I'll work on sklearn/model_selection/_validation.py

@hs-nazuna
Contributor

I opened the PR for _permutation_importance.py too.

@tnwei
Contributor

tnwei commented Oct 17, 2020

Submitted #18633 and #18634 for sklearn/calibration.py and sklearn/multioutput.py

@rikturr

rikturr commented Feb 9, 2021

I would be interested in a docs page that lists all the algorithms that support parallelism with n_jobs, specifically because it's helpful to get a high-level glance at which algorithms might be able to utilize Dask for parallelism across a cluster. Would the parallelism page be a good place for this, or perhaps a new page? I would be happy to contribute this; just want to get an idea of what the maintainers think about it.

@NicolasHug
Member Author

@rikturr I'm not sure such a list would be in scope for our docs: it would be easy to forget to update it, and the benefits aren't clear to me.

But you can get what you want by programmatically checking the signature of all estimators: https://scikit-learn.org/stable/modules/generated/sklearn.utils.all_estimators.html

@rikturr

rikturr commented Feb 9, 2021

@NicolasHug thanks for the pointer. For anyone else that wants to do it, here's a quick snippet:

import inspect

from sklearn.utils import all_estimators

# all_estimators() yields (name, class) pairs; inspecting the class
# signature looks at its __init__ parameters.
has_n_jobs = []
for name, Estimator in all_estimators():
    sig = inspect.signature(Estimator)
    if 'n_jobs' in sig.parameters:
        has_n_jobs.append((name, Estimator))
print(has_n_jobs)
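If you only need the names, the collected (name, class) pairs make that a one-liner:

print(sorted(name for name, _ in has_n_jobs))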
