Better documentation for n_jobs #14228
Comments
Hi Nicolas, can I work on this one? It will be my first issue. Since there are lots of unique occurrences of n_jobs, it might take time. But I am motivated to get it done. |
@arpanchowdhry note that it is labeled as moderate (not "easy") since it requires some knowledge of the implementations. I wouldn't particularly recommend this as a first issue, but if you're confident please feel free to try! |
@NicolasHug I agree that this is not easy as it involves a lot of inner working knowledge. I have been making progress by reading a lot of code. But if you or someone else gets it first that is fine, it has still been good learning for me. |
@NicolasHug Hi! Are you still looking for help? I'm interested in contributing to this issue. Please let me know in case you've room. |
Yes feel free to submit a PR for some estimators |
@vinidixit Please go ahead. |
For example, this doc was added (chronologically) after the discussion in #10415 in this commit. But IMO, the explanation of the parameter still seems a bit generic, especially if we assess it based on @NicolasHug's idea of
n_jobs : int or None, optional (default=None)
    The number of jobs to use for the computation.
    ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
    ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
    for more details.
Is this kind of documentation counted as good enough, or do you think we need more context-specific documentation? @NicolasHug |
Thanks for the script @kwinata!
That could definitely be improved: "The number of jobs to use for the computation" is very vague and doesn't say what is parallelized. Users should get a sense of how much faster their code may be when they change |
Welcome @NicolasHug Sure then, I would be happy to work on improving those cases! |
@NicolasHug, I would like to work on @kwinata, please do let me know if the above-mentioned cases have already been worked on. Thank you |
@PyExtreme thanks! Feel free to start new PRs (Please reference this issue when you do) Tip: you can filter PRs by author and you'll see that @kwinata does not seem to have opened anything yet. |
Yes @PyExtreme, feel free to work on them. I have been busy for the past few weeks. Currently I will work on the |
@NicolasHug, I have updated the list of files containing As a note, there is still an open PR (#15613) that references this issue. |
A link to the glossary is the bare minimum, but that's not what this issue is about. The issue is about:
Any docstring that does not explain where and how parallelization happens should be updated. |
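One rough way to find docstrings that fall short of this bar is a keyword check on the `n_jobs` entry. The sketch below is only an illustration, not part of scikit-learn: the helper names (`n_jobs_description`, `is_vague`), the two toy estimator classes, and the `HINTS` phrase list are all hypothetical, and the heuristic assumes numpydoc-style docstrings.

```python
import inspect
import re

# Assumed heuristic: a useful description names what is parallelized.
# The phrase list is a guess and would need tuning on real docstrings.
HINTS = re.compile(
    r"paralleli[sz]ed over|in parallel over|built in parallel",
    re.IGNORECASE,
)


def n_jobs_description(obj):
    """Return the numpydoc text under the ``n_jobs`` parameter, or None."""
    doc = inspect.getdoc(obj) or ""
    lines = doc.splitlines()
    for i, line in enumerate(lines):
        if re.match(r"n_jobs\s*:", line):
            body = []
            for nxt in lines[i + 1:]:
                # The description is the indented block after the header.
                if nxt.strip() and not nxt.startswith(" "):
                    break
                body.append(nxt.strip())
            return " ".join(part for part in body if part)
    return None


def is_vague(obj):
    """True if ``n_jobs`` is documented but never says what runs in parallel."""
    desc = n_jobs_description(obj)
    return desc is not None and not HINTS.search(desc)


class VagueEstimator:
    """Hypothetical estimator with a generic description.

    Parameters
    ----------
    n_jobs : int, default=None
        The number of jobs to run in parallel.
    """


class SpecificEstimator:
    """Hypothetical estimator with a context-specific description.

    Parameters
    ----------
    n_jobs : int, default=None
        The number of jobs to run in parallel. Each tree is built in
        parallel during fit.
    """


print(is_vague(VagueEstimator), is_vague(SpecificEstimator))
```

A check like this can only shortlist candidates; deciding what the docstring should actually say still requires reading the implementation.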
Hi! @annejeevan and I will work on documenting |
take sklearn/_graph.py with @krumeto |
Hi, take |
Hi, I'll work on |
I opened the PR for |
I would be interested in a docs page that has a list of all the algorithms that support parallelism with |
@rikturr I'm not sure such a list would be in-scope for our docs: it would be easy to forget to update and the benefits aren't clear to me. But you can get what you want by programmatically checking the signature of all estimators: https://scikit-learn.org/stable/modules/generated/sklearn.utils.all_estimators.html |
@NicolasHug thanks for the pointer. For anyone else that wants to do it, here's a quick snippet:
import inspect

from sklearn.utils import all_estimators

# Collect every estimator whose constructor accepts an n_jobs parameter.
has_n_jobs = []
for name, est_class in all_estimators():  # yields (name, class) tuples
    signature = inspect.signature(est_class)
    if 'n_jobs' in signature.parameters:
        has_n_jobs.append((name, est_class))
print(has_n_jobs) |
Sort of a follow-up to #10415.
The great majority of the docstrings for n_jobs are something like "number of jobs to run in parallel". It would be much more helpful for users to understand where and how parallelization is actually applied.
For example for random forest: "Each tree is built in parallel" would be a simple yet significant improvement.
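To make the random forest example concrete, here is a minimal toy sketch of a docstring that names its unit of parallelism. It is not scikit-learn's implementation: `build_tree` and `fit_forest` are hypothetical names, the "trees" are placeholder dicts, threads from the standard library stand in for joblib, and `-1` is deliberately not handled.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def build_tree(seed):
    """Stand-in for fitting one tree; returns a placeholder 'tree'."""
    rng = random.Random(seed)
    return {"seed": seed, "threshold": rng.random()}


def fit_forest(n_estimators=8, n_jobs=None):
    """Build a toy forest.

    Parameters
    ----------
    n_estimators : int, default=8
        Number of toy trees to build.
    n_jobs : int or None, default=None
        Number of worker threads used while building the forest.
        Each tree is built in parallel, so at most ``n_estimators``
        workers can be busy at once. ``None`` means 1.
    """
    workers = 1 if n_jobs is None else n_jobs
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One task per tree: this is the unit of parallelism the
        # docstring names, and pool.map keeps the trees in order.
        return list(pool.map(build_tree, range(n_estimators)))


forest = fit_forest(n_estimators=4, n_jobs=2)
print(len(forest))
```

The point is the docstring rather than the executor: a reader can tell from the `n_jobs` entry alone that raising it beyond the number of trees cannot help.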