DOC add more details for n_jobs in MeanShift docstring (#25083)
Conversation
As mentioned by @jeremiedbb in the original issue, we need to give details regarding which parts of the algorithm benefit from parallelization.
sklearn/cluster/_mean_shift.py (outdated)
    The number of jobs to use for the computation. This works by computing
-   each of the n_init runs in parallel.
+   each of the n_jobs runs in parallel.
The number of jobs to use for the computation. The following tasks benefit
from the parallelization: the search of nearest neighbors for bandwidth
estimation and label assignments, and the hill climbing optimization for all
seeds.
sklearn/cluster/_mean_shift.py (outdated)
    The number of jobs to use for the computation. This works by computing
-   each of the n_init runs in parallel.
+   each of the n_jobs runs in parallel.
The number of jobs to use for the computation. The following tasks benefit
from the parallelization: the search of nearest neighbors for bandwidth
estimation and label assignments, and the hill climbing optimization for all
seeds.
done
sklearn/cluster/_mean_shift.py (outdated)
@@ -177,7 +177,10 @@ def mean_shift(
    n_jobs : int, default=None
        The number of jobs to use for the computation. This works by computing
Remove the sentence: "This works by computing each of the n_jobs in parallel."
done
sklearn/cluster/_mean_shift.py
Outdated
The number of jobs to use for the computation. The number of jobs to use for | ||
the computation. The following tasks benefit from the parallelization: the | ||
search of nearest neighbors for bandwidth estimation and label assignments, | ||
and the hill climbing optimization for all seeds. |
Suggested change:
-   The number of jobs to use for the computation. The number of jobs to use for
-   the computation. The following tasks benefit from the parallelization: the
-   search of nearest neighbors for bandwidth estimation and label assignments,
-   and the hill climbing optimization for all seeds.
+   The number of jobs to use for the computation. The following tasks benefit
+   from the parallelization: the search of nearest neighbors for bandwidth
+   estimation and label assignments, and the hill climbing optimization for all
+   seeds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sklearn/cluster/_mean_shift.py (outdated)
    The number of jobs to use for the computation. The number of jobs to use for
    the computation. The following tasks benefit from the parallelization: the
    search of nearest neighbors for bandwidth estimation and label assignments,
    and the hill climbing optimization for all seeds.
Suggested change:
-   The number of jobs to use for the computation. The number of jobs to use for
-   the computation. The following tasks benefit from the parallelization: the
-   search of nearest neighbors for bandwidth estimation and label assignments,
-   and the hill climbing optimization for all seeds.
+   The number of jobs to use for the computation. The following tasks benefit
+   from the parallelization: the search of nearest neighbors for bandwidth
+   estimation and label assignments, and the hill climbing optimization for all
+   seeds.
Done
Left some suggestions for wording, let me know what you think
sklearn/cluster/_mean_shift.py (outdated)
    The number of jobs to use for the computation. The following tasks benefit
    from the parallelization: the search of nearest neighbors for bandwidth
    estimation and label assignments, and the hill climbing optimization
    for all seeds.
Perhaps this would improve readability, and provide a means of further reading.
Suggested change:
-   The number of jobs to use for the computation. The following tasks benefit
-   from the parallelization: the search of nearest neighbors for bandwidth
-   estimation and label assignments, and the hill climbing optimization
-   for all seeds.
+   The number of jobs to use for the computation. The following tasks benefit
+   from the parallelization:
+   - The search of nearest neighbors for bandwidth estimation and label assignments
+   - Hill-climbing optimization for all seeds
+   See :term:`Glossary <n_jobs>` for more details.
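For readers following the thread, the parameter being documented can be exercised like this (a minimal sketch; the toy dataset and the choice of n_jobs=2 are illustrative, not taken from the PR):

```python
import numpy as np
from sklearn.cluster import MeanShift

# Toy dataset: two well-separated blobs (illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0, 0.3, size=(50, 2)),
    rng.normal(5, 0.3, size=(50, 2)),
])

# n_jobs parallelizes the nearest-neighbors searches and the
# per-seed hill-climbing loop discussed in this thread (via joblib).
ms = MeanShift(n_jobs=2).fit(X)
print(ms.cluster_centers_.shape)
print(ms.labels_.shape)
```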
I'd also like your opinion: we can add it and it looks good, but this paragraph will stand out because it is in bullet points while the paragraphs around it are regular prose. If we don't mind how that looks in the rendered docs, I think we can make the change.
Actually, with the refactoring of the Cython code underlying NearestNeighbors, n_jobs will soon have no effect on that step: the n_jobs parameter in NearestNeighbors is only used in very specific cases (e.g. a distance metric passed as a Python callable, which is never the case for the neighbors computed for MeanShift).
I let @jjerphan confirm the above.
Thanks for the heads-up!

n_jobs is only used when the following conditions are met:
- _fit_method == "brute", i.e. when algorithm="brute", or when algorithm="auto" and the heuristic chooses "brute" (i.e. when metric="precomputed", or n_features > 15, or n_neighbors >= n_samples_train)
- metric is a callable or is not in the iterable returned by BaseDistancesReductionDispatcher.valid_metrics:
In [1]: from sklearn.metrics._pairwise_distances_reduction import BaseDistancesReductionDispatcher
In [2]: BaseDistancesReductionDispatcher.valid_metrics()
Out[2]:
['braycurtis',
'canberra',
'chebyshev',
'cityblock',
'euclidean',
'haversine',
'infinity',
'l1',
'l2',
'manhattan',
'minkowski',
'p',
'seuclidean',
'sqeuclidean',
'wminkowski']
- the new back-ends are deactivated via the enable_cython_pairwise_dist parameter of the scikit-learn config (they are activated by default)
For a comprehensive overview, see: #24997 (comment)
My plan is to support all the cases as much as possible in the new back-ends, but we won't be able to support them all (e.g. Python callables aren't supportable IMO).
Note that some user-facing estimator APIs probably aren't propagating n_jobs in most cases, because for them we might implicitly and ultimately be in a configuration where metric="(sq)euclidean" and algorithm="brute".
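To make the dispatch conditions above concrete, here is a rough sketch. It inspects the private `_fit_method` attribute mentioned earlier in the thread, so the exact behavior is an implementation detail that may change between scikit-learn versions; the sample sizes are arbitrary:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)

# Few features: the algorithm="auto" heuristic typically picks a tree,
# so the brute-force path (where n_jobs can matter) is not taken.
nn_low = NearestNeighbors(algorithm="auto", n_jobs=2).fit(rng.rand(100, 5))
print(nn_low._fit_method)

# More than 15 features: the heuristic falls back to brute force,
# one of the conditions listed above under which n_jobs can be used.
nn_high = NearestNeighbors(algorithm="auto", n_jobs=2).fit(rng.rand(100, 30))
print(nn_high._fit_method)
```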
Hum, it's true that it can be used for non-brute algorithms for small n_features. Maybe we can say:

- The search of nearest neighbors for bandwidth estimation and label assignments. See the details in the docstring of the NearestNeighbors class.

to keep things simple in the MeanShift class.
Yes I'll add it
sklearn/cluster/_mean_shift.py
Outdated
The number of jobs to use for the computation. following tasks benefit | ||
from the parallelization: the search of nearest neighbors for bandwidth | ||
estimation and label assignments, and the hill climbing optimization | ||
for all seeds. |
Ditto:
-   The number of jobs to use for the computation. following tasks benefit
-   from the parallelization: the search of nearest neighbors for bandwidth
-   estimation and label assignments, and the hill climbing optimization
-   for all seeds.
+   The number of jobs to use for the computation. The following tasks benefit
+   from the parallelization:
+   - The search of nearest neighbors for bandwidth estimation and label assignments
+   - Hill-climbing optimization for all seeds
+   See :term:`Glossary <n_jobs>` for more details.
Yes, I'll do it.

I am fine with the proposal of @Micky774. We only need to make sure that it renders properly.
sklearn/cluster/_mean_shift.py (outdated)
-   The number of jobs to use for the computation. This works by computing
-   each of the n_init runs in parallel.
+   The number of jobs to use for the computation. following tasks benefit
+   from the parallelization: the search of nearest neighbors for bandwidth
Same comment about the impacts of n_jobs for the nearest neighbors computation step.

Done with the changes.
LGTM when CI is green
@ogrisel just want to check that you are satisfied with the changes
Something is wrong with the indentation and a missing blank line.

done

@ramvikrams regarding the linting errors, are you using …

Yes, I am using …

I went ahead and fixed the formatting -- some lines were too long. Will merge on green CI, thanks!

thanks
…n#25083) * Doc changed n_init to n_jobs in mean_shift.py
* Doc changed n_init to n_jobs in mean_shift.py
Reference Issues/PRs

Fixes #25075

What does this implement/fix? Explain your changes.

Changed the docs from n_init to n_jobs.

Any other comments?