DOC add more details for `n_jobs` in MeanShift docstring #25083

ramvikrams · 2022-11-30T22:36:10Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Changed the docs from n_init to n_jobs.

Any other comments?

glemaitre

As mentioned by @jeremiedbb in the original issue, we need to give details regarding which parts of the algorithm benefit from parallelization.

glemaitre · 2022-12-01T10:45:03Z

sklearn/cluster/_mean_shift.py

        The number of jobs to use for the computation. This works by computing
-        each of the n_init runs in parallel.
+        each of the n_jobs runs in parallel.


The number of jobs to use for the computation. The following tasks benefit from the parallelization: the search of nearest neighbors for bandwidth estimation and label assignments, and the hill climbing optimization for all seeds.
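For context, here is a minimal usage sketch of where `n_jobs` is passed for the steps named above. The toy data and the `quantile` value are assumptions made for illustration, and which steps actually run in parallel depends on the scikit-learn version, as discussed later in this thread.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Toy data (an assumption for illustration): two well-separated 2D blobs.
rng = np.random.RandomState(0)
X = np.vstack(
    [
        rng.normal(0.0, 0.3, size=(50, 2)),
        rng.normal(5.0, 0.3, size=(50, 2)),
    ]
)

# Bandwidth estimation relies on a nearest-neighbors search and accepts n_jobs.
bandwidth = estimate_bandwidth(X, quantile=0.3, n_jobs=2)

# MeanShift forwards n_jobs to the hill-climbing step run for each seed and to
# the nearest-neighbors search used for label assignment.
labels_parallel = MeanShift(bandwidth=bandwidth, n_jobs=2).fit(X).labels_
labels_serial = MeanShift(bandwidth=bandwidth, n_jobs=1).fit(X).labels_
```

Changing `n_jobs` is only expected to affect run time, not the resulting labels.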

glemaitre · 2022-12-01T10:45:10Z

sklearn/cluster/_mean_shift.py

        The number of jobs to use for the computation. This works by computing
-        each of the n_init runs in parallel.
+        each of the n_jobs runs in parallel.


The number of jobs to use for the computation. The following tasks benefit from the parallelization: the search of nearest neighbors for bandwidth estimation and label assignments, and the hill climbing optimization for all seeds.

glemaitre · 2022-12-01T11:25:13Z

sklearn/cluster/_mean_shift.py

@@ -177,7 +177,10 @@ def mean_shift(

    n_jobs : int, default=None
        The number of jobs to use for the computation. This works by computing


remove the sentence

This works by computing each of the n_jobs in parallel.

glemaitre · 2022-12-01T12:56:15Z

sklearn/cluster/_mean_shift.py

+        The number of jobs to use for the computation. The number of jobs to use for
+        the computation. The following tasks benefit from the parallelization: the
+        search of nearest neighbors for bandwidth estimation and label assignments,
+        and the hill climbing optimization for all seeds.


Suggested change

-        The number of jobs to use for the computation. The number of jobs to use for
-        the computation. The following tasks benefit from the parallelization: the
-        search of nearest neighbors for bandwidth estimation and label assignments,
-        and the hill climbing optimization for all seeds.
+        The number of jobs to use for the computation. The following tasks benefit
+        from the parallelization: the search of nearest neighbors for bandwidth
+        estimation and label assignments, and the hill climbing optimization for all
+        seeds.

glemaitre · 2022-12-01T12:56:39Z

sklearn/cluster/_mean_shift.py

+        The number of jobs to use for the computation. The number of jobs to use for
+        the computation. The following tasks benefit from the parallelization: the
+        search of nearest neighbors for bandwidth estimation and label assignments,
+        and the hill climbing optimization for all seeds.


Suggested change

-        The number of jobs to use for the computation. The number of jobs to use for
-        the computation. The following tasks benefit from the parallelization: the
-        search of nearest neighbors for bandwidth estimation and label assignments,
-        and the hill climbing optimization for all seeds.
+        The number of jobs to use for the computation. The following tasks benefit
+        from the parallelization: the search of nearest neighbors for bandwidth
+        estimation and label assignments, and the hill climbing optimization for all
+        seeds.

Micky774

Left some suggestions for wording, let me know what you think

Micky774 · 2022-12-01T23:37:04Z

sklearn/cluster/_mean_shift.py

+        The number of jobs to use for the computation. The following tasks benefit
+        from the parallelization: the search of nearest neighbors for bandwidth
+        estimation and label assignments, and the hill climbing optimization
+        for all seeds.


Perhaps this would improve readability, and provide a means of further reading.

Suggested change

-        The number of jobs to use for the computation. The following tasks benefit
-        from the parallelization: the search of nearest neighbors for bandwidth
-        estimation and label assignments, and the hill climbing optimization
-        for all seeds.
+        The number of jobs to use for the computation. The following tasks benefit
+        from the parallelization:
+        - The search of nearest neighbors for bandwidth estimation and label assignments
+        - Hill-climbing optimization for all seeds
+        See :term:`Glossary <n_jobs>` for more details.

I would also like your opinion on this. We can add it, and it looks good, but this paragraph will stand out because it is a bullet list while the parameter descriptions around it are regular paragraphs. If we are fine with how the doc looks, I think we can make the change.

Actually, with the refactoring of the Cython code underlying the NearestNeighbors, n_jobs will soon have no effect on that step: the n_jobs parameter in NearestNeighbors is only used for very specific cases (e.g. distance metric passed as Python callable, which is never the case for neighbors computed for MeanShift).

I'll let @jjerphan confirm the above.

Thanks for the heads-up!

n_jobs is only used when the following conditions are met:

- _fit_method=="brute", i.e. when algorithm="brute", or when algorithm="auto" and the heuristic chooses "brute" (i.e. when metric="precomputed" or n_features > 15 or n_neighbors >= n_samples_train)

- metric is a callable or is not in the list returned by BaseDistancesReductionDispatcher.valid_metrics():

    In [1]: from sklearn.metrics._pairwise_distances_reduction import BaseDistancesReductionDispatcher

    In [2]: BaseDistancesReductionDispatcher.valid_metrics()
    Out[2]: ['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'euclidean', 'haversine', 'infinity', 'l1', 'l2', 'manhattan', 'minkowski', 'p', 'seuclidean', 'sqeuclidean', 'wminkowski']

- the new back-ends are deactivated via the enable_cython_pairwise_dist parameter of the scikit-learn config (they are activated by default)

For a comprehensive overview, see: #24997 (comment)

My plan is to support all the cases as much as possible in the new back-ends, but we won't be able to support them all (e.g. Python callables aren't supportable IMO).

Note that some user-facing estimator APIs probably do not propagate n_jobs in most cases, because they may implicitly end up in a configuration where metric="(sq)euclidean" and algorithm="brute".
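To make the first condition in the comment above easier to follow, here is a minimal sketch that restates it as plain Python. The function name `resolves_to_brute` and its signature are hypothetical, and the logic mirrors only the wording of the comment; it is not scikit-learn's actual code, which may differ between versions.

```python
def resolves_to_brute(algorithm, metric, n_features, n_neighbors, n_samples_train):
    """Return True when the fit method would be "brute" per the comment above.

    Hypothetical helper: restates the quoted condition, not library code.
    """
    if algorithm == "brute":
        return True
    if algorithm == "auto":
        # Heuristic as worded in the comment: "auto" falls back to "brute" for
        # precomputed distances, more than 15 features, or when at least as
        # many neighbors as training samples are requested.
        return (
            metric == "precomputed"
            or n_features > 15
            or n_neighbors >= n_samples_train
        )
    # Any other explicitly requested algorithm (e.g. a tree) is not "brute".
    return False


# Low-dimensional data with a few neighbors: "auto" does not pick "brute".
low_dim = resolves_to_brute("auto", "euclidean", 2, 5, 100)
# More than 15 features: "auto" picks "brute".
high_dim = resolves_to_brute("auto", "euclidean", 20, 5, 100)
```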

Hum it's true that it can be used for non-brute algorithms for small n_features. Maybe we can say:

The search of nearest neighbors for bandwidth estimation and label assignments. See the details in the docstring of the NearestNeighbors class.

to keep things simple in the MeanShift class.

Yes I'll add it

Micky774 · 2022-12-01T23:37:19Z

sklearn/cluster/_mean_shift.py

+        The number of jobs to use for the computation. following tasks benefit
+        from the parallelization: the search of nearest neighbors for bandwidth
+        estimation and label assignments, and the hill climbing optimization
+        for all seeds.


Ditto

Suggested change

-        The number of jobs to use for the computation. following tasks benefit
-        from the parallelization: the search of nearest neighbors for bandwidth
-        estimation and label assignments, and the hill climbing optimization
-        for all seeds.
+        The number of jobs to use for the computation. The following tasks benefit
+        from the parallelization:
+        - The search of nearest neighbors for bandwidth estimation and label assignments
+        - Hill-climbing optimization for all seeds
+        See :term:`Glossary <n_jobs>` for more details.

Yes I'll do it

glemaitre · 2022-12-02T10:38:23Z

I am fine with the proposal of @Micky774. We only need to make sure that it renders properly.

ogrisel · 2022-12-02T13:08:12Z

sklearn/cluster/_mean_shift.py

-        The number of jobs to use for the computation. This works by computing
-        each of the n_init runs in parallel.
+        The number of jobs to use for the computation. following tasks benefit
+        from the parallelization: the search of nearest neighbors for bandwidth


Same comment about the impacts of n_jobs for the nearest neighbors computation step.

ramvikrams · 2022-12-02T17:02:35Z

Done with the changes.

Micky774

LGTM when CI is green

@ogrisel just want to check that you are satisfied with the changes

glemaitre · 2022-12-05T10:41:45Z

Something is wrong with the indentation and missing blank line.

/home/runner/work/scikit-learn/scikit-learn/sklearn/cluster/_mean_shift.py:docstring of sklearn.cluster._mean_shift.MeanShift:46: WARNING: Bullet list ends without a blank line; unexpected unindent.
/home/runner/work/scikit-learn/scikit-learn/sklearn/cluster/_mean_shift.py:docstring of sklearn.cluster._mean_shift.mean_shift:45: WARNING: Bullet list ends without a blank line; unexpected unindent.
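For reference, here is a sketch of one docstring layout that should avoid this warning. It assumes the cause is a missing blank line around the bullet list or a wrapped bullet whose continuation line is not aligned with the bullet text. The function `mean_shift_sketch` is a hypothetical stand-in used only to carry the docstring, and the wording combines the suggestions made in this thread.

```python
import inspect


def mean_shift_sketch(X, n_jobs=None):
    """Hypothetical stand-in that only carries the docstring layout.

    Parameters
    ----------
    X : array-like
        Input data (unused in this sketch).

    n_jobs : int, default=None
        The number of jobs to use for the computation. The following tasks benefit
        from the parallelization:

        - The search of nearest neighbors for bandwidth estimation and label
          assignments. See the details in the docstring of the
          ``NearestNeighbors`` class.
        - Hill-climbing optimization for all seeds.

        See :term:`Glossary <n_jobs>` for more details.
    """
    return None


# Dedented docstring lines, handy for checking the blank lines around the list.
doc_lines = inspect.getdoc(mean_shift_sketch).splitlines()
```

The blank line before the first bullet and after the last one, plus the two-space indent on wrapped bullet lines, are the details the reStructuredText parser is sensitive to.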

ramvikrams · 2022-12-05T12:06:23Z

done

Micky774 · 2022-12-06T17:11:54Z

@ramvikrams regarding the linting errors, are you using black version 22.3.0 to format? If not, that may make things easier. Furthermore, setting up pre-commit as shown in the docs (step 9) can also help ensure that there are no linting errors in the commits you push.

ramvikrams · 2022-12-06T17:18:51Z

Yes, I am using black. When I run black on the file, it finishes successfully.

Micky774 · 2022-12-06T17:38:23Z

I went ahead and fixed the formatting -- some lines were too long. Will merge on green CI, thanks!

ramvikrams · 2022-12-06T17:44:32Z

thanks

Doc changed n_init to n_jobs in mean_shift.py

acde2d1

github-actions bot added the module:cluster label Nov 30, 2022

glemaitre reviewed Dec 1, 2022

glemaitre changed the title ~~Doc changed n_init to n_jobs in mean_shift.py~~ DOC add more details for n_jobs in MeanShift docstring Dec 1, 2022

github-actions bot added the Documentation label Dec 1, 2022

Update _mean_shift.py

1f2e4f6

glemaitre approved these changes Dec 1, 2022

glemaitre reviewed Dec 1, 2022

Update _mean_shift.py

65a24e4

glemaitre reviewed Dec 1, 2022

Update _mean_shift.py

756afee

Micky774 reviewed Dec 1, 2022

ogrisel reviewed Dec 2, 2022

Update _mean_shift.py

3ec8243

Micky774 approved these changes Dec 2, 2022

trimmed the trailing whitespaces

54f5739

ramvikrams added 2 commits December 5, 2022 16:55

Update _mean_shift.py

cff2c99

Merge branch 'main' into t1

147119b

Micky774 and others added 2 commits December 6, 2022 12:36

Shortened problematic lines

a66120a

Merge branch 'main' into t1

dbd08dd

Micky774 enabled auto-merge (squash) December 6, 2022 17:39

Micky774 merged commit 9a98487 into scikit-learn:main Dec 6, 2022

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 21, 2022

DOC add more details for n_jobs in MeanShift docstring (scikit-lear…

6d4971e

…n#25083) * Doc changed n_init to n_jobs in mean_shift.py

glemaitre pushed a commit that referenced this pull request Dec 21, 2022

DOC add more details for n_jobs in MeanShift docstring (#25083)

77e441a

* Doc changed n_init to n_jobs in mean_shift.py

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023

DOC add more details for n_jobs in MeanShift docstring (scikit-lear…

6e5c602

…n#25083) * Doc changed n_init to n_jobs in mean_shift.py

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023

DOC add more details for n_jobs in MeanShift docstring (scikit-lear…

bf5014b

…n#25083) * Doc changed n_init to n_jobs in mean_shift.py
