Skip to content

DOC add more details for n_jobs in MeanShift docstring #25083

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Dec 6, 2022

Conversation

ramvikrams
Copy link
Contributor

@ramvikrams ramvikrams commented Nov 30, 2022

Reference Issues/PRs

Fixes #25075

What does this implement/fix? Explain your changes.

changed the docs form n_init to n_jobs

Any other comments?

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned by @jeremiedbb in the original issue, we need to give details regarding which parts of the algorithm benefit from parallelization.

Comment on lines 179 to 180
The number of jobs to use for the computation. This works by computing
each of the n_init runs in parallel.
each of the n_jobs runs in parallel.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        The number of jobs to use for the computation. The following tasks benefit
        from the parallelization: the search of nearest neighbors for bandwidth
        estimation and label assignments, and the hill climbing optimization for all
        seeds.

Comment on lines 315 to 316
The number of jobs to use for the computation. This works by computing
each of the n_init runs in parallel.
each of the n_jobs runs in parallel.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        The number of jobs to use for the computation. The following tasks benefit
        from the parallelization: the search of nearest neighbors for bandwidth
        estimation and label assignments, and the hill climbing optimization for all
        seeds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@glemaitre glemaitre changed the title Doc changed n_init to n_jobs in mean_shift.py DOC add more details for n_jobs in MeanShift docstring Dec 1, 2022
@@ -177,7 +177,10 @@ def mean_shift(

n_jobs : int, default=None
The number of jobs to use for the computation. This works by computing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the sentence

This works by computing each of the n_jobs in parallel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 179 to 182
The number of jobs to use for the computation. The number of jobs to use for
the computation. The following tasks benefit from the parallelization: the
search of nearest neighbors for bandwidth estimation and label assignments,
and the hill climbing optimization for all seeds.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The number of jobs to use for the computation. The number of jobs to use for
the computation. The following tasks benefit from the parallelization: the
search of nearest neighbors for bandwidth estimation and label assignments,
and the hill climbing optimization for all seeds.
The number of jobs to use for the computation. The following tasks benefit
from the parallelization: the search of nearest neighbors for bandwidth
estimation and label assignments, and the hill climbing optimization for all
seeds.

Copy link
Contributor Author

@ramvikrams ramvikrams Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 317 to 320
The number of jobs to use for the computation. The number of jobs to use for
the computation. The following tasks benefit from the parallelization: the
search of nearest neighbors for bandwidth estimation and label assignments,
and the hill climbing optimization for all seeds.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The number of jobs to use for the computation. The number of jobs to use for
the computation. The following tasks benefit from the parallelization: the
search of nearest neighbors for bandwidth estimation and label assignments,
and the hill climbing optimization for all seeds.
The number of jobs to use for the computation. The following tasks benefit
from the parallelization: the search of nearest neighbors for bandwidth
estimation and label assignments, and the hill climbing optimization for all
seeds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@Micky774 Micky774 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some suggestions for wording, let me know what you think

Comment on lines 179 to 182
The number of jobs to use for the computation. The following tasks benefit
from the parallelization: the search of nearest neighbors for bandwidth
estimation and label assignments, and the hill climbing optimization
for all seeds.
Copy link
Contributor

@Micky774 Micky774 Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this would improve readability, and provide a means of further reading.

Suggested change
The number of jobs to use for the computation. The following tasks benefit
from the parallelization: the search of nearest neighbors for bandwidth
estimation and label assignments, and the hill climbing optimization
for all seeds.
The number of jobs to use for the computation. The following tasks benefit
from the parallelization:
- The search of nearest neighbors for bandwidth estimation and label assignments
- Hill-climbing optimization for all seeds
See :term:`Glossary <n_jobs>` for more details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also want your suggestion once, we can add it it looks good but this para will stand out because it is in points and para near it are like the real paragraphs. So if we don't have a problem with the look of the doc I think we can change it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, with the refactoring of the Cython code underlying the NearestNeighbors, n_jobs will soon have no effect on that step: the n_jobs parameter in NearestNeighbors is only used for very specific cases (e.g. distance metric passed as Python callable, which is never the case for neighbors computed for MeanShift).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I let @jjerphan confirm the above.

Copy link
Member

@jjerphan jjerphan Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads-up!

n_jobs is only used when the following conditions are met:

In [1]: from sklearn.metrics._pairwise_distances_reduction import BaseDistancesReductionDispatcher
 
In [2]: BaseDistancesReductionDispatcher.valid_metrics()
Out[2]: 
['braycurtis',
 'canberra',
 'chebyshev',
 'cityblock',
 'euclidean',
 'haversine',
 'infinity',
 'l1',
 'l2',
 'manhattan',
 'minkowski',
 'p',
 'seuclidean',
 'sqeuclidean',
 'wminkowski']

For a comprehensive apprehension, see: #24997 (comment)

My plan is to support all the cases as much as possible in the new back-ends, but we won't be able to support them all (e.g. Python callables aren't supportable IMO).

Note that some user-facing estimators API probably aren't propagating n_jobs in most cases because for them we might implicitly and ultimately be in configuration where metric="(sq)euclidean" and algorithm="brute".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum it's true that it can be used for non-brute algorithms for small n_features. Maybe we can say:

  • The search of nearest neighbors for bandwidth estimation and label assignments. See the details in the docstring of the NearestNeighbors class.

to keep things simple in the MeanShift class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I'll add it

Comment on lines 317 to 320
The number of jobs to use for the computation. following tasks benefit
from the parallelization: the search of nearest neighbors for bandwidth
estimation and label assignments, and the hill climbing optimization
for all seeds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

Suggested change
The number of jobs to use for the computation. following tasks benefit
from the parallelization: the search of nearest neighbors for bandwidth
estimation and label assignments, and the hill climbing optimization
for all seeds.
The number of jobs to use for the computation. The following tasks benefit
from the parallelization:
- The search of nearest neighbors for bandwidth estimation and label assignments
- Hill-climbing optimization for all seeds
See :term:`Glossary <n_jobs>` for more details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I'll do it

@glemaitre
Copy link
Member

I am fine with the proposal of @Micky774. We only need to make sure that it renders properly.

The number of jobs to use for the computation. This works by computing
each of the n_init runs in parallel.
The number of jobs to use for the computation. following tasks benefit
from the parallelization: the search of nearest neighbors for bandwidth
Copy link
Member

@ogrisel ogrisel Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about the impacts of n_jobs for the nearest neighbors computation step.

@ramvikrams
Copy link
Contributor Author

ramvikrams commented Dec 2, 2022

Done with the changes.

Copy link
Contributor

@Micky774 Micky774 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM when CI is green

@ogrisel just want to check that you are satisfied with the changes

@glemaitre
Copy link
Member

glemaitre commented Dec 5, 2022

Something is wrong with the indentation and missing blank line.

/home/runner/work/scikit-learn/scikit-learn/sklearn/cluster/_mean_shift.py:docstring of sklearn.cluster._mean_shift.MeanShift:46: WARNING: Bullet list ends without a blank line; unexpected unindent.
/home/runner/work/scikit-learn/scikit-learn/sklearn/cluster/_mean_shift.py:docstring of sklearn.cluster._mean_shift.mean_shift:45: WARNING: Bullet list ends without a blank line; unexpected unindent.

@ramvikrams
Copy link
Contributor Author

Something is wrong with the indentation and missing blank line.

/home/runner/work/scikit-learn/scikit-learn/sklearn/cluster/_mean_shift.py:docstring of sklearn.cluster._mean_shift.MeanShift:46: WARNING: Bullet list ends without a blank line; unexpected unindent.
/home/runner/work/scikit-learn/scikit-learn/sklearn/cluster/_mean_shift.py:docstring of sklearn.cluster._mean_shift.mean_shift:45: WARNING: Bullet list ends without a blank line; unexpected unindent.

done

@Micky774
Copy link
Contributor

Micky774 commented Dec 6, 2022

@ramvikrams regarding the linting errors, are you using black version 22.3.0 to format? If not, that may make things easier. Furthermore, setting up pre-commit as shown in the docs (step 9) can also help ensure that there are no linting errors in the commits you push.

@ramvikrams
Copy link
Contributor Author

ramvikrams commented Dec 6, 2022

yes i am using black when I run black on the file it finishes off successfuly

@Micky774
Copy link
Contributor

Micky774 commented Dec 6, 2022

I went ahead and fixed the formatting -- some lines were too long. Will merge on green CI, thanks!

@Micky774 Micky774 enabled auto-merge (squash) December 6, 2022 17:39
@ramvikrams
Copy link
Contributor Author

I went ahead and fixed the formatting -- some lines were too long. Will merge on green CI, thanks!

thanks

@Micky774 Micky774 merged commit 9a98487 into scikit-learn:main Dec 6, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 21, 2022
glemaitre pushed a commit that referenced this pull request Dec 21, 2022
* Doc changed n_init to n_jobs in mean_shift.py
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong description for the n_jobs in mean_shift docs
5 participants