Skip to content

ENH add 'cluster_qr' method to spectral segmentation #21148

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 97 commits into from
Nov 2, 2021
Merged

ENH add 'cluster_qr' method to spectral segmentation #21148

merged 97 commits into from
Nov 2, 2021

Conversation

lobpcg
Copy link
Contributor

@lobpcg lobpcg commented Sep 25, 2021

Reference Issues/PRs

Re-introduces #12316 and closes #12164

What does this implement/fix? Explain your changes.

#12164 adds the new cluster_qr method to 'kmeans' and 'discretize' in spectral clustering

Any other comments?

The actual changes are just a few lines in only 3 core codes, spectral.py, test_spectral.py, and plot_coin_segmentation.py

re-introducing #12316
trailing space fixed
line too long
added "cluster_qr"
added cluster_qr
E128 continuation line under-indented fixed
added PR #21148 info
formatting fixed
title underline fixed
added cluster_qr to plots
@glemaitre
Copy link
Member

The linter is breaking because you need to run black on the file:

  • sklearn/cluster/_spectral.py
  • sklearn/cluster/tests/test_spectral.py

Another way would be to install pre-commit that will run black for you before committing and ensure that you don't forget to do it.

@glemaitre
Copy link
Member

Here are some concerns that have been already raised in the past:

  • the method does not fulfil the inclusion criterion
  • there is currently no unit tests for the function itself
  • we would need to investigate more on a variety of typical spectral clustering problems to see the benefit in terms of various types of performance: time, memory, statistical score

While the two last points need to be fulfilled, I think that we have now scikit-learn-extra that could allow us to bypass the first point. In this case, we could accept a generic callable to cluster the map using any function. However, even in scikit-learn-extra, points 2 and 3 need to be fulfilled.

@rth do you have any thought regarding the inclusion in scikit-learn-extra once the function is properly tested and we investigated the performance?

@lobpcg
Copy link
Contributor Author

lobpcg commented Sep 27, 2021

The linter is breaking because you need to run black on the file:

  • sklearn/cluster/_spectral.py
  • sklearn/cluster/tests/test_spectral.py

Another way would be to install pre-commit that will run black for you before committing and ensure that you don't forget to do it.

If there is no way around it, this looks like a bug in CI framework - breaking the ability of purely web-based PR submission and maintenance, which was available prior to requiring this black check. @glemaitre Let's move this topic to #21169 that I have just created.

@lobpcg
Copy link
Contributor Author

lobpcg commented Sep 27, 2021

Here are some concerns that have been already raised in the past:

  • the method does not fulfil the inclusion criterion

@glemaitre @jnothman @ogrisel The original PR was 3 years ago. Since then, the papers that use the clusterQR (I renamed into cluster_qr) have been cited maybe hundred times. The original original report and the paper alone are cited 25 times https://scholar.google.com/citations?view_op=view_citation&hl=en&user=Kv-Y1qYAAAAJ&citation_for_view=Kv-Y1qYAAAAJ:_FxGoFyzp5QC ; see also
https://www.google.com/search?q=%22Robust+and+efficient+multi-way+spectral+clustering%22

Our own implementation https://doi.org/10.1109/HPEC.2017.8091045 has won MIT/AWS Graph Challenge 2017 Student Innovation Awards https://graphchallenge.mit.edu/champions. 3 years later, it remains a canonical approach, answering @jnothman, most recently implemented and tested in https://ieeexplore.ieee.org/document/9286181 that has won MIT/AWS Graph Challenge 2020 Innovation Award, https://graphchallenge.mit.edu/champions.

What else needs to happed to fulfil the inclusion criterion?

black formatting
black formatting
trying to change the discretize test by itself to also test kmeans by itself and cluster_qr by itself
adding test cluster_qr by itself to the same test with discretize
@rth
Copy link
Member

rth commented Sep 27, 2021

do you have any thought regarding the inclusion in scikit-learn-extra

Happy to include it there, but it looks like it might be hard to integrate while leaving all the rest of spectral_clustering code in scikit-learn. Or at least that would require supporting spectral_clustering(assign_labels=..) as a callable for instance.

Our own implementation [..] it remains a canonical approach

What else needs to happen to fulfill the inclusion criterion?

For context, I think historically there is some reluctance to include algorithms proposed by their authors in scikit-learn because there can be a conflict of interest between including some feature in the best interest of users and increased paper citations as a result of it.

I'm not saying it's an issue here, and the code certainly look straightforward and short. However, IMO the most compelling argument is still, #21148 (comment)

we would need to investigate more on a variety of typical spectral clustering problems to see the benefit in terms of various types of performance: time, memory, statistical score

as in how this would impact the average user.

@lobpcg
Copy link
Contributor Author

lobpcg commented Sep 27, 2021

For context, I think historically there is some reluctance to include algorithms proposed by their authors in scikit-learn because there can be a conflict of interest between including some feature in the best interest of users and increased paper citations as a result of it.

@rth to clarify, I am not an author of the algorithm, but a happy user. In contrast to kmeans, it actually works, which one can observe, e.g., in the included example #12164 (comment)

The authors just provided the MATLAB implementation https://github.com/asdamle/QR-spectral-clustering, The algorithm is so simple that turning it from MATLAB into Python is trivial - getting it into sklearn has turned out to be not :-)

The authors have provided convincing examples in their paper, using their MATLAB code, but I agree that

we would need to investigate more on a variety of typical spectral clustering problems to see the benefit in terms of various types of performance: time, memory, statistical score

so someone will need to write a good benchmark for the present codes. My feeling is that cluster_qr would well outperform the competitors in terms of quality, yet being faster, and probably with a smaller memory footprint.

cluster_qr apparently requires n_class>2, so change the test to start with n_class=3
reverted to working version
@lobpcg
Copy link
Contributor Author

lobpcg commented Sep 28, 2021

For benchmarking, we can follow the example of https://github.com/glemaitre/bench_lobpcg/blob/master/bench_lobpbcg.py and for the data use examples from the stochastic block model http://graphchallenge.mit.edu/data-sets
Low Block Overlap and Low Block Size Variation    | Low level of overlap and low level of size variation between blocks (easiest)
Static Graphs | 1K | 5K | 20K | 50K | 200K | 1M

Please let me know if there are alternative suggestions. If not, someone would probably need to download the files. We could alternatively generate the data on the fly. Please advise and let me know the preference.

lobpcg and others added 5 commits October 25, 2021 21:29
remove comments

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
an author added as suggested
doi + author
added a reference to cluster_qr
lobpcg with amg references added as requested
@lobpcg
Copy link
Contributor Author

lobpcg commented Oct 26, 2021

Here is a first superficial review.

Even though I still need to get in the details of the paper and of the implementation of cluster_qr, to me this looks like an relevant labeling method. 👍

@jjerphan thanks for your first review! I've made all the requested changes. Please let me know if I need to do anything else.

@jjerphan
Copy link
Member

Hi @lobpcg,

I am busy with more urgent tasks but will come back to you when possible.
Feel free to ping me here via mentions or by requesting another request if I am not replying.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some last suggestions and discussions follow-ups before this LGTM.

We are all doing our best to have the project move forward. We prefer taking time to get familiar with new material to prevent errors both in implementations and in users' understanding of methods and their documentation.

Thanks for your patience, @lobpcg. 🙂

lobpcg and others added 4 commits October 29, 2021 13:01
misc editing

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
previous missed suggestions finally committed

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
    assert labels_float64.dtype == np.int64 naturally failed on 32-bit OS so was removed
@lobpcg
Copy link
Contributor Author

lobpcg commented Oct 29, 2021

@jjerphan all proposed changes finally made (sorry for missing your earlier suggestions). All green, should be ready to merge?

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Thank you, @lobpcg!

@lobpcg
Copy link
Contributor Author

lobpcg commented Oct 30, 2021

@ogrisel @jjerphan thanks for reviewing! What should I do next for this PR to be merged?

@glemaitre glemaitre changed the title [MRG] cluster_qr method added to spectral segmentation ENH add 'cluster_qr' method to spectral segmentation Oct 30, 2021
@glemaitre
Copy link
Member

@lobpcg if all the reviews have been addressed then this PR should be merged. I don't want to start a new review so only pinging @ogrisel to be sure that his +1 is still standing (which should be) and he will merge this PR.

@jjerphan
Copy link
Member

jjerphan commented Nov 1, 2021

I would also wait for his approval regarding your latest changes.

@lobpcg
Copy link
Contributor Author

lobpcg commented Nov 1, 2021

@ogrisel could you please have a look, hopefully final?
I want to start another PR and need to clean/delete my old sklearn repo fork before that as far as I can see, so this PR is a blocker for me.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just a typo that I will fix before merging.

@ogrisel ogrisel merged commit f449af8 into scikit-learn:main Nov 2, 2021
@ogrisel
Copy link
Member

ogrisel commented Nov 2, 2021

Merged, thank you very much for the contrib!

@lobpcg
Copy link
Contributor Author

lobpcg commented Nov 2, 2021

I am glad to see this long journey finally completed. Thanks very much everyone involved!

@lobpcg lobpcg deleted the patch-1 branch November 2, 2021 11:38
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Victor Minden <victorminden@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

new feature: add clusterQR method to 'kmeans' and 'discretize' in spectral clustering
6 participants