Skip to content

[MRG] FIX sklearn.cluster.spectral_clustering fails with np.matrix input (#10993) #11070

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 17 commits into from

Conversation

sinanh
Copy link

@sinanh sinanh commented May 5, 2018

Reference Issues/PRs

Fixes #10993

What does this implement/fix? Explain your changes.

It fixes a small bug in spectral_embedding while finding the neighbors when the input array is a np.matrix.

Any other comments?

@jnothman
Copy link
Member

jnothman commented May 6, 2018

Not changing estimator_checks?

@sinanh
Copy link
Author

sinanh commented May 6, 2018

At first I changed estimator_checks, then I noticed that was not necessary. Because after some testing, I noticed that all the clustering algorithms (e.g. KMeans, AgglomerativeClustering, MeanShift, DBSCAN etc.) work with np.matrix smoothly. Thus, I think the mentioned problem in this issue was limited to Spectral Clustering.

@jnothman
Copy link
Member

jnothman commented May 6, 2018 via email

@glemaitre
Copy link
Member

Spectral clustering will call the check_array while spectral_clustering will not. check_array will convert the matrix to array. Is it meaning that we should have a second sanity check in the function themselves instead of only the class.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise this LGTM.

Though we should really have a generic test for consistency between class and function versions of clusterers that makes this check, instead of ad-hoc tests.

@@ -60,6 +60,8 @@ def _graph_connected_component(graph, node_id):
for i in indices:
if sparse.issparse(graph):
neighbors = graph[i].toarray().ravel()
elif isinstance(graph, np.matrix):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just convert and store as graph at the top of the function?

@jnothman
Copy link
Member

jnothman commented May 7, 2018 via email

@amueller
Copy link
Member

I thought we didn't support np.matrix, but I guess if check_array accepts it then it's fine?

@@ -44,6 +44,8 @@ def _graph_connected_component(graph, node_id):
node
"""
n_node = graph.shape[0]
graph = sparse.csc_matrix(graph)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not look right -- it is converted to CSR on the following line.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think you are right.

To be compliant with jnothman's suggestion on May 8th about converting and storing it as a graph at the top, I suggest the following actions:

Change line 48 to: graph = sparse.csr_matrix(graph)
Delete lines 49-51

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused about this. @jnothman?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be compliant with jnothman's suggestion on May 8th about converting and storing it as a graph at the top, I suggest the following actions:

What I said was "Can we just convert and store as graph at the top of the function?"

What I meant was "Can we just convert the matrix to an array and assign it back to the `graph variable at the top of the function?"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not look right -- it is converted to CSR on the following line.

Actually, since it must be a symmetric matrix, its row and column matrix representations are equivalent. Thus it was passing the tests.

@glemaitre glemaitre self-requested a review June 20, 2019 12:38
sinanh and others added 8 commits August 17, 2019 17:21
converting the graph to sparse.csc_matrix was passing the tests, because it must be a symmetric matrix. So its column matrix (csc) and row matrix (csr) are equivalent.
converting the graph to sparse.csc_matrix was passing the tests, because it must be a symmetric matrix. So its column matrix (csc) and row matrix (csr) are equivalent.
@sinanh sinanh force-pushed the fix_issue_10993 branch 3 times, most recently from a508eb5 to d4830e0 Compare August 20, 2019 13:17
@cmarmo
Copy link
Contributor

cmarmo commented Aug 25, 2020

Hi @sinanh , thanks for your patience! Are you still interested in working on this PR? If so do you mind fixing conflicts? Thanks!

@sinanh
Copy link
Author

sinanh commented Aug 26, 2020

Hi @sinanh , thanks for your patience! Are you still interested in working on this PR? If so do you mind fixing conflicts? Thanks!

I fixed the conflicts .

@cmarmo
Copy link
Contributor

cmarmo commented Aug 26, 2020

Thanks @sinanh! @jnothman , @glemaitre maybe you might want to have a look? Thanks!

@@ -44,6 +44,9 @@ def _graph_connected_component(graph, node_id):
node.
"""
n_node = graph.shape[0]
if isinstance(graph, np.matrix):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead, can we just pass np.asarray(graph) to _graph_connected_component, instead of passing graph unchanged?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jnothman OK, I did that. Also I noticed since numpy 1.15, it is advised not to use np.matrix. The following is written in the manual:

"It is no longer recommended to use this class, even for linear algebra. Instead use regular arrays. The class may be removed in the future."

@cmarmo
Copy link
Contributor

cmarmo commented Sep 4, 2020

Hi @sinanh, I guess you closed as you are tired of waiting? I can't do more than relabeling as "waiting for review"... but I wonder if you could reconsider...thanks for your work, anyway.

@sinanh
Copy link
Author

sinanh commented Sep 4, 2020

Hi @sinanh, I guess you closed as you are tired of waiting? I can't do more than relabeling as "waiting for review"... but I wonder if you could reconsider...thanks for your work, anyway.

@cmarmo thanks, the thing is from numpy 1.15 it is advised not to use np.matrix, so this issue is not an issue anymore I think. I guess both the issue and PR can be closed.

@cmarmo
Copy link
Contributor

cmarmo commented Sep 4, 2020

@sinanh, understood: thanks for clarifying. If you are still interested in working on scikit-learn, feel free to pick one among the "help wanted" issues. Thanks for your time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sklearn.cluster.spectral_clustering fails with np.matrix input
6 participants