Skip to content

[MRG] Fixes sklearn.metrics.silhouette_samples for sparse matrices #18723

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from

Conversation

sahilgupta2105
Copy link

@sahilgupta2105 sahilgupta2105 commented Nov 1, 2020

Reference Issues/PRs

Fixes #18524

What does this implement/fix? Explain your changes.

The changes update the reduce function used for computing the intra-cluster and inter-cluster distances. The current version is failing at,
a) the pre-computed check for sparse matrices while getting the diagonal elements
b) when trying to index a sparse matrix to pass weights to np.bincount function

Any other comments?

@sahilgupta2105 sahilgupta2105 changed the title [MRG] Fixes sklearn.metrics.silhouette_samples for sparse matrices [WIP] Fixes sklearn.metrics.silhouette_samples for sparse matrices Nov 1, 2020
@sahilgupta2105 sahilgupta2105 changed the title [WIP] Fixes sklearn.metrics.silhouette_samples for sparse matrices [MRG] Fixes sklearn.metrics.silhouette_samples for sparse matrices Nov 1, 2020
@sahilgupta2105 sahilgupta2105 marked this pull request as draft November 1, 2020 19:08
@sahilgupta2105 sahilgupta2105 marked this pull request as ready for review November 1, 2020 19:48
@moi90
Copy link

moi90 commented Nov 13, 2020

A test is missing, which demonstrates the correct functioning.

Sahil Gupta added 2 commits November 14, 2020 08:31
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @sahilgupta2105 !

Comment on lines 140 to 142
sample_weights = (
D_chunk.getrow(i).toarray().squeeze()
if issparse(D_chunk) else D_chunk[i])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would make the sparse matrix dense, row by row. What does the memory usage look like for different number of samples?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't benchmark the memory usage. But, converting the sparse matrix to dense would definitely spike it up for a large number of samples. I have updated the logic to avoid doing this

@@ -183,6 +184,14 @@ def test_silhouette_nonzero_diag(dtype):
with pytest.raises(ValueError, match='contains non-zero'):
silhouette_samples(dists, labels, metric='precomputed')

@pytest.mark.parametrize('store_order', ('row-major', 'col-major'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can parameterize with the callable:

Suggested change
@pytest.mark.parametrize('store_order', ('row-major', 'col-major'))
@pytest.mark.parametrize('to_sparse', (csr_matrix, csc_matrix))

@@ -183,6 +184,14 @@ def test_silhouette_nonzero_diag(dtype):
with pytest.raises(ValueError, match='contains non-zero'):
silhouette_samples(dists, labels, metric='precomputed')

@pytest.mark.parametrize('store_order', ('row-major', 'col-major'))
def test_silhouette_sparse_input(store_order):
# Ensure that silhouette_samples works for sparse matrix inputs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are trying to push for use docstrings instead of comments to document test.

Suggested change
# Ensure that silhouette_samples works for sparse matrix inputs
"""Ensure that silhouette_samples works for sparse matrix inputs"""

@sahilgupta2105
Copy link
Author

@thomasjpfan @moi90 Can you please take another look. I have made the updates based on your comments

Comment on lines +142 to +143
sample_weights = current_chunk.data
sample_labels = np.take(labels, current_chunk.indices)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks better.

I think we would need a test to make sure that sample_weights work as expected. A simple test would be to run silhouette_samples on a sparse dataset and compare it to the result with the same dataset but densified.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasjpfan I have added a test for this

Comment on lines +198 to +208
def test_silhouette_sparse_implementation():
""" Ensure implementation for sparse matrix works correctly"""
X = np.array([[0, 0], [1, 0], [10, 10], [10, 11]], dtype=np.float32)
y = np.array([1, 1, 1, 0])
pdist = pairwise_distances(X)
sX = csr_matrix(pdist)
sparse_out = silhouette_samples(sX, y, metric="precomputed")
dense_out = silhouette_samples(pdist, y, metric="precomputed")

for out in zip(sparse_out, dense_out):
assert out[0] == out[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To test a sparse pdist that have some non-diagonal zeros.

from numpy.testing import assert_allclose

def test_silhouette_sparse_implementation():
    """ Ensure implementation for sparse matrix works correctly"""
    X = np.array([[0.2, 0.1, 0.1, 0.2, 0.1, 1.6, 0.2, 0.1]],
                 dtype=np.float32).T
    y = [0, 0, 0, 0, 1, 1, 1, 1]
    pdist_dense = pairwise_distances(X)
    pdist_sparse = csr_matrix(pdist_dense)
    sparse_out = silhouette_samples(pdist_sparse, y, metric="precomputed")
    dense_out = silhouette_samples(pdist_dense, y, metric="precomputed")
    assert_allclose(sparse_out, dense_out)

Base automatically changed from master to main January 22, 2021 10:53
@awinml
Copy link
Contributor

awinml commented Oct 16, 2022

@sahilgupta2105 As this PR has been marked as stalled, I would like to take over and make the suggested changes.

@awinml
Copy link
Contributor

awinml commented Oct 16, 2022

/take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:metrics Superseded PR has been replace by a newer PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sklearn.metrics.silhouette_samples does not work with sparse matrices
5 participants