-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] Fixes sklearn.metrics.silhouette_samples for sparse matrices #18723
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
A test is missing, which demonstrates the correct functioning. |
…into feature/18524 Merge changes from remote
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR @sahilgupta2105 !
sample_weights = ( | ||
D_chunk.getrow(i).toarray().squeeze() | ||
if issparse(D_chunk) else D_chunk[i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would make the sparse matrix dense, row by row. What does the memory usage look like for different number of samples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't benchmark the memory usage. But, converting the sparse matrix to dense would definitely spike it up for a large number of samples. I have updated the logic to avoid doing this
@@ -183,6 +184,14 @@ def test_silhouette_nonzero_diag(dtype): | |||
with pytest.raises(ValueError, match='contains non-zero'): | |||
silhouette_samples(dists, labels, metric='precomputed') | |||
|
|||
@pytest.mark.parametrize('store_order', ('row-major', 'col-major')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can parameterize with the callable:
@pytest.mark.parametrize('store_order', ('row-major', 'col-major')) | |
@pytest.mark.parametrize('to_sparse', (csr_matrix, csc_matrix)) |
@@ -183,6 +184,14 @@ def test_silhouette_nonzero_diag(dtype): | |||
with pytest.raises(ValueError, match='contains non-zero'): | |||
silhouette_samples(dists, labels, metric='precomputed') | |||
|
|||
@pytest.mark.parametrize('store_order', ('row-major', 'col-major')) | |||
def test_silhouette_sparse_input(store_order): | |||
# Ensure that silhouette_samples works for sparse matrix inputs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are trying to push for use docstrings instead of comments to document test.
# Ensure that silhouette_samples works for sparse matrix inputs | |
"""Ensure that silhouette_samples works for sparse matrix inputs""" |
@thomasjpfan @moi90 Can you please take another look. I have made the updates based on your comments |
sample_weights = current_chunk.data | ||
sample_labels = np.take(labels, current_chunk.indices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks better.
I think we would need a test to make sure that sample_weights
work as expected. A simple test would be to run silhouette_samples
on a sparse dataset and compare it to the result with the same dataset but densified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thomasjpfan I have added a test for this
def test_silhouette_sparse_implementation(): | ||
""" Ensure implementation for sparse matrix works correctly""" | ||
X = np.array([[0, 0], [1, 0], [10, 10], [10, 11]], dtype=np.float32) | ||
y = np.array([1, 1, 1, 0]) | ||
pdist = pairwise_distances(X) | ||
sX = csr_matrix(pdist) | ||
sparse_out = silhouette_samples(sX, y, metric="precomputed") | ||
dense_out = silhouette_samples(pdist, y, metric="precomputed") | ||
|
||
for out in zip(sparse_out, dense_out): | ||
assert out[0] == out[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To test a sparse pdist that have some non-diagonal zeros.
from numpy.testing import assert_allclose
def test_silhouette_sparse_implementation():
""" Ensure implementation for sparse matrix works correctly"""
X = np.array([[0.2, 0.1, 0.1, 0.2, 0.1, 1.6, 0.2, 0.1]],
dtype=np.float32).T
y = [0, 0, 0, 0, 1, 1, 1, 1]
pdist_dense = pairwise_distances(X)
pdist_sparse = csr_matrix(pdist_dense)
sparse_out = silhouette_samples(pdist_sparse, y, metric="precomputed")
dense_out = silhouette_samples(pdist_dense, y, metric="precomputed")
assert_allclose(sparse_out, dense_out)
@sahilgupta2105 As this PR has been marked as stalled, I would like to take over and make the suggested changes. |
/take |
Reference Issues/PRs
Fixes #18524
What does this implement/fix? Explain your changes.
The changes update the reduce function used for computing the intra-cluster and inter-cluster distances. The current version is failing at,
a) the pre-computed check for sparse matrices while getting the diagonal elements
b) when trying to index a sparse matrix to pass weights to np.bincount function
Any other comments?