FIX normalization in semi_supervised label_propagation #31924
Conversation
@snath-xoc @antoinebaker could you have a look here please?

I can take a look, thank you for the ping
```diff
@@ -143,6 +144,25 @@ def test_sparse_input_types(
     assert_array_equal(clf.predict([[0.5, 2.5]]), np.array([1]))


 @pytest.mark.parametrize("constructor", CONSTRUCTOR_TYPES)
 @pytest.mark.parametrize("Estimator, parameters", ESTIMATORS[1:2])
```
Is there a reason this is tested only for `ESTIMATORS[1:2]`? I tried it with the `LabelSpreading` estimators as well and it fails... something to perhaps investigate further (and for now mark as XFAIL), unless there are any insights as to why it's expected to fail?
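For the record, a minimal sketch of the XFAIL route in case we go that way (the test name and reason string are placeholders, and I'm assuming the `LabelSpreading` entries start at index 3 of `ESTIMATORS`):

```python
# hypothetical: keep the LabelPropagation cases as-is and mark the
# LabelSpreading cases as expected failures
SPREADING_XFAIL = [
    pytest.param(
        Estimator,
        parameters,
        marks=pytest.mark.xfail(reason="placeholder: normalization differs"),
    )
    for Estimator, parameters in ESTIMATORS[3:]
]


@pytest.mark.parametrize("Estimator, parameters", ESTIMATORS[:3] + SPREADING_XFAIL)
def test_row_normalization(Estimator, parameters):  # placeholder name
    ...
```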
The first and third `ESTIMATORS` use the `rbf` method, which creates a dense `affinity_matrix`. And the fourth and following `ESTIMATORS` use the `LabelSpreading` class, which constructs a `laplacian_matrix` instead of an `affinity_matrix`, so the normalization is different.

It might be cleaner to inline the `Estimator` and `parameters` here instead of fixtures since we are only testing one case. I agree that it looks strange to use only one of the list of `ESTIMATORS`, but that's all we want to test.
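For context, the `ESTIMATORS` list in `test_label_propagation.py` looks roughly like this (reconstructed from memory; exact parameters may differ):

```python
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.semi_supervised import _label_propagation as label_propagation

# index 0 and 2: dense rbf affinity; index 1: sparse knn affinity;
# index 3 onwards: LabelSpreading, which builds a laplacian instead
ESTIMATORS = [
    (label_propagation.LabelPropagation, {"kernel": "rbf"}),
    (label_propagation.LabelPropagation, {"kernel": "knn", "n_neighbors": 2}),
    (
        label_propagation.LabelPropagation,
        {"kernel": lambda x, y: rbf_kernel(x, y, gamma=20)},
    ),
    (label_propagation.LabelSpreading, {"kernel": "rbf"}),
    (label_propagation.LabelSpreading, {"kernel": "knn", "n_neighbors": 2}),
    (
        label_propagation.LabelSpreading,
        {"kernel": lambda x, y: rbf_kernel(x, y, gamma=20)},
    ),
]
```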
Upon reflection, I think it is worthwhile to test both dense and sparse cases for `LabelPropagation._build_graph`. So I've included the suggestion to use `ESTIMATORS[:2]`.

I haven't added any tests for `LabelSpreading._build_graph` because the rows are not supposed to sum to 1 there. (The result is not normalized beyond the normalization done while computing the laplacian from the affinity matrix.)
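For concreteness, the property being tested boils down to something like this standalone sketch (my own illustration, not the exact test code in the diff; with this PR's fix both cases should pass):

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 2.0]])
y = np.array([0, 1, -1, -1])  # -1 marks unlabeled samples

# ESTIMATORS[:2] ~ LabelPropagation with a dense (rbf) and a sparse (knn) graph
for kernel, extra in [("rbf", {}), ("knn", {"n_neighbors": 2})]:
    clf = LabelPropagation(kernel=kernel, **extra).fit(X, y)
    graph = clf._build_graph()  # dense ndarray for rbf, sparse for knn
    row_sums = np.ravel(np.asarray(graph.sum(axis=1)))
    np.testing.assert_allclose(row_sums, 1.0)  # every row should sum to 1
```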
@pytest.mark.parametrize("Estimator, parameters", ESTIMATORS[1:2]) | |
@pytest.mark.parametrize("Estimator, parameters", ESTIMATORS[:2]) |
```python
if sparse.isspmatrix(affinity_matrix):
    normalizer = np.ravel(normalizer)
    # common case: knn method gives row sum k for all rows
    if np.all(normalizer == normalizer[0]):
```
Suggested change:

```diff
- if np.all(normalizer == normalizer[0]):
+ # but sometimes row sums are not the same
+ # and need to account for that
+ if not np.all(normalizer == normalizer[0]):
```
I feel like this may be cleaner, but feel free to ignore
Just want to make sure I understand correctly: this suggestion leaves `affinity_matrix` unnormalized in the case when all rows sum to the same value. And you are saying it feels cleaner not to divide the whole matrix by that scalar. So I think you are saying that we can get by without normalizing because we normalize in the subsequent iterative process, so we converge to the same value.

Did I get that right? Or did you mean to normalize the rows of the affinity matrix later?
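If I did get it right, here is a tiny numpy check of that intuition (my own illustration, not code from this PR): scaling the transition matrix by a common constant is cancelled by the per-row renormalization done at each propagation step.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.random((4, 4))  # stand-in for the (un)normalized graph
Y = rng.random((4, 2))  # stand-in for the label distributions

def step(T, Y):
    Z = T @ Y
    return Z / Z.sum(axis=1, keepdims=True)  # renormalize each row

# dividing T by a common row sum (a scalar) does not change the result
np.testing.assert_allclose(step(T, Y), step(T / 5.0, Y))
```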
```python
    normalizer = np.ravel(normalizer)
    # common case: knn method gives row sum k for all rows
    if np.all(normalizer == normalizer[0]):
        affinity_matrix.data /= normalizer[0]
```
Suggested change:

```diff
- affinity_matrix.data /= normalizer[0]
```
```python
    # common case: knn method gives row sum k for all rows
    if np.all(normalizer == normalizer[0]):
        affinity_matrix.data /= normalizer[0]
    else:  # row sums not the same
```
Suggested change:

```diff
- else:  # row sums not the same
```
Fixes #31872: strange normalization in semi-supervised label propagation.

The trouble briefly: the normalization in `semi_supervised` should use `axis=1` (row sums). This does not cause errors so long as we have a symmetric `affinity_matrix`.

- The dense case arises for kernel `"rbf"`, which provides symmetric matrices. But if someone provides their own kernel, the normalization could be incorrect.
- The sparse case arises for kernel `"knn"`, which has all rows sum to `k`. But if someone provides their own kernel, the normalization could be incorrect.

This PR adds tests of proper normalization that agrees between sparse and dense. It also adjusts the code so it can work with sparse arrays or sparse matrices.
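For reviewers, the intended behavior for both container types amounts to something like this minimal sketch of consistent row normalization (not the exact diff; assumes nonzero row sums):

```python
import numpy as np
from scipy import sparse

def normalize_rows(affinity_matrix):
    """Divide each row by its sum, for dense and sparse inputs alike."""
    normalizer = np.ravel(np.asarray(affinity_matrix.sum(axis=1)))  # axis=1!
    if sparse.issparse(affinity_matrix):
        # left-multiplying by diag(1 / row_sums) scales each row
        return sparse.diags(1.0 / normalizer) @ affinity_matrix
    return affinity_matrix / normalizer[:, np.newaxis]
```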
The tests check that the normalization agrees between the dense and sparse cases even if the `affinity_matrix` is not symmetric and does not have equal row sums. The errors corrected here do not arise for users who use the built-in scikit-learn kernel options.
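In isolation, the checked property looks like this (a hypothetical example matrix; the real tests go through the estimators):

```python
import numpy as np
from scipy import sparse

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 3.0, 1.0],
              [4.0, 0.0, 1.0]])  # asymmetric, with unequal row sums

dense_result = A / A.sum(axis=1, keepdims=True)

S = sparse.csr_matrix(A)
row_sums = np.ravel(np.asarray(S.sum(axis=1)))
sparse_result = sparse.diags(1.0 / row_sums) @ S

# dense and sparse normalization should agree elementwise
np.testing.assert_allclose(dense_result, sparse_result.toarray())
```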
I discovered this while working on making sure sparse arrays and sparse matrices result in the same values (#31177). This fix is split out of that PR because it corrects/changes the current code and adds a test; separating it from the large number of changes in the other PR is prudent and eases review.