FIX normalization in semi_supervised label_propagation #31924


Open
wants to merge 3 commits into
base: main

Conversation

dschult
Contributor

@dschult dschult commented Aug 11, 2025

Fixes #31872: strange normalization in semi-supervised label propagation

The trouble, briefly:

  • In the dense affinity_matrix case, the current code sums along axis=0 and then divides the rows by these sums. Other normalizations in semi_supervised use axis=1. This causes no errors so long as the affinity_matrix is symmetric. The dense case arises for the "rbf" kernel, which produces symmetric matrices, but if someone provides their own kernel the normalization can be incorrect.
  • In the sparse affinity_matrix case, the current code divides all rows by the sum of the first row. This causes no errors so long as all row sums are equal. The sparse case arises for the "knn" kernel, where every row sums to k, but again a user-provided kernel can make the normalization incorrect.
  • The normalization differs between the dense and sparse cases, which could confuse someone writing their own kernel (see the sketch just after this list).
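For concreteness, here is a minimal sketch (not the PR's code) of the consistent row-wise, axis=1 normalization that dense and sparse inputs should both receive:

```python
import numpy as np
from scipy import sparse

# An asymmetric matrix with unequal row sums: exactly the case where the
# old dense (axis=0) and sparse (first-row-sum) shortcuts go wrong.
A = np.array([[1.0, 3.0],
              [2.0, 4.0]])

# Dense: divide each row by its own sum (axis=1).
A_dense = A / A.sum(axis=1, keepdims=True)

# Sparse: the same normalization without densifying (CSR layout assumed).
S = sparse.csr_matrix(A)
normalizer = np.ravel(S.sum(axis=1))
S.data /= np.repeat(normalizer, np.diff(S.indptr))

assert np.allclose(S.toarray(), A_dense)  # dense and sparse now agree
```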

This PR adds tests of proper normalization that agrees between the sparse and dense cases.
It also adjusts the code so it can work with sparse arrays or sparse matrices.
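On the sparse-array point: scipy's legacy spmatrix classes and its newer sparse-array containers answer isspmatrix differently, which is why the normalization code has to be written against both. A quick illustration (assuming a scipy version that provides csr_array):

```python
from scipy import sparse

m = sparse.csr_matrix([[1.0, 0.0]])  # legacy sparse matrix
a = sparse.csr_array([[1.0, 0.0]])   # newer sparse array

assert sparse.isspmatrix(m) and not sparse.isspmatrix(a)
assert sparse.issparse(m) and sparse.issparse(a)  # issparse covers both
```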

The tests check that normalization agrees between the dense and sparse cases even if the affinity_matrix is not symmetric and does not have equal row sums. The errors corrected here do not affect users of the built-in sklearn kernel options.
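As a rough sketch of the kind of check the new tests make (the kernel below is hypothetical, and _build_graph is a private helper, so treat this as illustrative rather than the PR's actual test):

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.RandomState(0)
X = rng.rand(6, 2)
y = np.array([0, 1, -1, -1, -1, -1])  # -1 marks unlabeled samples

def lopsided_kernel(X, Y=None):
    # Hypothetical kernel: asymmetric with unequal row sums, so neither
    # of the old shortcuts (symmetry, constant row sums) applies.
    if Y is None:
        Y = X
    return X @ Y.T + np.arange(1, X.shape[0] + 1)[:, np.newaxis]

clf = LabelPropagation(kernel=lopsided_kernel, max_iter=5).fit(X, y)
graph = clf._build_graph()
# With the corrected normalization, every row of the graph sums to 1.
assert np.allclose(np.asarray(graph.sum(axis=1)).ravel(), 1.0)
```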

I discovered this while making sure sparse arrays and sparse matrices produce the same values (#31177). This PR splits the fix out of that one because it corrects/changes the current code and adds a test; separating it from the large number of changes in the other PR is prudent and eases review.


github-actions bot commented Aug 11, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e21c436. Link to the linter CI: here

@dschult dschult changed the title from "Sparse normalizer" to "MAINT: Fix normalization in semi_supervised label_propagation" on Aug 11, 2025
@dschult dschult changed the title from "MAINT: Fix normalization in semi_supervised label_propagation" to "FIX normalization in semi_supervised label_propagation" on Aug 11, 2025
@adrinjalali
Member

@snath-xoc @antoinebaker could you have a look here please?

@snath-xoc
Contributor

I can take a look, thank you for the ping ☺️

@@ -143,6 +144,25 @@ def test_sparse_input_types(
    assert_array_equal(clf.predict([[0.5, 2.5]]), np.array([1]))


@pytest.mark.parametrize("constructor", CONSTRUCTOR_TYPES)
@pytest.mark.parametrize("Estimator, parameters", ESTIMATORS[1:2])
Contributor

@snath-xoc snath-xoc Aug 15, 2025

Is there a reason this is tested only for ESTIMATORS[1:2]? I tried it with the LabelSpreading estimators as well and it fails... something to perhaps investigate further (and for now mark as XFAIL), unless there are insights as to why it is expected to fail?

Contributor Author

The first and third ESTIMATORS use the rbf method, which creates a dense affinity_matrix.
The fourth and following ESTIMATORS use the LabelSpreading class, which constructs a laplacian_matrix instead of an affinity_matrix, so the normalization is different.

It might be cleaner to inline the Estimator and parameters here instead of using fixtures, since we are only testing one case. I agree that it looks strange to use only one of the list of Estimators, but that's all we want to test.

Contributor Author

Upon reflection, I think it is worthwhile to test both dense and sparse cases for LabelPropagation._build_graph. So I've included the suggestion to use ESTIMATORS[:2].

I haven't added any tests for LabelSpreading._build_graph because the rows are not supposed to sum to 1 there. [The result there is not normalized beyond the normalization done while computing the laplacian from the affinity matrix].
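For reference, LabelSpreading's graph is (roughly) the symmetrically normalized affinity D^{-1/2} W D^{-1/2}, whose rows generally do not sum to 1; a toy sketch:

```python
import numpy as np

W = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
S = d_inv_sqrt[:, np.newaxis] * W * d_inv_sqrt[np.newaxis, :]
print(S.sum(axis=1))  # rows do not sum to 1, unlike LabelPropagation
```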

Contributor

Suggested change
- @pytest.mark.parametrize("Estimator, parameters", ESTIMATORS[1:2])
+ @pytest.mark.parametrize("Estimator, parameters", ESTIMATORS[:2])

if sparse.isspmatrix(affinity_matrix):
    normalizer = np.ravel(normalizer)
    # common case: knn method gives row sum k for all rows
    if np.all(normalizer == normalizer[0]):
Contributor

@snath-xoc snath-xoc Aug 15, 2025

Suggested change
- if np.all(normalizer == normalizer[0]):
+ # but sometimes row sums are not the same
+ # and need to account for that
+ if not np.all(normalizer == normalizer[0]):

Contributor

@snath-xoc snath-xoc Aug 15, 2025

I feel like this may be cleaner, but feel free to ignore

Contributor Author

Just want to make sure I understand correctly:
This suggestion leaves affinity_matrix unnormalized when all rows sum to the same value. And you are saying it feels cleaner not to divide the whole matrix by that scalar. So I think you are saying we can get by without normalizing because we normalize during the subsequent iterative process, and we converge to the same value.

Did I get that right? Or did you mean to normalize the rows of the affinity matrix later?
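If that is the reasoning, it can be sketched with toy numbers (assuming, as in the current iteration, that the label distributions are renormalized every step):

```python
import numpy as np

T = np.array([[0.5, 0.5],
              [0.25, 0.75]])  # toy transition matrix
Y = np.array([[1.0, 0.0],
              [0.0, 1.0]])    # toy label distributions
c = 7.0  # a constant row-sum factor left undivided

step = T @ Y
step /= step.sum(axis=1, keepdims=True)

step_scaled = (c * T) @ Y
step_scaled /= step_scaled.sum(axis=1, keepdims=True)

assert np.allclose(step, step_scaled)  # the constant factor cancels
```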

normalizer = np.ravel(normalizer)
# common case: knn method gives row sum k for all rows
if np.all(normalizer == normalizer[0]):
    affinity_matrix.data /= normalizer[0]
Contributor

Suggested change
- affinity_matrix.data /= normalizer[0]

# common case: knn method gives row sum k for all rows
if np.all(normalizer == normalizer[0]):
    affinity_matrix.data /= normalizer[0]
else:  # row sums not the same
Contributor

Suggested change
- else: # row sums not the same
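Putting the thread together, one way the sparse branch could read after the suggested simplifications (a sketch under the assumption that each row is always divided by its own sum, CSR layout assumed):

```python
import numpy as np
from scipy import sparse

affinity_matrix = sparse.csr_array([[2.0, 2.0, 0.0],
                                    [0.0, 1.0, 3.0],
                                    [5.0, 0.0, 0.0]])
normalizer = np.ravel(affinity_matrix.sum(axis=1))
# Divide every stored entry by the sum of its row, no scalar shortcut.
affinity_matrix.data /= np.repeat(normalizer, np.diff(affinity_matrix.indptr))

assert np.allclose(affinity_matrix.sum(axis=1), 1.0)
```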

Development

Successfully merging this pull request may close these issues.

Strange normalization of semi-supervised label propagation in _build_graph