
FIX normalization in semi_supervised label_propagation #31924


Open · wants to merge 2 commits into main

Conversation

@dschult (Contributor) commented Aug 11, 2025

Fixes #31872: strange normalization in semi-supervised label propagation

The trouble, briefly:

  • In the dense affinity_matrix case, the current code sums along axis=0 and then divides the rows by these sums, while other normalizations in semi_supervised use axis=1. This causes no error as long as the affinity_matrix is symmetric, and the dense case arises for the "rbf" kernel, which produces symmetric matrices. But if someone supplies their own kernel, the normalization can be incorrect.
  • In the sparse affinity_matrix case, the current code divides every row by the sum of the first row. This causes no error as long as all row sums are equal, and the sparse case arises for the "knn" kernel, where every row sums to k. But again, a user-supplied kernel can make the normalization incorrect.
  • The normalization differs between the dense and sparse cases, which could confuse someone writing their own kernel. (Both behaviors are sketched below.)
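For concreteness, here is a minimal sketch of both behaviors on an asymmetric matrix with unequal row sums. This is a hypothetical illustration in plain numpy/scipy, not the actual _build_graph code:

```python
import numpy as np
from scipy import sparse

# Asymmetric affinity with unequal row sums -- the case a custom
# kernel can produce.
A = np.array([[1.0, 3.0],
              [2.0, 6.0]])

# Dense path today: divide rows by COLUMN sums (axis=0).
dense_today = A / A.sum(axis=0)

# Intended behavior: divide each row by its own ROW sum (axis=1).
row_normalized = A / A.sum(axis=1, keepdims=True)
print(np.allclose(dense_today, row_normalized))   # False: they disagree

# Sparse path today: every row divided by the FIRST row's sum.
S = sparse.csr_matrix(A)
sparse_today = (S / S[0].sum()).toarray()
print(np.allclose(sparse_today, row_normalized))  # False again
```

For the built-in kernels both discrepancies vanish: "rbf" gives a symmetric matrix (column sums equal row sums) and "knn" gives constant row sums, so the first-row shortcut happens to be right.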

This PR adds tests of proper normalization that agrees between sparse and dense.
It also adjusts the code so it works with both sparse arrays and sparse matrices; a sketch of one such row normalization follows.
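One way to make the two paths agree is a single row normalization that handles dense arrays and sparse arrays/matrices alike. The normalize_rows helper below is a sketch for illustration, not the exact diff in this PR:

```python
import numpy as np
from scipy import sparse

def normalize_rows(affinity):
    """Divide each row by its own sum (hypothetical sketch)."""
    if sparse.issparse(affinity):
        # sum(axis=1) returns an np.matrix for sparse matrices and a
        # 1-D ndarray for sparse arrays; asarray + ravel covers both.
        row_sums = np.asarray(affinity.sum(axis=1)).ravel()
        # Broadcast a per-row scale factor; multiply keeps sparsity.
        return affinity.multiply(1.0 / row_sums[:, np.newaxis]).tocsr()
    row_sums = affinity.sum(axis=1, keepdims=True)
    return affinity / row_sums
```

With this shape, "rbf" and "knn" inputs behave exactly as before (symmetric matrices and constant row sums are unaffected), while custom kernels get consistent row-normalized output.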

The tests check that normalization agrees between dense and sparse cases even if the affinity_matrix is not symmetric and does not have equal row sums. The errors corrected here do not arise for users who use the sklearn kernel options.
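Paraphrased as a sketch, this is the kind of check the new tests perform. The test below is hypothetical, written against plain numpy/scipy rather than the private scikit-learn helpers:

```python
import numpy as np
from scipy import sparse

def test_dense_sparse_normalization_agree():
    # Asymmetric matrix with unequal row sums: the case the old
    # code paths disagreed on.
    rng = np.random.RandomState(0)
    A = rng.rand(5, 5)

    dense = A / A.sum(axis=1, keepdims=True)

    S = sparse.csr_matrix(A)
    row_sums = np.asarray(S.sum(axis=1)).ravel()
    sparse_result = S.multiply(1.0 / row_sums[:, np.newaxis]).toarray()

    np.testing.assert_allclose(dense, sparse_result)
```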

I discovered this while working on making sure sparse arrays and sparse matrices produce the same values (#31177). The fix is split out of that PR because it corrects/changes the current code and adds a test; separating it from the large number of changes in the other PR is prudent and eases review.


✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b35d9dd.

@dschult changed the title from "Sparse normalizer" to "MAINT: Fix normalization in semi_supervised label_propagation" on Aug 11, 2025
@dschult changed the title from "MAINT: Fix normalization in semi_supervised label_propagation" to "FIX normalization in semi_supervised label_propagation" on Aug 11, 2025
@adrinjalali (Member)

@snath-xoc @antoinebaker could you have a look here please?

@snath-xoc (Contributor)

I can take a look, thank you for the ping ☺️

Development

Successfully merging this pull request may close these issues.

Strange normalization of semi-supervised label propagation in _build_graph