
Strange normalization of semi-supervised label propagation in _build_graph #31872

@dschult


The method _build_graph on the LabelPropagation class in sklearn/semi_supervised/_label_propagation.py (line 455) treats normalization differently for sparse and dense kernels. I have questions about both of them.

**(Edited) Summary**
Problems with the current normalization code:

  • In the dense affinity_matrix case, the current code sums along axis=0 and then divides the rows by these sums. Other normalizations in semi_supervised use axis=1 (as this case should). This does not cause incorrect results so long as the affinity_matrix is symmetric. The dense case arises for the kernel "rbf", which provides symmetric matrices. But if someone provides their own kernel the normalization could be incorrect.
  • In the sparse affinity_matrix case, the current code divides the whole matrix by the sum of the first column, i.e. by a single scalar. This is not standard normalization, but it does not cause errors so long as the row sums are all equal (the intended per-row normalization is then also just a constant scaling). The sparse case arises for the kernel "knn", which has all rows sum to k. But if someone provides their own kernel the normalization could be incorrect.
  • The normalization is different for the dense and sparse cases, which could be confusing to someone writing their own kernel.

The fix involves changing axis=0 to axis=1 and correcting the sparse case so that each row is divided by its own sum, which matters whenever the row sums are not all equal. A sketch of the corrected normalization is shown below.
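A minimal sketch of what that could look like, assuming a CSR-format sparse matrix (the function name and structure are illustrative, not the actual sklearn code):

```python
import numpy as np
from scipy import sparse


def normalize_rows(affinity_matrix):
    """Divide every row by its own sum, identically for dense and sparse input."""
    if sparse.issparse(affinity_matrix):
        row_sums = np.ravel(affinity_matrix.sum(axis=1))
        # CSR stores values row by row; repeat each row sum once per stored entry
        entries_per_row = np.diff(affinity_matrix.indptr)
        affinity_matrix.data /= np.repeat(row_sums, entries_per_row)
    else:
        row_sums = affinity_matrix.sum(axis=1)
        affinity_matrix /= row_sums[:, np.newaxis]
    return affinity_matrix
```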

original somewhat rambling description

**Summary**
The method returns a different affinity_matrix for the sparse and the dense versions of the same kernel matrix. Neither the sparse nor the dense version normalizes the usual way (columns summing to 1). The dense case is correct for symmetric input kernels. The sparse case scales all values by a constant instead of by the column sums.

I suspect the results still converge in most non-symmetric cases. That's probably why this hasn't caused any issues. But some label_propagation issues like #8008, #11784, or #9722 could possibly be related.

**sparse treatment**

The sparse treatment is not really normalization: it scales the entire matrix by a scalar instead of normalizing the rows or columns. I guess it is harmless to divide the entire matrix by a scalar, but there is enough code effort involved that it seems the intent was something other than what actually happens. The normalization is done via:

            affinity_matrix.data /= np.diag(np.array(normalizer))

where normalizer = affinity_matrix.sum(axis=0). So normalizer is of type np.matrix with shape (1, n) for n columns. The np.array turns that into an ndarray -- but it leaves the shape as 2D. Then np.diag takes the main diagonal of a 2D row matrix. Note that np.diag takes 1D->2D and 2D->1D. So it is easy to get something different from what you expect if the dimension is 2D instead of 1D. The result here is just the first entry in the row. We go through the trouble of summing across axis 0, but then we divide the whole matrix by the first column's sum.
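A small self-contained demonstration of that behavior (just an illustration of the numpy/scipy calls involved, not the sklearn code itself):

```python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[1.0, 2.0],
                         [3.0, 4.0]]))
normalizer = A.sum(axis=0)               # np.matrix of shape (1, 2): [[4., 6.]]
divisor = np.diag(np.array(normalizer))  # 2D input -> main diagonal: array([4.])
A.data /= divisor                        # every stored value divided by 4.0 only
print(A.toarray())                       # [[0.25 0.5 ]
                                         #  [0.75 1.  ]]
```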

I think this is a bug stemming from assuming normalizer to be a 1D object. Then np.diag would construct a diagonal matrix with the sums on the main diagonal. Dividing the whole affinity matrix by that diagonal matrix would result in each column being divided by its sum, which is a typical normalization. A second problem here is that we do not divide affinity_matrix. Instead we divide affinity_matrix.data. So there is no concept of a column in this division.

If I am correct, one fix would be to avoid trying to divide by a diagonal matrix. Instead divide each data value by the sum corresponding to the column for that data value. We could use np.ravel to avoid the 2D indexing issues. Something like:

            affinity_matrix.data /= np.ravel(normalizer)[affinity_matrix.indices]
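For example, on a small non-symmetric CSR matrix (where indices holds the column index of each stored value), this makes every column sum to 1:

```python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[1.0, 2.0],
                         [3.0, 4.0]]))
normalizer = A.sum(axis=0)                  # column sums: [[4., 6.]]
A.data /= np.ravel(normalizer)[A.indices]   # divide each entry by its column's sum
print(A.toarray().sum(axis=0))              # [1. 1.]
```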

**dense treatment**

The dense treatment scales each row by the corresponding column sum (instead of each column by its own column sum). Perhaps this is what it should be, but it is not the standard normalization of either the columns or the rows. The relevant code is:

            affinity_matrix /= normalizer[:, np.newaxis]

In this case, normalizer is 1D (affinity_matrix is an ndarray). So the index [:, np.newaxis] broadcasts the column sums down the rows, i.e. row 3 of affinity_matrix is divided by the sum of column 3. The more natural code would be

            affinity_matrix /= normalizer

and results in scaling each column by its own sum so that the resulting columns add to 1. Note that for a square symmetric affinity_matrix the row sums and column sums are the same, so it is possible we are only supporting symmetric kernels. But there are tests without a symmetric affinity_matrix. I haven't figured out why the method still seems to "work" even though the scaling is very strange.
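A small non-symmetric example illustrating the two broadcast directions (again just a demonstration, not the sklearn code):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
normalizer = A.sum(axis=0)               # column sums: [4., 6.]

current = A / normalizer[:, np.newaxis]  # what the code does: row i divided by the sum of column i
proposed = A / normalizer                # each column divided by its own sum

print(current.sum(axis=0))   # [0.75 1.166...] -- neither rows nor columns sum to 1
print(proposed.sum(axis=0))  # [1. 1.]         -- columns sum to 1
```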

**Speculation**

**dense** I suspect the original code used ndarray only and was tested with a symmetric affinity_matrix. It looks like the code mixes up rows and columns when indexing to prepare for broadcasting.
**sparse** I suspect this code was an attempt to normalize using matrix division by a diagonal matrix. But the code lost track of np.diag switching from expanding 1D->2D to reducing 2D->1D when the input is 2D instead of 1D. The result has shape (1,), so it still broadcasts against any size input, and thus no exception is raised.

**doc_string** The doc_string mentions that this method may not result in a normalized matrix, but it only briefly justifies that. It is true that the doc_string is not public (the method is private), but it should be revised if/when this gets fixed.

Notes: This method needs to be updated when shifting from spmatrix to sparray because the sparse sum is no longer a 2D np.matrix; it is a 1D np.ndarray. That's how I discovered this rabbit hole. :)
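For reference, the type/shape difference that triggered this (assuming a scipy version that provides sparse arrays):

```python
import numpy as np
from scipy.sparse import csr_matrix, csr_array

M = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(type(csr_matrix(M).sum(axis=0)), csr_matrix(M).sum(axis=0).shape)
# <class 'numpy.matrix'> (1, 2)
print(type(csr_array(M).sum(axis=0)), csr_array(M).sum(axis=0).shape)
# <class 'numpy.ndarray'> (2,)
```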

I can make a PR for this, but I wanted to make sure my perspective is correct first. Do we want the affinity matrix to be stochastic (with columns summing to 1)? Or is there another approach this code uses that I'm not seeing? Thanks...
