Description
The method `_build_graph` on the `LabelPropagation` class in `sklearn/semi_supervised/_label_propagation.py` (line 455) treats normalization differently for sparse and dense kernels. I have questions about both of them.
**(Edited) Summary**
Troubles with the current normalization code:
- In the dense `affinity_matrix` case, the current code sums along `axis=0` and then divides the rows by these sums. Other normalizations in `semi_supervised` use `axis=1` (as this case should). This does not produce incorrect results so long as the `affinity_matrix` is symmetric. The dense case arises for kernel `"rbf"`, which produces symmetric matrices. But if someone provides their own kernel, the normalization could be incorrect.
- In the sparse `affinity_matrix` case, the current code divides all rows by the sum of the first row. This is not standard normalization, but it does not cause errors so long as the row sums are all equal. The sparse case arises for kernel `"knn"`, where every row sums to `k`. But if someone provides their own kernel, the normalization could be incorrect.
- The normalization differs between the dense and sparse cases, which could be confusing to someone writing their own kernel.
The fix involves changing `axis=0` to `axis=1` and correcting the sparse case to divide each row by its sum when the row sums are not all equal.
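A minimal sketch of what the fixed row normalization could look like (the helper name `normalize_rows` is hypothetical, not the sklearn code):

```python
import numpy as np
from scipy import sparse

def normalize_rows(affinity_matrix):
    """Divide each row of the affinity matrix by its own row sum."""
    if sparse.issparse(affinity_matrix):
        # spmatrix.sum(axis=1) returns a 2D np.matrix; flatten it to 1D.
        normalizer = np.ravel(np.asarray(affinity_matrix.sum(axis=1)))
        csr = affinity_matrix.tocsr(copy=True)
        # Repeat each row sum once per stored value in that row so the
        # divisor lines up with the flat data array.
        csr.data = csr.data / np.repeat(normalizer, np.diff(csr.indptr))
        return csr
    normalizer = affinity_matrix.sum(axis=1)
    return affinity_matrix / normalizer[:, np.newaxis]

# Asymmetric dense kernel: rows should sum to 1 after normalization.
dense = np.array([[1.0, 3.0], [2.0, 2.0]])
print(normalize_rows(dense).sum(axis=1))  # [1. 1.]
print(normalize_rows(sparse.csr_matrix(dense)).sum(axis=1))
```

With this, the sparse and dense paths of the same kernel agree, and asymmetric user-provided kernels are handled correctly.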
**Original (somewhat rambling) description**
**Summary**
The method returns a different `affinity_matrix` for sparse and for dense versions of the same kernel matrix. Neither the sparse nor the dense version normalizes the usual way (columns summing to 1). The dense case is correct for symmetric input kernels. The sparse case scales all values by a constant instead of by column sums.

I suspect the results still converge in most non-symmetric cases. That's probably why this hasn't caused any issues. But some `label_propagation` issues like #8008, #11784, or #9722 could possibly be related.
**sparse treatment**
The sparse treatment is not really normalization: it scales the entire matrix by a scalar instead of normalizing the rows or columns. I guess it is harmless to divide the entire matrix by a scalar, but there is enough code effort involved that it seems the intent was not what actually happens. The normalization is done via:

```python
affinity_matrix.data /= np.diag(np.array(normalizer))
```

where `normalizer = affinity_matrix.sum(axis=0)`. So `normalizer` is of type `np.matrix` with shape `(1, n)` for `n` columns. The `np.array` turns that into an `ndarray`, but it leaves the shape 2D. Then `np.diag` takes the main diagonal of that 2D row matrix. Note that `np.diag` maps 1D inputs to 2D (builds a diagonal matrix) and 2D inputs to 1D (extracts the diagonal), so it is easy to get something different from what you expect when the input is 2D instead of 1D. The result here is just the first entry of the row. We go through the trouble of summing across axis 0, but then we divide the whole matrix by the first column's sum.
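The `np.diag` pitfall described above can be reproduced in a few lines (a small demonstration, not the sklearn code):

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[1.0, 2.0], [3.0, 4.0]]))
normalizer = A.sum(axis=0)  # np.matrix of shape (1, 2): [[4., 6.]]

# np.array keeps the 2D shape, so np.diag *extracts* the main diagonal
# of a (1, 2) array -- which is just the first entry (the first
# column's sum), not a diagonal matrix of all the sums.
print(np.diag(np.array(normalizer)))  # [4.]

# With a 1D input, np.diag would instead *build* a 2x2 diagonal matrix:
print(np.diag(np.ravel(normalizer)))
```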
I think this is a bug stemming from assuming `normalizer` is a 1D object. Then `np.diag` would construct a diagonal matrix with the sums on the main diagonal, and dividing the whole affinity matrix by that diagonal matrix would divide each column by its sum, which is a typical normalization. A second problem here is that we do not divide `affinity_matrix`; instead we divide `affinity_matrix.data`, so there is no concept of a column in this division.
If I am correct, one fix would be to avoid dividing by a diagonal matrix. Instead, divide each data value by the sum corresponding to that value's column. We could use `np.ravel` to avoid the 2D indexing issues. Something like:

```python
affinity_matrix.data /= np.ravel(normalizer)[affinity_matrix.indices]
```
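To sanity-check that one-liner: in CSR format, `indices` holds the column index of each stored value in `data`, so fancy-indexing the raveled column sums lines a divisor up with every data value. A small check with an asymmetric matrix:

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[1.0, 2.0], [3.0, 4.0]]))
normalizer = A.sum(axis=0)  # (1, n) np.matrix of column sums

# Each stored value is divided by the sum of its own column.
A.data /= np.ravel(normalizer)[A.indices]
print(A.toarray().sum(axis=0))  # [1. 1.] -- columns now sum to 1
```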
**dense treatment**
The dense treatment scales each row by the column sums (instead of each column by its column sum). Perhaps this is what it should be, but it is not the standard normalization of either the columns or the rows. The relevant code is:

```python
affinity_matrix /= normalizer[:, np.newaxis]
```
In this case, `normalizer` is 1D (`affinity_matrix` is an `ndarray`). So the index `[:, np.newaxis]` broadcasts the column sums along the rows, e.g. row 3 of the divisor is a repetition of the sum of column 3. The more natural code would be

```python
affinity_matrix /= normalizer
```

which scales each column by its column sum, so the resulting columns add to 1. Note that for a square symmetric `affinity_matrix`, row sums and column sums are the same, so it is possible we are only supporting symmetric kernels. But there are tests without a symmetric `affinity_matrix`. I haven't figured out why the method still seems to "work" even though the scaling is very strange.
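The difference between the two forms only shows up for an asymmetric kernel (a small demonstration with made-up numbers):

```python
import numpy as np

# Asymmetric dense kernel; normalizer holds the column sums (axis=0).
A = np.array([[1.0, 3.0], [2.0, 2.0]])
normalizer = A.sum(axis=0)  # [3., 5.]

# Current code: [:, np.newaxis] turns normalizer into a column vector,
# so row i is divided by the sum of *column* i.
current = A / normalizer[:, np.newaxis]
print(current.sum(axis=0))  # neither rows nor columns sum to 1

# Plain broadcasting divides each column by its own sum instead:
natural = A / normalizer
print(natural.sum(axis=0))  # [1. 1.]
```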
**Speculation**
**dense** I suspect the original code used `ndarray` only and was tested with a symmetric `affinity_matrix`. It looks like the code mixes up rows and columns when indexing to prepare for broadcasting.

**sparse** I suspect this code was an attempt to normalize via matrix division by a diagonal matrix, but it lost track of `np.diag` switching from expanding 1D->2D to reducing 2D->1D when the input is 2D instead of 1D. The result has shape `(1,)`, so it still broadcasts against any size of input, and thus no exception is raised.
**doc_string** The doc_string mentions that this method may not return a normalized matrix, but it justifies that only briefly. It is true that the doc_string is not public (the method is private), but it should be revised if/when this gets fixed.
**Notes** This method needs to be updated when shifting from `spmatrix` to `sparray`, because the sparse sum is then no longer a 2D `np.matrix` but a 1D `np.ndarray`. That's how I discovered this rabbit hole. :)
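The shape change that exposed all of this is easy to see side by side (requires a SciPy recent enough to provide `csr_array`):

```python
import numpy as np
from scipy import sparse

data = np.array([[1.0, 2.0], [3.0, 4.0]])

# Legacy spmatrix: sum returns a 2D np.matrix.
m = sparse.csr_matrix(data).sum(axis=0)
print(type(m).__name__, m.shape)  # matrix (1, 2)

# New sparray: sum returns a 1D ndarray, so np.diag(np.array(...))
# would *build* an (n, n) diagonal matrix instead of extracting one entry.
a = sparse.csr_array(data).sum(axis=0)
print(type(a).__name__, a.shape)  # ndarray (2,)
```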
I can make a PR for this, but I wanted to make sure my perspective is correct first. Do we want the affinity matrix to be stochastic (with columns summing to 1)? Or is there another approach this code uses that I'm not seeing? Thanks...