[MRG] Fix of Spectral embedding implementation #9062

jmargeta · 2017-06-08T13:43:18Z

Reference Issue

Fixes #8129
Continuation of and closes #8217 by @devanshdalal

What does this implement/fix? Explain your changes.

This fixes computation of the spectral embedding
(normalization of the spectrum by division instead of multiplication).

The first eigenvector of the graph Laplacian is now constant (property of fully connected graphs).
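
To illustrate why division is the appropriate normalization here, the following is a minimal NumPy sketch, not the scikit-learn code itself. It assumes a small, made-up connected graph `A`, builds the symmetric normalized Laplacian, and compares dividing versus multiplying the eigenvector of the smallest eigenvalue by `dd = sqrt(degree)`. All variable names are illustrative.

```python
import numpy as np

# Hypothetical connected, weighted, undirected graph (symmetric affinity matrix).
A = np.array([
    [0.0, 1.0, 2.0, 0.0],
    [1.0, 0.0, 1.0, 3.0],
    [2.0, 1.0, 0.0, 1.0],
    [0.0, 3.0, 1.0, 0.0],
])

degree = A.sum(axis=0)
dd = np.sqrt(degree)  # diagonal of D^{1/2}

# Symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}.
L_sym = np.eye(len(A)) - A / np.outer(dd, dd)

# eigh returns eigenvalues in ascending order, so column 0 belongs to eigenvalue ~0.
eigvals, eigvecs = np.linalg.eigh(L_sym)
first = eigvecs[:, 0]

divided = first / dd     # proportional to the all-ones vector
multiplied = first * dd  # proportional to the degree vector, not constant

print("smallest eigenvalue:", eigvals[0])
print("std after division:", np.std(divided))
print("std after multiplication:", np.std(multiplied))
```

The null vector of the normalized Laplacian of a connected graph is proportional to `sqrt(degree)`, so only the division maps it back to a constant vector.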

These changes add tests for previously untested code.
NOTE: Parts depending on pyamg are not automatically tested on Travis.
Testing all backends would require installing the optional package pyamg.

Replaces the AMG solver test with a more complete test:
- when the pyamg dependency is installed, the original test also fails on master (untouched for 5 years)
- the new test is instead based on the scikit-learn toy image segmentation example and checks that all three solvers give the same results

Any other comments?

The examples are mostly visually similar, with two exceptions:

- The results for the discretized label assignment method in the examples/cluster/plot_face_segmentation.py demo improve qualitatively. In this example a new seed was set to avoid failed plotting for the "discretize" option due to empty clusters (matplotlib's contour fails).
- In examples/manifold/plot_compare_methods.py, the SpectralEmbedding result loses its spread. A more spread-out distribution is reported for a similar example in the original papers.

jmargeta · 2017-06-08T13:44:23Z

What do you think @GaelVaroquaux @glemaitre @massich ?

agramfort · 2017-06-08T14:21:32Z

LGTM

glemaitre · 2017-06-08T15:08:40Z

@agramfort Do you have any thoughts regarding the visible difference on the S-curve dataset?
Originally, we had this: http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py

The embedding is more compact now.

jmargeta · 2017-06-08T15:23:45Z

This is how it looks for the other examples:

agramfort · 2017-06-08T15:26:31Z

indeed it's not as nice ... hum

vene · 2017-06-08T18:23:25Z

sklearn/manifold/tests/test_spectral_embedding.py

+                                       random_state=seed)
+
+        assert_array_almost_equal(np.diff(embedding[:, 0]), 0)
+        assert_array_equal(np.abs(np.diff(embedding[:, 1])) > 1e-3, True)


The second test asserts that no two consecutive elements are close, which seems stricter than what is needed?

Good point, @vene. Thanks for your input.
All I want is a test asserting that the elements are different (not constant).
The condition can probably be relaxed a bit with something like assert_greater(np.std(embedding[:, 1]), 1e-3).
Is this any better? What would you suggest instead of the current test?

There are two goals here:

- make the assertions semantically as close as possible to what we really mean
- make the assertions understandable by a new coder one year from now

I think you could use the std of the vector for both tests: for the first test the std should be nearly 0, for the second it should be nonzero. But add a small comment next to each?

A stronger idea for the first assertion: don't we mathematically know exactly what value the first component needs to have? (maybe at least when normed?) I'm not sure off the top of my head
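
The std-based assertions suggested in this thread can be sketched as follows. This is only an illustration under stated assumptions: `toy_embedding` is a hypothetical helper built on NumPy's `eigh`, not scikit-learn's `spectral_embedding`, and the graph is made up. The last assertion shows that, with unit-norm eigenvectors as used in this sketch, the constant value of the first component is known up to sign.

```python
import numpy as np


def toy_embedding(affinity, n_components=2):
    """Hypothetical helper: normalized-Laplacian eigenvectors divided by sqrt(degree)."""
    degree = affinity.sum(axis=0)
    dd = np.sqrt(degree)
    laplacian = np.eye(len(affinity)) - affinity / np.outer(dd, dd)
    _, vectors = np.linalg.eigh(laplacian)  # ascending eigenvalues, unit-norm columns
    return vectors[:, :n_components] / dd[:, np.newaxis]


# Hypothetical connected graph: a weighted path 0-1-2-3-4 plus one chord.
affinity = np.zeros((5, 5))
for i, j, w in [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0), (3, 4, 3.0), (0, 2, 0.5)]:
    affinity[i, j] = affinity[j, i] = w

embedding = toy_embedding(affinity)

# First component: constant across samples for a connected graph, so its std is ~0.
assert np.std(embedding[:, 0]) < 1e-8

# Second component: must carry information, so it is not constant.
assert np.std(embedding[:, 1]) > 1e-3

# In this sketch the eigenvectors have unit norm, so the constant is known
# exactly up to sign: 1 / sqrt(sum of all degrees).
expected = 1.0 / np.sqrt(affinity.sum())
assert np.allclose(np.abs(embedding[:, 0]), expected)
```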

GaelVaroquaux · 2017-06-08T20:02:00Z

Spectral embedding does look broken.

glemaitre · 2017-06-08T23:58:02Z

I spent a little bit of time checking the code and I think that the only tricky part is actually in the computation of the laplacian. I did a naive/textbook implementation:

import numpy as np

def _laplacian_dense(graph, normed=False, axis=0):
    a = np.asarray(graph, dtype=float)
    # Degree matrix D: edge weights summed along `axis`, placed on the diagonal.
    d = np.diag(a.sum(axis=axis))
    if normed:
        # Symmetric normalization: D^-1/2 (D - A) D^-1/2.
        d_half = np.linalg.inv(np.sqrt(d))
        # Note: the second return value is the diagonal of D^-1/2, not D^1/2.
        return d_half.dot(d - a).dot(d_half), d_half.diagonal()
    else:
        return d - a, d.diagonal()

and reran the S-curve and sphere examples, and I obtained different results:

In some ways they make more sense, but it is late and I would need pen and paper to work out the differences between the current implementation of the Laplacian and the one above.

glemaitre · 2017-06-09T00:02:11Z

@GaelVaroquaux I am no longer sure that this is actually a bug. My images above correspond exactly to the stable example. I suspect that the division by dd is linked to a trick in the computation of the Laplacian, for instance dd being D^-1/2 and not D^1/2, or something like that. However, it is pretty late :D

glemaitre · 2017-06-09T06:53:24Z

Uhm now that I am fresh, the laplacian is fine.

jmargeta · 2017-06-09T07:16:18Z

@glemaitre It was a good idea to take another look at the Laplacian.
What do the first eigenvectors and the segmentation example of the raccoon image with discretized label assignment look like for you now?

glemaitre · 2017-06-14T23:20:37Z

So in the end there is no bug in the Laplacian, and I don't think there will be one in the eigendecomposition. If I am not wrong, @jmargeta looked at other implementations of the spectral embedding and got the same results as what we showed earlier.

@jmargeta, isn't that right?

glemaitre · 2017-06-14T23:21:34Z

@jmargeta could you fix the PEP8 error and push, so that we at least pass all the tests?

codecov · 2017-06-15T07:16:53Z

Codecov Report

❗ No coverage uploaded for pull request base (master@e31c4f1).
The diff coverage is 86.36%.

@@            Coverage Diff            @@
##             master    #9062   +/-   ##
=========================================
  Coverage          ?   96.23%           
=========================================
  Files             ?      332           
  Lines             ?    59891           
  Branches          ?        0           
=========================================
  Hits              ?    57636           
  Misses            ?     2255           
  Partials          ?        0

Impacted Files	Coverage Δ
sklearn/cluster/spectral.py	`94.05% <100%> (ø)`
sklearn/manifold/tests/test_spectral_embedding.py	`96.17% <100%> (ø)`
sklearn/manifold/spectral_embedding_.py	`87.65% <50%> (ø)`
sklearn/cluster/tests/test_spectral.py	`96.46% <88.46%> (ø)`

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e31c4f1...c9cec63.

jmargeta · 2017-06-15T07:19:16Z

@glemaitre That is correct. Even the two Matlab-based implementations do not unroll the swiss roll (same for S-curve) onto a "flat" 2d surface as I would expect.

I intend to check two more changes in construction of the affinity graph this afternoon to make a better example case (if possible) and will come back to you all.

To follow fix in spectral embedding

For fully connected graphs the first eigenvector should be constant while the others not.

Replaces AMG solver test with a more complete test
- NOTE: paths depending on pyamg are not automatically tested on Travis
- when the pyamg dependency is installed, the original fails also on master, probably for a very long time
- test based on scikit-learn toy image segmentation example instead, comparing results of all three solvers to be the same

First eigenvector should be kept for clustering. It is only constant for fully connected graphs. Reverts 95d8111

Fix a seed so that the example would not fail in plotting. "discretize" might return empty clusters causing contour to fail

@vene

Semantics of the tests should be clearer now. Based on suggestions of @vene

ljwolf · 2017-12-05T18:34:22Z

Hi,

I'm using this patch locally for some spatially-constrained spectral clustering, and I see the correct behavior when using this patch.
I see undesirable behavior without it.

The difference is pretty dramatic for my use case. Namely, when you're clustering on a geographic lattice's adjacency matrix, without this fix, the discovered clusters are sometimes not contiguous. As far as I understand it, a binary adjacency matrix should yield connected clusters by this method. I think this is a more dramatic showing of what @glemaitre said when the embedding appears more compact.

An example is below from Rook contiguity for US counties. sklearn 0.19.1 is on right, this patch is on left. You see lots of enclaves at some weaker borders:

And sometimes poorly-connected nodes are not in the right cluster:

If this counts as a "better" example case @jmargeta, I'm happy to contrib.
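
One way to quantify the enclaves described above is to check whether every cluster induces a connected subgraph of the adjacency graph. The following is a minimal standard-library sketch. `rook_lattice` and `is_contiguous` are hypothetical helper names, and the labels are made up for illustration rather than produced by spectral clustering.

```python
from collections import deque


def rook_lattice(n_rows, n_cols):
    """Adjacency lists for a grid where cells sharing an edge are neighbours."""
    neighbours = {}
    for r in range(n_rows):
        for c in range(n_cols):
            cell = r * n_cols + c
            neighbours[cell] = [
                rr * n_cols + cc
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < n_rows and 0 <= cc < n_cols
            ]
    return neighbours


def is_contiguous(neighbours, labels, cluster):
    """Return True if all nodes labelled `cluster` form one connected component."""
    members = {node for node, label in enumerate(labels) if label == cluster}
    if not members:
        return True
    start = next(iter(members))
    seen = {start}
    queue = deque([start])
    # Breadth-first search restricted to nodes of the same cluster.
    while queue:
        node = queue.popleft()
        for other in neighbours[node]:
            if other in members and other not in seen:
                seen.add(other)
                queue.append(other)
    return seen == members


grid = rook_lattice(3, 3)
# Cells are numbered row by row:
# 0 1 2
# 3 4 5
# 6 7 8
contiguous_labels = [0, 0, 1, 0, 0, 1, 0, 1, 1]
enclave_labels = [1, 0, 0, 0, 0, 0, 0, 0, 1]  # cluster 1 is split into two corners

print(is_contiguous(grid, contiguous_labels, 1))  # True
print(is_contiguous(grid, enclave_labels, 1))     # False
```

Running such a check on the labels obtained with and without the patch would give a simple count of non-contiguous clusters to accompany the maps.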

jmargeta · 2017-12-11T12:10:33Z

build_tools/travis/install.sh

@@ -38,8 +38,7 @@ if [[ "$DISTRIB" == "conda" ]]; then
    conda update --yes conda

    TO_INSTALL="python=$PYTHON_VERSION pip pytest pytests-cov" \


Note that removing the quotes on each line also fixes the issue.
Such as:

TO_INSTALL="python=$PYTHON_VERSION pip pytest pytests-cov \
    numpy=$NUMPY_VERSION scipy=$SCIPY_VERSION \
    cython=$CYTHON_VERSION"

Thanks this is more readable :) I should learn bash at some point

Me too :) Sorry for not noticing the pytests-cov typo.

glemaitre · 2017-12-11T15:19:48Z

@lesteve I think that I can have my bonus point (even if I only copy paste and mess-up my bash writing)

jnothman

I'm not extremely confident, but I think this is good...

jnothman · 2017-12-11T22:25:17Z

Apart from sorting out the CI, that is.

lesteve · 2017-12-12T15:02:04Z

build_tools/travis/install.sh

    source activate testenv

+    if [[ -n "$PYAMG_VERSION" ]]; then
+        conda install --yes pyamg=$PYAMG_VERSION


Since you can install pyamg with conda, just use TO_INSTALL before activating the virtualenv, e.g. similar to what you do for pandas.

lesteve · 2017-12-12T16:00:03Z

build_tools/travis/install.sh

-            nomkl cython=$CYTHON_VERSION \
-            ${PANDAS_VERSION+pandas=$PANDAS_VERSION}
+        TO_INSTALL="$TO_INSTALL mkl"
+    fi


Here you need an else clause with nomkl.

@lesteve I do not understand why. Doesn't conda automatically install the nomkl version?
Do I assume correctly you mean this?

if [[ "$INSTALL_MKL" == "true" ]]; then
    TO_INSTALL="$TO_INSTALL mkl"
else
    TO_INSTALL="$TO_INSTALL nomkl"
fi

Found out why :) Change done. Well spotted.

Pyamg is added to the list of packages to be installed upon creation of conda virtual environment on Travis. Based in Loïc's comment.

Wrapped line with author list.

Based on Loïc's comments

Test that spectral clustering raises an exception if amg solver is selected but pyamg not installed.
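
The kind of check exercised by such a test can be sketched as follows. This is an illustrative stand-in, not scikit-learn's code: `resolve_eigen_solver` and its `pyamg_available` parameter are hypothetical, and the error messages are assumptions.

```python
import importlib.util


def resolve_eigen_solver(eigen_solver, pyamg_available=None):
    """Hypothetical helper: validate the solver name and the optional pyamg dependency."""
    if pyamg_available is None:
        # Detect the optional dependency without importing it.
        pyamg_available = importlib.util.find_spec("pyamg") is not None
    if eigen_solver not in ("arpack", "lobpcg", "amg"):
        raise ValueError("Unknown eigen_solver: %r" % (eigen_solver,))
    if eigen_solver == "amg" and not pyamg_available:
        raise ValueError("eigen_solver='amg' requires pyamg, which is not installed.")
    return eigen_solver


def test_amg_without_pyamg_raises():
    # Simulate a machine without pyamg by passing the flag explicitly.
    try:
        resolve_eigen_solver("amg", pyamg_available=False)
    except ValueError as exc:
        assert "pyamg" in str(exc)
    else:
        raise AssertionError("expected a ValueError when pyamg is missing")


test_amg_without_pyamg_raises()
```

Passing the availability flag explicitly keeps the sketch testable on machines with or without pyamg. In a pytest-based suite the try/except would typically be written with `pytest.raises`.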

lesteve · 2017-12-15T16:22:11Z

I don't understand all the details, but it looks like this is an improvement so I would be in favour of merging.

lesteve · 2017-12-18T09:20:23Z

Let's be slightly bold and merge this one! Thanks a lot @devanshdalal, @jmargeta and @glemaitre!

amueller · 2017-12-18T15:51:22Z

sklearn/manifold/tests/test_spectral_embedding.py

@@ -1,6 +1,6 @@
+import pytest


wait that's the first explicit dependency on pytest, right?

Nope, it has been discussed in another PR:
#10081 (comment)

We also added there: https://github.com/scikit-learn/scikit-learn/blob/4f710cdd088aa8851e8b049e4faafa03767fda10/sklearn/preprocessing/tests/test_target.py

There are some in the ColumnTransformer.

Do you wish to revert it?

No that's good, sorry I'm out of the loop.

We should have pinged you there, though.

I'm sure that wouldn't have gotten lost in my 17615 unread scikit-learn notifications ;) Hm can I filter stuff that I'm pinged in... good question...

oh that reduces it to only 5782 (only unread threads that I'm subscribed to)!

amueller · 2017-12-18T15:53:59Z

Btw, is this related to #811? Or was #811 just a feature request?

glemaitre · 2017-12-18T16:29:07Z

uhm I have to check it. FYI, since #9077 we completely rely on scipy for the computation of the Laplacian and we don't have any backport.

glemaitre · 2017-12-18T16:31:18Z

If there is a bug we have to check there:
https://github.com/scipy/scipy/blob/v1.0.0/scipy/sparse/csgraph/_laplacian.py#L114

amueller · 2017-12-18T16:32:03Z

oh, well, #811 is just totally out of date then and I'll close it ;)

vene reviewed Jun 8, 2017


vene mentioned this pull request Jun 9, 2017

[MRG+2] MAINT: no longer backport graph_laplacian, use scipy one #9077

Merged

Devansh D and others added 11 commits June 15, 2017 12:30

spectral_embedding bug-fix and suggestions

b6035db

spectral_embedding bug-fix and suggestions

b6febed

spectral_embedding bug-fix and suggestions

66a48c4

Fix spectral clustering

68f3bcd

To follow fix in spectral embedding

Fix style of spectral_embedding

d237fda

Add test of properties of graph Laplacian spectrum

9ad8e86

For fully connected graphs the first eigenvector should be constant while the others not.

Fix spectral clustering

8d601b0

First eigenvector should be kept for clustering. It is only constant for fully connected graphs. Reverts 95d8111

Change seed for face segmentation example

de7e44e

Fix a seed so that the example would not fail in plotting. "discretize" might return empty clusters causing contour to fail

Revert to the use of shortened diffusion maps

ff55272

Silence unused import error

fea5b83

jmargeta force-pushed the devanshdalal/fix-#8129 branch from c9cec63 to fea5b83 Compare June 15, 2017 10:56

Make Laplacian eigenvector tests more meaninful

5be8742

Semantics of the tests should be clearer now. Based on suggestions of @vene

glemaitre added 2 commits December 11, 2017 13:04

FIX bash indent

bdc20ed

FIX forget white space

91e4d74

jmargeta commented Dec 11, 2017


glemaitre added 3 commits December 11, 2017 13:15

jan comments

9b2bf5c

iter

67369b4

typo

650ad38

jnothman approved these changes Dec 11, 2017


lesteve reviewed Dec 12, 2017


jmargeta added 5 commits December 12, 2017 17:04

Install pyamg before activating the virtual env

a20d693

Pyamg is added to the list of packages to be installed upon creation of conda virtual environment on Travis. Based in Loïc's comment.

Fix line wrapping in v0.20.rst

3f5669d

Wrapped line with author list.

Fix installation of nomkl conda in Travis build

d01f9cb

Based on Loïc's comments

Improve amg solver test coverage

274ff78

Test that spectral clustering raises an exception if amg solver is selected but pyamg not installed.

Fix style by removing a blank line

b01dd33

lesteve merged commit d01cdc2 into scikit-learn:master Dec 18, 2017

jmargeta deleted the devanshdalal/fix-#8129 branch December 18, 2017 11:16

amueller reviewed Dec 18, 2017


jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

Fix spectral embedding implementation (scikit-learn#9062)

54c402b

jotasi mentioned this pull request Jan 25, 2018

[MRG + 1] Remove pilutil from examples #10527

Merged

sky88088 mentioned this pull request Mar 2, 2018

[WIP] Improve spectral clustering implementation #10739

Closed

qinhanmin2014 mentioned this pull request Aug 13, 2018

[MRG] MNT Tools for working with what's new #11800

Merged

cmarmo mentioned this pull request Mar 1, 2021

graph laplacian difference #811

Closed
