Skip to content

[MRG] Fix of Spectral embedding implementation #9062

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 45 commits into from
Dec 18, 2017

Conversation

jmargeta
Copy link
Contributor

@jmargeta jmargeta commented Jun 8, 2017

Reference Issue

Fixes #8129
Continuation of and closes #8217 by @devanshdalal

What does this implement/fix? Explain your changes.

This fixes computation of the spectral embedding
(normalization of the spectrum by division instead of multiplication).

The first eigenvector of the graph Laplacian is now constant (property of fully connected graphs).

These changes add tests for untested codebase.
NOTE: Parts depending on pyamg are not automatically tested on Travis
Testing of all backends would require to install optional package pyamg.

Replaces AMG solver test with a more complete test
- when the pyamg dependency is installed, the original fails also on master (untouched for 5 years)
- the new test based on scikit-learn toy image segmentation example instead
compares results of all three solvers to be the same

Any other comments?

The examples are mostly visually similar. With two exceptions:

  • the quality of the results for discretized label assignment method in examples/cluster/plot_face_segmentation.py demo qualitatively improves discretize.
    In this example a new seed was set to avoid failed plotting for "discretize" option due to empty clusters (matplotlib's contour fail)

  • in examples/manifold/plot_compare_methods.py, for manifold comparison for SpectralEmbedding looses its spread. A more spread out distribution is reported for similar example in the original papers. manifold_method_comparison.

@jmargeta
Copy link
Contributor Author

jmargeta commented Jun 8, 2017

What do you think @GaelVaroquaux @glemaitre @massich ?

@agramfort
Copy link
Member

LGTM

@glemaitre
Copy link
Member

@agramfort Do you have any thoughts regarding the visible difference on the S-curve dataset.
Originally, we had this: http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py

The embedding is more compact now.

@jmargeta
Copy link
Contributor Author

jmargeta commented Jun 8, 2017

This is how it looks for the other examples:
digits
plot_manifold_sphere

@agramfort
Copy link
Member

agramfort commented Jun 8, 2017 via email

random_state=seed)

assert_array_almost_equal(np.diff(embedding[:, 0]), 0)
assert_array_equal(np.abs(np.diff(embedding[:, 1])) > 1e-3, True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second test asserts that no two consecutive elements are close, which seems stricter than what is needed?

Copy link
Contributor Author

@jmargeta jmargeta Jun 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point @vene Thanks for your input.
All I want is a test asserting that the elements are different (not constant).
The condition can be probably relaxed a bit with something like assert_greater(np.std(embedding[:, 1]), 1e-3).
Is this any better? What would you suggest instead of the current test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are two goals here:

  1. make the assertions emantically as close to what we really mean
  2. make the assertions understandable by a new coder one year from now

I think you could use the std of the vector for both tests: for the first test the std should be nearly 0, for the second it should be nonzero. but add a small comment next to each?

A stronger idea for the first assertion: don't we mathematically know exactly what value the first component needs to have? (maybe at least when normed?) I'm not sure off the top of my head

@GaelVaroquaux
Copy link
Member

Spectral embedding does look broken.

@glemaitre
Copy link
Member

I spent a little bit of time checking the code and I think that the only tricky part is actually in the computation of the laplacian. I did a naive/textbook implementation:

def _laplacian_dense(graph, normed=False, axis=0): 
    a = np.array(graph)                                                                                              
    d = np.eye(a.shape[0])                                                                                           
    if normed:                                                                                                       
        _setdiag_dense(d, a.sum(axis=0))                                                                             
        l = np.matrix(d - a)                                                                                         
        d_half = np.matrix(np.linalg.inv(np.sqrt(d)))                                                                
        return np.array(d_half * l * d_half), np.array(d_half.diagonal())                                            
    else:                                                                                                            
        _setdiag_dense(d, a.sum(axis=0))                                                                             
        return np.array(d - a), np.array(d.diagonal())

and rerun the S-curve and and sphere examples and I obtained different results:
figure_1
figure_2

In some way they make more sense but it is late and I would need a pen and paper to get what are the difference between the current implementation of the laplacian and the above one.

@glemaitre
Copy link
Member

glemaitre commented Jun 9, 2017

@GaelVaroquaux I am not anymore sure if this is actually a bug. My above image are corresponding exactly to the stable example. I would suspect that the division by dd was linked with a trick in the computation of the laplacian for instance dd is D^-1/2 and not D^1/2 or something like that. However, this is pretty late :D

@glemaitre
Copy link
Member

glemaitre commented Jun 9, 2017 via email

@jmargeta
Copy link
Contributor Author

jmargeta commented Jun 9, 2017

@glemaitre Was a good idea to relook into the Laplacian.
How do the first eigenvectors and the segmentation example of the racoon image with discretized label assignment look like for you now?

@glemaitre
Copy link
Member

So finally there is no bug in the laplacian and I don't think that there will be one in the eigen decomposition. I think that @jmargeta looked at other implementation of the spectral embedding and got the same results than what we show earlier, if I am not wrong.

@jmargeta isn't?

@glemaitre
Copy link
Member

@jmargeta if you could solve the PEP8 error and push such that we pass all the tests at least

@codecov
Copy link

codecov bot commented Jun 15, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@e31c4f1). Click here to learn what that means.
The diff coverage is 86.36%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #9062   +/-   ##
=========================================
  Coverage          ?   96.23%           
=========================================
  Files             ?      332           
  Lines             ?    59891           
  Branches          ?        0           
=========================================
  Hits              ?    57636           
  Misses            ?     2255           
  Partials          ?        0
Impacted Files Coverage Δ
sklearn/cluster/spectral.py 94.05% <100%> (ø)
sklearn/manifold/tests/test_spectral_embedding.py 96.17% <100%> (ø)
sklearn/manifold/spectral_embedding_.py 87.65% <50%> (ø)
sklearn/cluster/tests/test_spectral.py 96.46% <88.46%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e31c4f1...c9cec63. Read the comment docs.

@jmargeta
Copy link
Contributor Author

@glemaitre That is correct. Even the two Matlab-based implementations do not unroll the swiss roll (same for S-curve) onto a "flat" 2d surface as I would expect.

I intend to check two more changes in construction of the affinity graph this afternoon to make a better example case (if possible) and will come back to you all.

Devansh D and others added 11 commits June 15, 2017 12:30
To follow fix in spectral embedding
For fully connected graphs the first eigenvector should be constant
while the others not.
Replaces AMG solver test with a more complete test
 - NOTE: paths depending on pyamg are not automatically tested on Travis
 - when the pyamg dependency is installed, the original fails also on master
   probably for a very long time
 - test based on scikit-learn toy image segmentation example instead
   comparing results of all three solvers to be the same
First eigenvector should be kept for clustering.
It is only constant for fully connected graphs.

Reverts 95d8111
Fix a seed so that the example would not fail in plotting.
"discretize" might return empty clusters causing contour to fail
@jmargeta jmargeta force-pushed the devanshdalal/fix-#8129 branch from c9cec63 to fea5b83 Compare June 15, 2017 10:56
Semantics of the tests should be clearer now.
Based on suggestions of @vene
@ljwolf
Copy link
Contributor

ljwolf commented Dec 5, 2017

Hi,

I'm using this patch locally for some spatially-constrained spectral clustering, and I see the correct behavior when using this patch.
I see undesirable behavior without it.

The difference is pretty dramatic for my use case. Namely, when you're clustering on a geographic lattice's adjacency matrix, without this fix, the discovered clusters are sometimes not contiguous. As far as I understand it, a binary adjacency matrix should yield connected clusters by this method. I think this is a more dramatic showing of what @glemaitre said when the embedding appears more compact.

An example is below from Rook contiguity for US counties. sklearn 0.19.1 is on right, this patch is on left. You see lots of enclaves at some weaker borders:
2017-12-05-182433_1148x452_scrot
2017-12-05-182455_1123x463_scrot
2017-12-05-182630_1152x452_scrot

And sometimes poorly-connected nodes are not in the right cluster:
2017-12-05-182653_1109x436_scrot

If this counts as a "better" example case @jmargeta, I'm happy to contrib.

@@ -38,8 +38,7 @@ if [[ "$DISTRIB" == "conda" ]]; then
conda update --yes conda

TO_INSTALL="python=$PYTHON_VERSION pip pytest pytests-cov" \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that removing the quotes on each line also fixes the issue.
Such as:

TO_INSTALL="python=$PYTHON_VERSION pip pytest pytests-cov \
            numpy=$NUMPY_VERSION scipy=$SCIPY_VERSION \
            cython=$CYTHON_VERSION"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this is more readable :) I should learn bash at some point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me too :) Sorry for not noticing the pytests-cov typo.

@glemaitre
Copy link
Member

@lesteve I think that I can have my bonus point (even if I only copy paste and mess-up my bash writing)

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not extremely confident, but I think this is good...

@jnothman
Copy link
Member

Apart from sorting out the CI, that is.

source activate testenv

if [[ -n "$PYAMG_VERSION" ]]; then
conda install --yes pyamg=$PYAMG_VERSION
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you can install pyamg with conda, just use TO_INSTALL before activating the virtualenv, e.g. similar to what you do for pandas.

nomkl cython=$CYTHON_VERSION \
${PANDAS_VERSION+pandas=$PANDAS_VERSION}
TO_INSTALL="$TO_INSTALL mkl"
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here you need a else clause with nomkl.

Copy link
Contributor Author

@jmargeta jmargeta Dec 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lesteve I do not understand why. Doesn't conda automatically install the nomkl version?
Do I assume correctly you mean this?

    if [[ "$INSTALL_MKL" == "true" ]]; then
        TO_INSTALL="$TO_INSTALL mkl"     
    else
        TO_INSTALL="$TO_INSTALL nomkl"
    fi

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found out why :) Change done. Well spotted.

Pyamg is added to the list of packages to be installed upon creation of conda virtual environment on Travis.
Based in Loïc's comment.
Wrapped line with author list.
Test that spectral clustering raises an exception
if amg solver is selected but pyamg not installed.
@lesteve
Copy link
Member

lesteve commented Dec 15, 2017

I don't understand all the details, but it looks like this is an improvement so I would be in favour of merging.

@lesteve
Copy link
Member

lesteve commented Dec 18, 2017

Let's be slightly bold and merge this one! Thanks a lot @devanshdalal, @jmargeta and @glemaitre!

@lesteve lesteve merged commit d01cdc2 into scikit-learn:master Dec 18, 2017
@jmargeta jmargeta deleted the devanshdalal/fix-#8129 branch December 18, 2017 11:16
@@ -1,6 +1,6 @@
import pytest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait that's the first explicit dependency on pytest, right?

Copy link
Member

@glemaitre glemaitre Dec 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nop, it has been discuss in another PR:
#10081 (comment)

We also added there: https://github.com/scikit-learn/scikit-learn/blob/4f710cdd088aa8851e8b049e4faafa03767fda10/sklearn/preprocessing/tests/test_target.py

There is some in the ColumnTransformer.

Do you wish to revert it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that's good, sorry I'm out of the loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should have ping you there thought

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure that wouldn't have gotten lost in my 17615 unread scikit-learn notifications ;) Hm can I filter stuff that I'm pinged in... good question...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh that reduces it to only 5782 (only unread threads that I'm subscribed to)!

@amueller
Copy link
Member

Btw, is this related to #811? Or was #811 just a feature request?

@glemaitre
Copy link
Member

uhm I have to check it. FYI, since #9077 we completely rely on scipy for the computation of the Laplacian and we don't have any backport.

@glemaitre
Copy link
Member

If there is a bug we have to check there:
https://github.com/scipy/scipy/blob/v1.0.0/scipy/sparse/csgraph/_laplacian.py#L114

@amueller
Copy link
Member

oh, well, #811 is just totally out of date then and I'll close it ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

A math mistake in spectral_embedding
9 participants