-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG] Fix of Spectral embedding implementation #9062
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Fix of Spectral embedding implementation #9062
Conversation
What do you think @GaelVaroquaux @glemaitre @massich ? |
LGTM |
@agramfort Do you have any thoughts regarding the visible difference on the S-curve dataset. The embedding is more compact now. |
indeed it's not as nice ... hum
|
random_state=seed) | ||
|
||
assert_array_almost_equal(np.diff(embedding[:, 0]), 0) | ||
assert_array_equal(np.abs(np.diff(embedding[:, 1])) > 1e-3, True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second test asserts that no two consecutive elements are close, which seems stricter than what is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point @vene Thanks for your input.
All I want is a test asserting that the elements are different (not constant).
The condition can be probably relaxed a bit with something like assert_greater(np.std(embedding[:, 1]), 1e-3)
.
Is this any better? What would you suggest instead of the current test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are two goals here:
- make the assertions emantically as close to what we really mean
- make the assertions understandable by a new coder one year from now
I think you could use the std of the vector for both tests: for the first test the std should be nearly 0, for the second it should be nonzero. but add a small comment next to each?
A stronger idea for the first assertion: don't we mathematically know exactly what value the first component needs to have? (maybe at least when normed?) I'm not sure off the top of my head
Spectral embedding does look broken. |
|
Uhm now that I am fresh, the laplacian is fine.
|
@glemaitre Was a good idea to relook into the Laplacian. |
@jmargeta if you could solve the PEP8 error and push such that we pass all the tests at least |
Codecov Report
@@ Coverage Diff @@
## master #9062 +/- ##
=========================================
Coverage ? 96.23%
=========================================
Files ? 332
Lines ? 59891
Branches ? 0
=========================================
Hits ? 57636
Misses ? 2255
Partials ? 0
Continue to review full report at Codecov.
|
@glemaitre That is correct. Even the two Matlab-based implementations do not unroll the swiss roll (same for S-curve) onto a "flat" 2d surface as I would expect. I intend to check two more changes in construction of the affinity graph this afternoon to make a better example case (if possible) and will come back to you all. |
To follow fix in spectral embedding
For fully connected graphs the first eigenvector should be constant while the others not.
Replaces AMG solver test with a more complete test - NOTE: paths depending on pyamg are not automatically tested on Travis - when the pyamg dependency is installed, the original fails also on master probably for a very long time - test based on scikit-learn toy image segmentation example instead comparing results of all three solvers to be the same
First eigenvector should be kept for clustering. It is only constant for fully connected graphs. Reverts 95d8111
Fix a seed so that the example would not fail in plotting. "discretize" might return empty clusters causing contour to fail
c9cec63
to
fea5b83
Compare
Semantics of the tests should be clearer now. Based on suggestions of @vene
Hi, I'm using this patch locally for some spatially-constrained spectral clustering, and I see the correct behavior when using this patch. The difference is pretty dramatic for my use case. Namely, when you're clustering on a geographic lattice's adjacency matrix, without this fix, the discovered clusters are sometimes not contiguous. As far as I understand it, a binary adjacency matrix should yield connected clusters by this method. I think this is a more dramatic showing of what @glemaitre said when the embedding appears more compact. An example is below from Rook contiguity for US counties. And sometimes poorly-connected nodes are not in the right cluster: If this counts as a "better" example case @jmargeta, I'm happy to contrib. |
build_tools/travis/install.sh
Outdated
@@ -38,8 +38,7 @@ if [[ "$DISTRIB" == "conda" ]]; then | |||
conda update --yes conda | |||
|
|||
TO_INSTALL="python=$PYTHON_VERSION pip pytest pytests-cov" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that removing the quotes on each line also fixes the issue.
Such as:
TO_INSTALL="python=$PYTHON_VERSION pip pytest pytests-cov \
numpy=$NUMPY_VERSION scipy=$SCIPY_VERSION \
cython=$CYTHON_VERSION"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this is more readable :) I should learn bash at some point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Me too :) Sorry for not noticing the pytests-cov typo.
@lesteve I think that I can have my bonus point (even if I only copy paste and mess-up my bash writing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not extremely confident, but I think this is good...
Apart from sorting out the CI, that is. |
build_tools/travis/install.sh
Outdated
source activate testenv | ||
|
||
if [[ -n "$PYAMG_VERSION" ]]; then | ||
conda install --yes pyamg=$PYAMG_VERSION |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you can install pyamg with conda, just use TO_INSTALL before activating the virtualenv, e.g. similar to what you do for pandas.
build_tools/travis/install.sh
Outdated
nomkl cython=$CYTHON_VERSION \ | ||
${PANDAS_VERSION+pandas=$PANDAS_VERSION} | ||
TO_INSTALL="$TO_INSTALL mkl" | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here you need a else clause with nomkl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lesteve I do not understand why. Doesn't conda automatically install the nomkl version?
Do I assume correctly you mean this?
if [[ "$INSTALL_MKL" == "true" ]]; then
TO_INSTALL="$TO_INSTALL mkl"
else
TO_INSTALL="$TO_INSTALL nomkl"
fi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found out why :) Change done. Well spotted.
Pyamg is added to the list of packages to be installed upon creation of conda virtual environment on Travis. Based in Loïc's comment.
Wrapped line with author list.
Based on Loïc's comments
Test that spectral clustering raises an exception if amg solver is selected but pyamg not installed.
I don't understand all the details, but it looks like this is an improvement so I would be in favour of merging. |
Let's be slightly bold and merge this one! Thanks a lot @devanshdalal, @jmargeta and @glemaitre! |
@@ -1,6 +1,6 @@ | |||
import pytest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait that's the first explicit dependency on pytest, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nop, it has been discuss in another PR:
#10081 (comment)
We also added there: https://github.com/scikit-learn/scikit-learn/blob/4f710cdd088aa8851e8b049e4faafa03767fda10/sklearn/preprocessing/tests/test_target.py
There is some in the ColumnTransformer.
Do you wish to revert it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No that's good, sorry I'm out of the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should have ping you there thought
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure that wouldn't have gotten lost in my 17615 unread scikit-learn notifications ;) Hm can I filter stuff that I'm pinged in... good question...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh that reduces it to only 5782 (only unread threads that I'm subscribed to)!
uhm I have to check it. FYI, since #9077 we completely rely on scipy for the computation of the Laplacian and we don't have any backport. |
If there is a bug we have to check there: |
oh, well, #811 is just totally out of date then and I'll close it ;) |
Reference Issue
Fixes #8129
Continuation of and closes #8217 by @devanshdalal
What does this implement/fix? Explain your changes.
This fixes computation of the spectral embedding
(normalization of the spectrum by division instead of multiplication).
The first eigenvector of the graph Laplacian is now constant (property of fully connected graphs).
These changes add tests for untested codebase.
NOTE: Parts depending on pyamg are not automatically tested on Travis
Testing of all backends would require to install optional package pyamg.
Replaces AMG solver test with a more complete test
- when the pyamg dependency is installed, the original fails also on master (untouched for 5 years)
- the new test based on scikit-learn toy image segmentation example instead
compares results of all three solvers to be the same
Any other comments?
The examples are mostly visually similar. With two exceptions:
the quality of the results for discretized label assignment method in examples/cluster/plot_face_segmentation.py demo qualitatively improves
.
In this example a new seed was set to avoid failed plotting for "discretize" option due to empty clusters (matplotlib's contour fail)
in examples/manifold/plot_compare_methods.py, for manifold comparison for SpectralEmbedding looses its spread. A more spread out distribution is reported for similar example in the original papers.
.