-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH Change the default n_init
and eps
for MDS
#31117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
n_init
for MDSn_init
and eps
for MDS
@antoinebaker IMHO it would be great to have this in 1.7 so that the deprecation cycle rolls out by 1.9... Would be amazing if you could take a look. Thanks! |
Will try ! But not sure we can make it to the 1.7 release, which is coming very soon :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @dkobak for the follow-up PR !
The stopping criterion makes much more sense now. The plot_mds.py example is improved with the tutorial like presentation.
Even if our goal is not to reproduce the R code exactly, could you provide a comparison of the R / sklearn results on some examples to see if the found stress
are in the same ballpark ? I guess comparing the embeddings is more difficult because of rotational invariance.
Below a few suggestions to ease maintenance.
@antoinebaker Thanks for the review, I added all your suggestions.
This makes sense, however I am not a R user and so cannot do this very easily... If you think this is important, I can try to set it up. |
Okay, I am trying to set up the comparison. I'm taking the data from here: https://cran.r-project.org/web/packages/smacof/vignettes/mdsnutshell.html S = np.array([[0. , 0.86, 0.42, 0.42, 0.18, 0.06, 0.07, 0.04, 0.02, 0.07, 0.09,
0.12, 0.13, 0.16],
[0.86, 0. , 0.5 , 0.44, 0.22, 0.09, 0.07, 0.07, 0.02, 0.04, 0.07,
0.11, 0.13, 0.14],
[0.42, 0.5 , 0. , 0.81, 0.47, 0.17, 0.1 , 0.08, 0.02, 0.01, 0.02,
0.01, 0.05, 0.03],
[0.42, 0.44, 0.81, 0. , 0.54, 0.25, 0.1 , 0.09, 0.02, 0.01, 0. ,
0.01, 0.02, 0.04],
[0.18, 0.22, 0.47, 0.54, 0. , 0.61, 0.31, 0.26, 0.07, 0.02, 0.02,
0.01, 0.02, 0. ],
[0.06, 0.09, 0.17, 0.25, 0.61, 0. , 0.62, 0.45, 0.14, 0.08, 0.02,
0.02, 0.02, 0.01],
[0.07, 0.07, 0.1 , 0.1 , 0.31, 0.62, 0. , 0.73, 0.22, 0.14, 0.05,
0.02, 0.02, 0. ],
[0.04, 0.07, 0.08, 0.09, 0.26, 0.45, 0.73, 0. , 0.33, 0.19, 0.04,
0.03, 0.02, 0.02],
[0.02, 0.02, 0.02, 0.02, 0.07, 0.14, 0.22, 0.33, 0. , 0.58, 0.37,
0.27, 0.2 , 0.23],
[0.07, 0.04, 0.01, 0.01, 0.02, 0.08, 0.14, 0.19, 0.58, 0. , 0.74,
0.5 , 0.41, 0.28],
[0.09, 0.07, 0.02, 0. , 0.02, 0.02, 0.05, 0.04, 0.37, 0.74, 0. ,
0.76, 0.62, 0.55],
[0.12, 0.11, 0.01, 0.01, 0.01, 0.02, 0.02, 0.03, 0.27, 0.5 , 0.76,
0. , 0.85, 0.68],
[0.13, 0.13, 0.05, 0.02, 0.02, 0.02, 0.02, 0.02, 0.2 , 0.41, 0.62,
0.85, 0. , 0.76],
[0.16, 0.14, 0.03, 0.04, 0. , 0.01, 0. , 0.02, 0.23, 0.28, 0.55,
0.68, 0.76, 0. ]])
D = 1 - S
np.fill_diagonal(D, 0) I'm running R here: https://webr.r-wasm.org/latest/ library(smacof)
D = 1 - ekman
mds(D)
mds(D, type="ordinal") Results:
In Python (this branch): np.random.seed(42)
MDS(dissimilarity="precomputed", metric=True, normalized_stress=True, n_init=1).fit(D).stress_ # 0.137
MDS(dissimilarity="precomputed", metric=True, normalized_stress=True, n_init=1, eps=1e-6).fit(D).stress_ # 0.132
MDS(dissimilarity="precomputed", metric=False, normalized_stress=True, n_init=1).fit(D).stress_ # 0.372
MDS(dissimilarity="precomputed", metric=False, normalized_stress=True, n_init=1, eps=1e-6).fit(D).stress_ # 0.031 I think it fits well enough! Clearly the new Note: I cannot get down to 0.023 stress-1 with the Python implementation, and I am not sure why, but this is out of scope of this PR. For reference, here is the R code: https://github.com/cran/smacof/blob/master/R/smacofSym.R. |
@antoinebaker I fixed the issue in my comparison code (had to add |
Thanks for comparison examples! To share and reproduce, could you provide them as gists for the python and R code ? The python one can be a notebook, the R code just a plain script with the results commented as you have done. Another possibility is to have a small public repository with the two code snippets. |
For the This PR is already a much needed improvement for MDS, so +1 to merge it as is. |
Here is the gist, I put the R code into the comments on top: https://gist.github.com/dkobak/3daf73495b3da3b23dfbfd6b0b441030 |
Any chance to get it into 1.7? I am not sure what the timeline for that is. |
Note that the suggestions in my review above are incomplete: tests also need to be updated accordingly. |
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Thanks @ogrisel. I have edited the PR to make the change of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much for the quick follow-up. Here is a final comment but otherwise LGTM.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few nitpicks (mainly removing superfluous eps=1e-6
as it is the new default), otherwise LGTM !
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
I enabled auto-merge so that the PR will be merged if CI is green after this commit. @dkobak instead of committing individual review suggestions, you can press the "add suggestion to batch" from the diff view of the PR and then do a single commit for a group of suggestions to avoid putting unnecessary CI usage. |
Thanks. I noticed that the what's-new entries had wrong markup formatting (mea culpa) and also found many other existing what's-new entries for 1.7 that have wrong markup and render incorrectly. I fixed all of them in #31272. |
This is a follow-up to #30514 and has been discussed in there to some extent. It fixes two issues:
Current default in MDS is
n_init=4
, which runs MDS four times. Other sklearn classes that offer this functionality usen_init=1
by default, e.g.sklearn.mixture.GaussianMixture
. This appears much more sensible to me. So I change the default ton_init=1
.The convergence criterion was really strange and unmotivated, and the default
eps
led to really bad underconvergence on some simple datasets. I am changing it to the convergence criterion that (i) roughly follows the R implementation, that (ii) makes sense for both metric and non-metric MDS, and that (iii) is not affected by any rescaling of the input matrixX
. The new convergence criterion is((old_stress - stress) / ((distances.ravel() ** 2).sum() / 2)) < eps
and the defaulteps=1e-6
as in the R implementation.Apart from that, I fixed the formula for the "normalized stress" aka "stress-1" (as discussed in the previous PR), and added several tests.
I implemented FutureWarnings until v1.9 and corresponding tests.
Here is the result of running this code with the new default parameters on a small subset of Digits dataset.
Note that both embeddings converge within ~200 iterations, and that non-metric MDS has lower normalized stress than metric MDS, as expected.
Running this with current default (removing
n_init=1, eps=1e-6
from the MDS calls) produces awful results, as the convergence criterion hits way too early:Almost the same thing happens on
main
. So in my opinion the currenteps
value is dysfunctional, especially for non-metric MDS, and the currentn_init
value is a waste of computations.