-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FIX Fix multiple severe bugs in non-metric MDS #30514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX Fix multiple severe bugs in non-metric MDS #30514
Conversation
@antoinebaker could you please have a look at this one? |
Thanks a lot for the PR @dkobak ! It seems that there are indeed several implementation errors in As a general question, which reference do you follow to fix the implementation ? I have looked at two references so far:
They slightly differ about this point:
I think 2. follows your idea, but 1. does the monotonic regression first. |
Hi @antoinebaker, thanks for looking into this. I would really like to have this fixed. I think I looked at how this is done in R as the R implementation seems to me to be the reference one for smacof. It is co-developed by Jan de Leeuw who was the original author of the smacof algorithm (1977 paper). If I remember correctly, I looked at exactly the same paper as you found: https://www.jstatsoft.org/article/view/v102i10. In general, it's an alternating algorithm, and as with any alternating algorithm it does not matter in principle which part to begin with. However, in practice, one way of starting it can be a better heuristic to ensure better convergence. I have not done many experiments, I just observed that on the Digits dataset, it only works one way. And as that is the way that is implemented in R, I thought we should follow that. I suspect that with better initialization (via classical MDS, see my point (3) above) this issue of whether isotonic regression goes first or second should not matter. But classical MDS initialization would need to wait for another PR, so for now I would go ahead and mimic the R approach. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add non-regression tests for all the issues this is fixing?
I also tried to re-run the MDS usage example, and it appears that this PR fixes it:
Some variable names in the examples are either confusing ( |
@ogrisel Hmm, I haven't seen this example before, but it seems that what happens on |
In particular, please add non-regression tests for all the issues fixed by this PR as per-this PR's description. I checked that the reproducers of #26999 and #16846 are fixed with this PR. Those could probably be adapted as non-regression tests. I am not sure how to turn #18933 into a non-regression test. #27028 is nice, but too slow to run as a test. Maybe it could be turned into an MDS usage example, but we already have many examples in the example gallery and we are trying to reduce their number by consolidating them rather than add new ones. |
Maybe we could include a test on a toy dataset where scikit-learn used to produce bad results and check that we get the same results as computed by the R-based reference implementation, both for the stress values and the output of the We do not want to include a dependency to R and rpy2 to run the scikit-learn test, but I would not be opposed to including some rpy2-based code snippet as an inline comment to recompute the expected result values in such a test function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more inline feedback below:
@ogrisel: I edited the example to fix the sign of PCs, and added it to this PR. Also did some minor edits to rename the variables and improve readability. |
I have added some tests @ogrisel.
Done, I added two tests for that.
I am not sure either, and I think this is not needed.
I agree. We could add a test that running 3 iterations on Digits makes stress lower than running 2 iterations, but even that takes 2-3 seconds. Let me know if it's acceptable, but for now I did not include that.
This makes sense but the problem is that I don't have a toy dataset on which NMDS failed before. I only observed it on Digits and other even larger datasets. On very small toy datasets (like in the usage example you showed above) NMDS worked fine. At least I added a test that NMDS gives lower (normalized) stress than metric MDS. And that both can recover original data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not familiar with the MDS literature, so I trust you and @antoinebaker that this diff is actually in line with the reference implementation in R and the main references in the literature.
But from the look of the tests and the fact that this PR actually fixes the reproducers reported in the linked issues, this LGTM.
That feels a bit too slow. Let's hope the newly introduced tests are good enough as non-regression tests. Once this is merged, +1 for follow-up PRs to address the other points raised in the descriptions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for User Guide, it's much better now. Not sure why the documentation fails on the CI, maybe try to merge main to see if the problem remains.
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
This did the trick. I incorporated all your edits, and did some further edits to ensure consistent notation in the docs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few typos.
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
…arn into non-metric-mds-bugfixes
Thanks @antoinebaker! Now all checks have passed, yay! What's next? |
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
@adrinjalali I think this is ready. Take a look if you can approve it. @ogrisel already did. After this is merged, I will start working on the follow-up PR concerning default PS. Some checks are failing right now, but they seem to have nothing to do with this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work!
Dear @ogrisel @adrinjalali @antoinebaker, I went ahead and created PR #31117 to change the defaults for |
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
Reference Issues/PRs
Fixes #16846.
Fixes #18933.
Fixes #26999.
Fixes #27028.
What does this implement/fix? Explain your changes.
This fixes several bugs in non-metric MDS. The current implementation of non-metric MDS has severe implementation errors and is broken. See example in #27028: the loss increases after a couple of iterations (which should not happen) and algorithm stops. One error was pointed out in #18933: the upper-triangular part of the disparities matrix was not correctly copied into the lower-triangular part. I fixed it. The second error was that the isotonic regression should not be run before the first SMACOF iteration. The SMACOF update should happen first.
#26999 is a separate issue due to
IsotonicRegression
triggeringout_of_bounds
errors due to machine precision issues. Very easy fix viaout_of_bounds="clip"
.#16846 is another separate issue: the returns stress value (MDS loss) does not correspond to the returned embedding, but to the previous iteration. Very easy fix by moving the stress computation after the SMACOF update.
For convenience, I also did a small API change: non-metric MDS uses
normalized_stress
and the current API for some reason does not allownormalized_stress=True
whenmetric=True
. I don't see any reason why this should not be allowed. It works just fine, and allows to compare normalized stress between metric and non-metric MDS. So I removed this restriction. It originated in #22562.Any other comments?
Demonstration on the Digits dataset:
Metric MDS:
Non-metric MDS current:
Non-metric MDS with this fix:
Apart from that, I would like to refactor the code of
_smacof_single()
for readability and also to do two API changes:n_init=4
which runs MDS four times. Other sklearn classes that offer this functionality usen_init=1
by default, e.g.sklearn.mixture.GaussianMixture
. This seems much more sensible to me, so I would suggest to switch ton_init=1
here as well. The algorithm is slow as it is.np.sqrt((X**2).sum(axis=1)).sum()
but it is unclear why. The defaulteps=1e-3
leads to bad underconvergence in some cases, e.g. on the Digits dataset in non-metric case. That's why I usedeps=1e-15
in my code above. I would like to change the convergence criterion to(old_stress - stress) / stress < eps
.ClassicalMDS
class is implemented, that should be use as the default initialization for non-metric (and also metric) MDS.I did not include any of this into this PR because these are not "bug fixes" so I thought it should be separate PRs. But I can also see an argument for including (1) and (2) inside this PR as the changes are trivially simple. Let me know what you prefer.