
FIX Fix multiple severe bugs in non-metric MDS #30514


Merged: 32 commits, Mar 31, 2025

Conversation

dkobak
Contributor

@dkobak dkobak commented Dec 20, 2024

Reference Issues/PRs

Fixes #16846.
Fixes #18933.
Fixes #26999.
Fixes #27028.

What does this implement/fix? Explain your changes.

This fixes several bugs in non-metric MDS. The current implementation is broken: as the example in #27028 shows, the loss increases after a couple of iterations (which should never happen) and the algorithm stops. One error was pointed out in #18933: the upper-triangular part of the disparities matrix was not correctly copied into the lower-triangular part; I fixed it. The second error was that isotonic regression should not be run before the first SMACOF iteration; the SMACOF update should happen first.
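To make the corrected control flow concrete, here is a minimal, self-contained non-metric MDS sketch (my own illustration, not the scikit-learn implementation; the function name `nmds_sketch` and all defaults are made up). It performs the SMACOF (Guttman) update first, only then fits the isotonic regression, and mirrors the disparities into both triangles of the matrix:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def nmds_sketch(D, n_components=2, n_iter=100, eps=1e-9, seed=0):
    """Minimal non-metric MDS illustrating the corrected order of updates."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(size=(n, n_components))
    iu = np.triu_indices(n, k=1)
    disparities = D.copy()  # before the first update, use raw dissimilarities
    old_stress = None
    for _ in range(n_iter):
        # 1) SMACOF (Guttman) update FIRST, with the current disparities.
        dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
        ratio = np.divide(disparities, dist,
                          out=np.zeros_like(dist), where=dist != 0)
        B = -ratio
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))
        X = B @ X / n
        # 2) Only then fit isotonic regression of the new distances on the
        #    dissimilarities to obtain updated disparities.
        dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
        dhat = IsotonicRegression(out_of_bounds="clip").fit_transform(
            D[iu], dist[iu])
        dhat *= np.sqrt((dist[iu] ** 2).sum() / (dhat ** 2).sum())
        disparities = np.zeros_like(D)
        disparities[iu] = dhat
        disparities += disparities.T  # fill BOTH triangles (the #18933 bug)
        # 3) Stress computed AFTER the update (the #16846 bug).
        stress = ((dist[iu] - dhat) ** 2).sum()
        if old_stress is not None and (old_stress - stress) < eps * stress:
            break
        old_stress = stress
    return X, stress
```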

#26999 is a separate issue: machine-precision effects make IsotonicRegression produce out-of-bounds predictions. Very easy fix via out_of_bounds="clip".
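A quick standalone illustration of this failure mode (a sketch of the IsotonicRegression behavior, not the MDS internals): with the default out_of_bounds setting, predictions even marginally outside the training range come back as NaN, which is exactly what a tiny floating-point excess can trigger, whereas out_of_bounds="clip" clamps to the boundary:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 2.0, 4.0])
query = np.array([4.0 + 1e-9])  # just past the training range

# Default out_of_bounds="nan": out-of-range queries yield NaN.
pred_default = IsotonicRegression().fit(x, y).predict(query)
# out_of_bounds="clip": out-of-range queries are clamped to the boundary.
pred_clipped = IsotonicRegression(out_of_bounds="clip").fit(x, y).predict(query)

print(pred_default)  # [nan]
print(pred_clipped)  # [4.]
```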

#16846 is another separate issue: the returned stress value (MDS loss) corresponds to the previous iteration, not to the returned embedding. Very easy fix: move the stress computation after the SMACOF update.

For convenience, I also made a small API change: non-metric MDS uses normalized_stress, yet the current API for some reason does not allow normalized_stress=True when metric=True. I see no reason why this should be disallowed: it works just fine and allows comparing normalized stress between metric and non-metric MDS. So I removed this restriction, which originated in #22562.
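For reference, the normalized stress being compared here is Kruskal's stress-1. A rough standalone version (my own sketch over flattened pairwise values, not the scikit-learn code):

```python
import numpy as np

def stress_1(distances, disparities):
    """Kruskal's stress-1: sqrt(sum((d - dhat)^2) / sum(dhat^2)).

    Being scale-normalized, it is comparable across metric and
    non-metric MDS fits.
    """
    distances = np.asarray(distances, dtype=float)
    disparities = np.asarray(disparities, dtype=float)
    return np.sqrt(((distances - disparities) ** 2).sum()
                   / (disparities ** 2).sum())
```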

Any other comments?

Demonstration on the Digits dataset:

import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
mds = MDS(n_init=1, verbose=10, random_state=42, metric=False, normalized_stress=True, eps=1e-15)
Z = mds.fit_transform(X)
stress = mds.stress_

plt.figure(figsize=(4, 4))
plt.scatter(Z[:,0], Z[:,1], c=y, s=3, cmap="tab10")
plt.title(f"Digits dataset\nNEW non-metric MDS, norm. stress = {stress:.2f}")

Metric MDS: (embedding figure omitted)

Non-metric MDS on current main: (embedding figure omitted)

Non-metric MDS with this fix: (embedding figure omitted)

Apart from that, I would like to refactor the code of _smacof_single() for readability and to make a few API changes:

  1. The current default is n_init=4, which runs MDS four times. Other sklearn classes that offer this functionality use n_init=1 by default, e.g. sklearn.mixture.GaussianMixture. That seems much more sensible to me, so I suggest switching to n_init=1 here as well. The algorithm is slow as it is.
  2. The convergence criterion is odd: it normalizes the stress by np.sqrt((X**2).sum(axis=1)).sum(), but it is unclear why. The default eps=1e-3 leads to bad underconvergence in some cases, e.g. on the Digits dataset in the non-metric case; that is why I used eps=1e-15 in my code above. I would like to change the convergence criterion to (old_stress - stress) / stress < eps.
  3. I think the reasonable default initialization is not random but PCA. If the input is a distance matrix rather than a data matrix, PCA should be replaced with PCoA (aka "classical MDS") for initialization. Unfortunately, sklearn does not have classical MDS implemented; there are open PRs about that, but they have stalled. If/when a ClassicalMDS class is implemented, it should be used as the default initialization for non-metric (and also metric) MDS.

I did not include any of this in this PR because these are not "bug fixes", so I thought they should go into separate PRs. But I can also see an argument for including (1) and (2) in this PR, since the changes are trivially simple. Let me know what you prefer.
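On point (3): since sklearn has no classical MDS, a PCoA-based initialization could look roughly like this (a hypothetical helper via double-centering plus eigendecomposition; `classical_mds` is not an existing sklearn API):

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Classical MDS (PCoA): embed points from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[top], 0.0, None)  # guard tiny negative eigenvalues
    return eigvecs[:, top] * np.sqrt(lam)
```

For a Euclidean distance matrix whose points genuinely lie in n_components dimensions, this recovers the configuration exactly (up to rotation and reflection), which is what makes it attractive as an initialization.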

@dkobak dkobak changed the title Fix multiple severe bugs in non-metric MDS FIX Fix multiple severe bugs in non-metric MDS Dec 20, 2024

github-actions bot commented Dec 20, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 108321c.

@adrinjalali
Member

@antoinebaker could you please have a look at this one?

@antoinebaker
Contributor

antoinebaker commented Mar 21, 2025

Thanks a lot for the PR @dkobak ! It seems that there are indeed several implementation errors in _smacof_single.

As a general question, which reference do you follow for the fix? I have looked at two references so far:

  1. Modern multidimensional scaling, Borg and Groenen. The algorithm is described in "Smacof with Admissibly Transformed Proximities".
  2. The smacof R package. Metric and non-metric MDS are, I think, implemented in smacofSym.

They differ slightly on this point:

The second error was that the isotonic regression should not be run before the first SMACOF iteration. The SMACOF update should happen first.

I think 2. follows your idea, but 1. does the monotonic regression first.

@dkobak
Contributor Author

dkobak commented Mar 21, 2025

Hi @antoinebaker, thanks for looking into this. I would really like to have this fixed.

I think I looked at how this is done in R, as the R implementation seems to me to be the reference one for smacof. It is co-developed by Jan de Leeuw, the original author of the smacof algorithm (1977 paper). If I remember correctly, I looked at exactly the same paper you found: https://www.jstatsoft.org/article/view/v102i10.

In general, it's an alternating algorithm, and for any alternating algorithm it in principle does not matter which part you begin with. In practice, however, one starting order can be a better heuristic for convergence. I have not run many experiments; I just observed that on the Digits dataset it only works one way. And since that is the way it is implemented in R, I thought we should follow suit.

I suspect that with better initialization (via classical MDS, see my point (3) above) this issue of whether isotonic regression goes first or second should not matter. But classical MDS initialization would need to wait for another PR, so for now I would go ahead and mimic the R approach.

Member

@adrinjalali adrinjalali left a comment


Could you please add non-regression tests for all the issues this is fixing?

@ogrisel
Member

ogrisel commented Mar 24, 2025

I also tried to re-run the MDS usage example, and it appears that this PR fixes it:

  • on main: (screenshot omitted)

  • on this branch: (screenshot omitted)

Some variable names in the examples are either confusing (clf for the PCA instance) or not very explicit (pos / npos). The explanations in the example could be improved, in particular to explain the data generating process, and an analysis of the obtained results could be added. But improving the example can be done in a follow-up PR.

@dkobak
Contributor Author

dkobak commented Mar 24, 2025

@ogrisel Hmm, I haven't seen this example before, but it seems that what happens on main is that the sign of the 1st PC axis of MDS/NMDS is flipped compared to X_true. If you were to flip the sign, then it would fit very well. So I think the fact that it "works better" on this branch is accidental. Somebody should just fix the sign of PC1 / PC2 in that example.
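The sign fix mentioned here can be done generically. A small hypothetical helper (`align_signs` is my name, not part of the example) that flips each embedding axis so it correlates positively with the reference configuration:

```python
import numpy as np

def align_signs(Z, X_ref):
    """Flip each column of embedding Z so it correlates positively
    with the corresponding column of the reference X_ref."""
    Zc = Z - Z.mean(axis=0)
    Rc = X_ref - X_ref.mean(axis=0)
    signs = np.sign(np.einsum("ij,ij->j", Zc, Rc))
    signs[signs == 0] = 1.0  # leave uncorrelated axes untouched
    return Z * signs
```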

@ogrisel
Member

ogrisel commented Mar 24, 2025

Could you please add non-regression tests for all the issues this is fixing?

In particular, please add non-regression tests for all the issues fixed by this PR, as per this PR's description.

I checked that the reproducers of #26999 and #16846 are fixed with this PR. Those could probably be adapted as non-regression tests.

I am not sure how to turn #18933 into a non-regression test.

#27028 is nice, but too slow to run as a test. Maybe it could be turned into an MDS usage example, but we already have many examples in the example gallery and we are trying to reduce their number by consolidating them rather than add new ones.

@ogrisel
Member

ogrisel commented Mar 24, 2025

I think I looked at how this is done in R as the R implementation seems to me to be the reference one for smacof

Maybe we could include a test on a toy dataset where scikit-learn used to produce bad results, and check that we get the same results as computed by the R-based reference implementation, both for the stress values and for the output of the transform method on a few arbitrary data points.

We do not want to add a dependency on R and rpy2 to run the scikit-learn tests, but I would not be opposed to including an rpy2-based code snippet as an inline comment to recompute the expected result values in such a test function.

Copy link
Member

@ogrisel ogrisel left a comment


Some more inline feedback below:

@dkobak
Contributor Author

dkobak commented Mar 24, 2025

Some variable names in the examples are either confusing (clf for the PCA instance) or not very explicit (pos / npos).

@ogrisel: I edited the example to fix the sign of PCs, and added it to this PR. Also did some minor edits to rename the variables and improve readability.

@dkobak
Contributor Author

dkobak commented Mar 24, 2025

I have added some tests @ogrisel.

I checked that the reproducers of #26999 and #16846 are fixed with this PR. Those could probably be adapted as non-regression tests.

Done, I added two tests for that.

I am not sure how to turn #18933 into a non-regression test.

I am not sure either, and I think this is not needed.

#27028 is nice, but too slow to run as a test.

I agree. We could add a test that running 3 iterations on Digits makes stress lower than running 2 iterations, but even that takes 2-3 seconds. Let me know if it's acceptable, but for now I did not include that.

Maybe we could include a test on a toy dataset where scikit-learn used to produce bad results and check that we get the same results as computed by the R-based reference implementation

This makes sense but the problem is that I don't have a toy dataset on which NMDS failed before. I only observed it on Digits and other even larger datasets. On very small toy datasets (like in the usage example you showed above) NMDS worked fine.

At least I added a test that NMDS gives lower (normalized) stress than metric MDS. And that both can recover original data.

Member

@ogrisel ogrisel left a comment


I am not familiar with the MDS literature, so I trust you and @antoinebaker that this diff is actually in line with the reference implementation in R and the main references in the literature.

But from the look of the tests and the fact that this PR actually fixes the reproducers reported in the linked issues, this LGTM.

@ogrisel
Member

ogrisel commented Mar 25, 2025

We could add a test that running 3 iterations on Digits makes stress lower than running 2 iterations, but even that takes 2-3 seconds. Let me know if it's acceptable, but for now I did not include that.

That feels a bit too slow. Let's hope the newly introduced tests are good enough as non-regression tests.

Once this is merged, +1 for follow-up PRs to address the other points raised in the descriptions.

Contributor

@antoinebaker antoinebaker left a comment


Thanks for the User Guide, it's much better now. Not sure why the documentation build fails on the CI; maybe try merging main to see if the problem remains.

dkobak and others added 7 commits March 27, 2025 12:46
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
@dkobak
Contributor Author

dkobak commented Mar 27, 2025

maybe try to merge main to see if the problem remains.

This did the trick.

I incorporated all your edits, and did some further edits to ensure consistent notation in the docs.

Contributor

@antoinebaker antoinebaker left a comment


A few typos.

dkobak and others added 4 commits March 27, 2025 14:23
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
@dkobak
Contributor Author

dkobak commented Mar 27, 2025

Thanks @antoinebaker! Now all checks have passed, yay! What's next?

dkobak and others added 2 commits March 28, 2025 11:03
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
@dkobak
Contributor Author

dkobak commented Mar 28, 2025

@adrinjalali I think this is ready. Please take a look and approve if you can; @ogrisel already did.

After this is merged, I will start working on the follow-up PR concerning default n_init and the convergence criterion.

PS. Some checks are failing right now, but they seem to have nothing to do with this PR.

Contributor

@antoinebaker antoinebaker left a comment


LGTM, thanks @dkobak !
The CI failure is unrelated to the PR; see #31098.

Member

@adrinjalali adrinjalali left a comment


Awesome work!

@adrinjalali adrinjalali merged commit 5f91dca into scikit-learn:main Mar 31, 2025
36 checks passed
@dkobak dkobak deleted the non-metric-mds-bugfixes branch March 31, 2025 09:55
@dkobak
Contributor Author

dkobak commented Mar 31, 2025

Dear @ogrisel @adrinjalali @antoinebaker, I went ahead and created PR #31117 to change the defaults for n_init and eps and to adjust the convergence criterion. Feedback very welcome over there.

lucyleeow pushed a commit to lucyleeow/scikit-learn that referenced this pull request Apr 2, 2025
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>