FIX Fix multiple severe bugs in non-metric MDS #30514


Merged
merged 32 commits into from Mar 31, 2025

Changes from all commits (32 commits)
439f9b0
Fix multiple bugs in nonmetric MDS
dkobak Dec 20, 2024
505e6f4
Add what's new
dkobak Dec 20, 2024
ebd156c
Remove obsolete test
dkobak Dec 20, 2024
5037cbf
Update example to pass doctest
dkobak Dec 20, 2024
e4de182
Fix the MDS example
dkobak Mar 24, 2025
1cf59c5
Fix the docs and docstrings regarding normalized_stress
dkobak Mar 24, 2025
0ccd609
Add tests
dkobak Mar 24, 2025
40baee9
Add one more test
dkobak Mar 24, 2025
60f7364
Add issue number to test
dkobak Mar 25, 2025
287a950
Add random seed
dkobak Mar 25, 2025
a7bedf4
Add reference to the R implementation
dkobak Mar 25, 2025
437f146
Remove double line break
dkobak Mar 25, 2025
75297f8
Rename some variables
dkobak Mar 26, 2025
4a2346a
Fix the User Guide for MDS
dkobak Mar 26, 2025
8bed24c
Fix the User Guide
dkobak Mar 26, 2025
0a4dc1f
fix the typo that has been fixed in upstream/main
dkobak Mar 27, 2025
28f69d5
Merge branch 'main' into non-metric-mds-bugfixes
dkobak Mar 27, 2025
ce121e0
Update examples/manifold/plot_mds.py
dkobak Mar 27, 2025
e1a8e35
Update examples/manifold/plot_mds.py
dkobak Mar 27, 2025
11110d9
Update doc/modules/manifold.rst
dkobak Mar 27, 2025
f91923f
Update doc/modules/manifold.rst
dkobak Mar 27, 2025
18fc2a8
Update doc/modules/manifold.rst
dkobak Mar 27, 2025
2d9b7bf
update docs
dkobak Mar 27, 2025
93516a0
update docs
dkobak Mar 27, 2025
fb0e1d6
Fix an error in the example
dkobak Mar 27, 2025
2756453
Update doc/modules/manifold.rst
dkobak Mar 27, 2025
b4cb108
Update doc/modules/manifold.rst
dkobak Mar 27, 2025
d24f784
Exclude verbose outputs from coverage tests
dkobak Mar 27, 2025
65edaa9
Merge branch 'non-metric-mds-bugfixes' of github.com:dkobak/scikit-le…
dkobak Mar 27, 2025
45053be
Update doc/modules/manifold.rst
dkobak Mar 28, 2025
fc3b95c
Clarify formulation
dkobak Mar 28, 2025
108321c
Merge branch 'main' into non-metric-mds-bugfixes
dkobak Mar 28, 2025
76 changes: 39 additions & 37 deletions doc/modules/manifold.rst
@@ -418,67 +418,65 @@ Multi-dimensional Scaling (MDS)
 representation of the data in which the distances respect well the
 distances in the original high-dimensional space.

-In general, :class:`MDS` is a technique used for analyzing similarity or
-dissimilarity data. It attempts to model similarity or dissimilarity data as
-distances in a geometric space. The data can be ratings of similarity between
+In general, :class:`MDS` is a technique used for analyzing
+dissimilarity data. It attempts to model dissimilarities as
+distances in a Euclidean space. The data can be ratings of dissimilarity between
 objects, interaction frequencies of molecules, or trade indices between
 countries.

 There exist two types of MDS algorithm: metric and non-metric. In
-scikit-learn, the class :class:`MDS` implements both. In Metric MDS, the input
-similarity matrix arises from a metric (and thus respects the triangular
-inequality), the distances between output two points are then set to be as
-close as possible to the similarity or dissimilarity data. In the non-metric
-version, the algorithms will try to preserve the order of the distances, and
+scikit-learn, the class :class:`MDS` implements both. In metric MDS,
+the distances in the embedding space are set as
+close as possible to the dissimilarity data. In the non-metric
+version, the algorithm will try to preserve the order of the distances, and
 hence seek for a monotonic relationship between the distances in the embedded
-space and the similarities/dissimilarities.
+space and the input dissimilarities.

 .. figure:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_010.png
    :target: ../auto_examples/manifold/plot_lle_digits.html
    :align: center
    :scale: 50


-Let :math:`S` be the similarity matrix, and :math:`X` the coordinates of the
-:math:`n` input points. Disparities :math:`\hat{d}_{ij}` are transformation of
-the similarities chosen in some optimal ways. The objective, called the
-stress, is then defined by :math:`\sum_{i < j} d_{ij}(X) - \hat{d}_{ij}(X)`
+Let :math:`\delta_{ij}` be the dissimilarity matrix between the
+:math:`n` input points (possibly arising as some pairwise distances
+:math:`d_{ij}(X)` between the coordinates :math:`X` of the input points).
+Disparities :math:`\hat{d}_{ij} = f(\delta_{ij})` are some transformation of
+the dissimilarities. The MDS objective, called the raw stress, is then
+defined by :math:`\sum_{i < j} (\hat{d}_{ij} - d_{ij}(Z))^2`,
+where :math:`d_{ij}(Z)` are the pairwise distances between the
+coordinates :math:`Z` of the embedded points.


 .. dropdown:: Metric MDS

-  The simplest metric :class:`MDS` model, called *absolute MDS*, disparities are defined by
-  :math:`\hat{d}_{ij} = S_{ij}`. With absolute MDS, the value :math:`S_{ij}`
-  should then correspond exactly to the distance between point :math:`i` and
-  :math:`j` in the embedding point.
-
-  Most commonly, disparities are set to :math:`\hat{d}_{ij} = b S_{ij}`.
+  In the metric :class:`MDS` model (sometimes also called *absolute MDS*),
+  disparities are simply equal to the input dissimilarities
+  :math:`\hat{d}_{ij} = \delta_{ij}`.

 .. dropdown:: Nonmetric MDS

   Non metric :class:`MDS` focuses on the ordination of the data. If
-  :math:`S_{ij} > S_{jk}`, then the embedding should enforce :math:`d_{ij} <
-  d_{jk}`. For this reason, we discuss it in terms of dissimilarities
-  (:math:`\delta_{ij}`) instead of similarities (:math:`S_{ij}`). Note that
-  dissimilarities can easily be obtained from similarities through a simple
-  transform, e.g. :math:`\delta_{ij}=c_1-c_2 S_{ij}` for some real constants
-  :math:`c_1, c_2`. A simple algorithm to enforce proper ordination is to use a
-  monotonic regression of :math:`d_{ij}` on :math:`\delta_{ij}`, yielding
-  disparities :math:`\hat{d}_{ij}` in the same order as :math:`\delta_{ij}`.
-
-  A trivial solution to this problem is to set all the points on the origin. In
-  order to avoid that, the disparities :math:`\hat{d}_{ij}` are normalized. Note
-  that since we only care about relative ordering, our objective should be
+  :math:`\delta_{ij} > \delta_{kl}`, then the embedding
+  seeks to enforce :math:`d_{ij}(Z) > d_{kl}(Z)`. A simple algorithm
+  to enforce proper ordination is to use an
+  isotonic regression of :math:`d_{ij}(Z)` on :math:`\delta_{ij}`, yielding
+  disparities :math:`\hat{d}_{ij}` that are a monotonic transformation
+  of dissimilarities :math:`\delta_{ij}` and hence have the same ordering.
+  This is done repeatedly after every step of the optimization algorithm.
+  In order to avoid the trivial solution where all embedding points are
+  overlapping, the disparities :math:`\hat{d}_{ij}` are normalized.
+
+  Note that since we only care about relative ordering, our objective should be
   invariant to simple translation and scaling, however the stress used in metric
-  MDS is sensitive to scaling. To address this, non-metric MDS may use a
-  normalized stress, known as Stress-1 defined as
+  MDS is sensitive to scaling. To address this, non-metric MDS returns
+  normalized stress, also known as Stress-1, defined as

   .. math::
-    \sqrt{\frac{\sum_{i < j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i < j} d_{ij}^2}}.
+    \sqrt{\frac{\sum_{i < j} (\hat{d}_{ij} - d_{ij}(Z))^2}{\sum_{i < j}
+    d_{ij}(Z)^2}}.

-  The use of normalized Stress-1 can be enabled by setting `normalized_stress=True`,
-  however it is only compatible with the non-metric MDS problem and will be ignored
-  in the metric case.
+  Normalized Stress-1 is returned if `normalized_stress=True`.

 .. figure:: ../auto_examples/manifold/images/sphx_glr_plot_mds_001.png
    :target: ../auto_examples/manifold/plot_mds.html
@@ -487,6 +485,10 @@ stress, is then defined by :math:`\sum_{i < j} d_{ij}(X) - \hat{d}_{ij}(X)`

 .. rubric:: References

+* `"More on Multidimensional Scaling and Unfolding in R: smacof Version 2"
+  <https://www.jstatsoft.org/article/view/v102i10>`_
+  Mair P., Groenen P., de Leeuw J. Journal of Statistical Software (2022)
+
 * `"Modern Multidimensional Scaling - Theory and Applications"
   <https://www.springer.com/fr/book/9780387251509>`_
   Borg, I.; Groenen P. Springer Series in Statistics (1997)
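As a sanity check on the stress formulas in the documentation change above, both quantities can be computed directly with NumPy. This is an illustrative sketch, not scikit-learn code: the embedding `Z` is random and the disparities `d_hat` are simply perturbed distances standing in for the output of the disparity transformation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
Z = rng.normal(size=(n, 2))  # hypothetical embedding coordinates

# Pairwise distances d_ij(Z) between embedded points
d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)

# Stand-in disparities d_hat (in practice these come from the
# transformation f of the input dissimilarities)
d_hat = d + rng.normal(scale=0.05, size=d.shape)
d_hat = (d_hat + d_hat.T) / 2  # keep the matrix symmetric
iu = np.triu_indices(n, k=1)   # index pairs with i < j

raw_stress = np.sum((d_hat[iu] - d[iu]) ** 2)
stress1 = np.sqrt(raw_stress / np.sum(d[iu] ** 2))
```

Dividing by the total squared embedding distance is what removes the raw stress's dependence on the overall scale of the configuration, which is the motivation for Stress-1 given in the text.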
4 changes: 4 additions & 0 deletions doc/whats_new/upcoming_changes/sklearn.manifold/30514.fix.rst
@@ -0,0 +1,4 @@
+- :class:`manifold.MDS` now correctly handles non-metric MDS. Furthermore,
+  the returned stress value now corresponds to the returned embedding and
+  normalized stress is now allowed for metric MDS.
+  By :user:`Dmitry Kobak <dkobak>`
64 changes: 36 additions & 28 deletions examples/manifold/plot_mds.py
@@ -21,79 +21,87 @@
 from sklearn.decomposition import PCA
 from sklearn.metrics import euclidean_distances

+# Generate the data
 EPSILON = np.finfo(np.float32).eps
 n_samples = 20
-seed = np.random.RandomState(seed=3)
-X_true = seed.randint(0, 20, 2 * n_samples).astype(float)
+rng = np.random.RandomState(seed=3)
+X_true = rng.randint(0, 20, 2 * n_samples).astype(float)
 X_true = X_true.reshape((n_samples, 2))

 # Center the data
 X_true -= X_true.mean()

-similarities = euclidean_distances(X_true)
+# Compute pairwise Euclidean distances
+distances = euclidean_distances(X_true)

-# Add noise to the similarities
-noise = np.random.rand(n_samples, n_samples)
+# Add noise to the distances
+noise = rng.rand(n_samples, n_samples)
 noise = noise + noise.T
-noise[np.arange(noise.shape[0]), np.arange(noise.shape[0])] = 0
-similarities += noise
+np.fill_diagonal(noise, 0)
+distances += noise

 mds = manifold.MDS(
     n_components=2,
     max_iter=3000,
     eps=1e-9,
-    random_state=seed,
+    random_state=42,
     dissimilarity="precomputed",
     n_jobs=1,
 )
-pos = mds.fit(similarities).embedding_
+X_mds = mds.fit(distances).embedding_

 nmds = manifold.MDS(
     n_components=2,
     metric=False,
     max_iter=3000,
     eps=1e-12,
     dissimilarity="precomputed",
-    random_state=seed,
+    random_state=42,
     n_jobs=1,
     n_init=1,
 )
-npos = nmds.fit_transform(similarities, init=pos)
+X_nmds = nmds.fit_transform(distances)

 # Rescale the data
-pos *= np.sqrt((X_true**2).sum()) / np.sqrt((pos**2).sum())
-npos *= np.sqrt((X_true**2).sum()) / np.sqrt((npos**2).sum())
+X_mds *= np.sqrt((X_true**2).sum()) / np.sqrt((X_mds**2).sum())
+X_nmds *= np.sqrt((X_true**2).sum()) / np.sqrt((X_nmds**2).sum())

 # Rotate the data
-clf = PCA(n_components=2)
-X_true = clf.fit_transform(X_true)
-
-pos = clf.fit_transform(pos)
-
-npos = clf.fit_transform(npos)
+pca = PCA(n_components=2)
+X_true = pca.fit_transform(X_true)
+X_mds = pca.fit_transform(X_mds)
+X_nmds = pca.fit_transform(X_nmds)
+
+# Align the sign of PCs
+for i in [0, 1]:
+    if np.corrcoef(X_mds[:, i], X_true[:, i])[0, 1] < 0:
+        X_mds[:, i] *= -1
+    if np.corrcoef(X_nmds[:, i], X_true[:, i])[0, 1] < 0:
+        X_nmds[:, i] *= -1

 fig = plt.figure(1)
 ax = plt.axes([0.0, 0.0, 1.0, 1.0])

 s = 100
 plt.scatter(X_true[:, 0], X_true[:, 1], color="navy", s=s, lw=0, label="True Position")
-plt.scatter(pos[:, 0], pos[:, 1], color="turquoise", s=s, lw=0, label="MDS")
-plt.scatter(npos[:, 0], npos[:, 1], color="darkorange", s=s, lw=0, label="NMDS")
+plt.scatter(X_mds[:, 0], X_mds[:, 1], color="turquoise", s=s, lw=0, label="MDS")
+plt.scatter(X_nmds[:, 0], X_nmds[:, 1], color="darkorange", s=s, lw=0, label="NMDS")
 plt.legend(scatterpoints=1, loc="best", shadow=False)

-similarities = similarities.max() / (similarities + EPSILON) * 100
-np.fill_diagonal(similarities, 0)
 # Plot the edges
-start_idx, end_idx = np.where(pos)
+start_idx, end_idx = np.where(X_mds)
 # a sequence of (*line0*, *line1*, *line2*), where::
 # linen = (x0, y0), (x1, y1), ... (xm, ym)
 segments = [
-    [X_true[i, :], X_true[j, :]] for i in range(len(pos)) for j in range(len(pos))
+    [X_true[i, :], X_true[j, :]] for i in range(len(X_true)) for j in range(len(X_true))
 ]
-values = np.abs(similarities)
+edges = distances.max() / (distances + EPSILON) * 100
+np.fill_diagonal(edges, 0)
+edges = np.abs(edges)
 lc = LineCollection(
-    segments, zorder=0, cmap=plt.cm.Blues, norm=plt.Normalize(0, values.max())
+    segments, zorder=0, cmap=plt.cm.Blues, norm=plt.Normalize(0, edges.max())
 )
-lc.set_array(similarities.flatten())
+lc.set_array(edges.flatten())
 lc.set_linewidths(np.full(len(segments), 0.5))
 ax.add_collection(lc)

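The isotonic-regression step that produces the disparities in non-metric MDS (described in the manifold.rst changes above) can be illustrated with a tiny pool-adjacent-violators sketch. This is a simplified stand-alone version for intuition, not the implementation :class:`MDS` actually uses:

```python
import numpy as np

def pava(y):
    """Least-squares non-decreasing fit via pool-adjacent-violators."""
    means, sizes = [], []
    for v in y:
        means.append(float(v))
        sizes.append(1)
        # Merge adjacent blocks while monotonicity is violated
        while len(means) > 1 and means[-2] > means[-1]:
            m2, s2 = means.pop(), sizes.pop()
            m1, s1 = means.pop(), sizes.pop()
            means.append((m1 * s1 + m2 * s2) / (s1 + s2))
            sizes.append(s1 + s2)
    return np.repeat(means, sizes)

# Embedding distances listed in order of increasing input dissimilarity;
# the disparities are their monotone (non-decreasing) regression
d_in_delta_order = np.array([0.9, 2.2, 1.8, 3.5])
disparities = pava(d_in_delta_order)  # [0.9, 2.0, 2.0, 3.5]
```

Because the fitted values are non-decreasing along the dissimilarity ordering, the resulting disparities have the same ordering as the input dissimilarities, which is exactly the property non-metric MDS optimizes for.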