Implement classical MDS #31322

Open: wants to merge 8 commits into base `main`.
9 changes: 9 additions & 0 deletions doc/modules/manifold.rst
@@ -486,6 +486,15 @@ coordinates :math:`Z` of the embedded points.
:align: center
:scale: 60

Apart from that, there is a variant called *classical MDS*, also known as
*principal coordinates analysis (PCoA)* or *Torgerson's scaling*, implemented
in the separate :class:`ClassicalMDS` class. Classical MDS replaces the stress
loss function with a different loss function called *strain*, which admits an
exact solution via eigendecomposition of the double-centered matrix of squared
dissimilarities. If the dissimilarity matrix holds the pairwise Euclidean
distances between some vectors, then classical MDS is equivalent to PCA applied
to those vectors.
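
The PCA equivalence can be checked with a short standalone sketch (illustrative only; it performs the double centering by hand with NumPy rather than using the new class):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.randn(30, 5)

# Classical MDS: double-center the squared Euclidean distances
D2 = pairwise_distances(X) ** 2
B = -0.5 * (D2 - D2.mean(axis=0) - D2.mean(axis=1, keepdims=True) + D2.mean())

# Embed with the top eigenvectors scaled by the square roots of the eigenvalues
w, U = np.linalg.eigh(B)
order = np.argsort(w)[::-1][:2]
emb = U[:, order] * np.sqrt(w[order])

# PCA of the same vectors gives the same coordinates up to per-axis sign
emb_pca = PCA(n_components=2).fit_transform(X)
assert np.allclose(np.abs(emb), np.abs(emb_pca))
```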

.. rubric:: References

* `"More on Multidimensional Scaling and Unfolding in R: smacof Version 2"
@@ -0,0 +1,5 @@
- :class:`manifold.ClassicalMDS` was implemented to perform classical MDS
(eigendecomposition of the double-centered distance matrix). Furthermore,
:class:`manifold.MDS` now supports different pairwise distance metrics,
not only the Euclidean metric.
By :user:`Dmitry Kobak <dkobak>`
1 change: 1 addition & 0 deletions examples/manifold/plot_compare_methods.py
@@ -167,6 +167,7 @@ def add_2d_scatter(ax, points, points_color, title=None):
n_components=n_components,
max_iter=50,
n_init=1,
init="classical_mds",
random_state=0,
normalized_stress=False,
)
2 changes: 1 addition & 1 deletion examples/manifold/plot_lle_digits.py
@@ -130,7 +130,7 @@ def plot_embedding(X, title):
"LTSA LLE embedding": LocallyLinearEmbedding(
n_neighbors=n_neighbors, n_components=2, method="ltsa"
),
"MDS embedding": MDS(n_components=2, n_init=1, max_iter=120, eps=1e-6),
"MDS embedding": MDS(n_components=2, n_init=1, init="classical_mds"),
"Random Trees embedding": make_pipeline(
RandomTreesEmbedding(n_estimators=200, max_depth=5, random_state=0),
TruncatedSVD(n_components=2),
4 changes: 2 additions & 2 deletions examples/manifold/plot_mds.py
@@ -54,8 +54,8 @@
mds = manifold.MDS(
n_components=2,
max_iter=3000,
- eps=1e-9,
  n_init=1,
+ init="random",
random_state=42,
dissimilarity="precomputed",
n_jobs=1,
@@ -66,7 +66,7 @@
n_components=2,
metric=False,
max_iter=3000,
- eps=1e-12,
+ init="random",
dissimilarity="precomputed",
random_state=42,
n_jobs=1,
2 changes: 2 additions & 0 deletions sklearn/manifold/__init__.py
@@ -3,6 +3,7 @@
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

from ._classical_mds import ClassicalMDS
from ._isomap import Isomap
from ._locally_linear import LocallyLinearEmbedding, locally_linear_embedding
from ._mds import MDS, smacof
@@ -12,6 +13,7 @@
__all__ = [
"MDS",
"TSNE",
"ClassicalMDS",
"Isomap",
"LocallyLinearEmbedding",
"SpectralEmbedding",
182 changes: 182 additions & 0 deletions sklearn/manifold/_classical_mds.py
@@ -0,0 +1,182 @@
"""
Classical multi-dimensional scaling (classical MDS).
"""

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

from numbers import Integral

import numpy as np
from scipy import linalg

from ..base import BaseEstimator, _fit_context
from ..metrics import pairwise_distances
from ..utils import check_symmetric
from ..utils._param_validation import Interval
from ..utils.extmath import svd_flip
from ..utils.validation import validate_data


class ClassicalMDS(BaseEstimator):
"""Classical multidimensional scaling.

Read more in the :ref:`User Guide <multidimensional_scaling>`.

Parameters
----------
n_components : int, default=2
Number of embedding dimensions.

dissimilarity : str or callable, default='euclidean'
Metric to use for dissimilarity computation. Default is "euclidean".
See the documentation of `scipy.spatial.distance
<https://docs.scipy.org/doc/scipy/reference/spatial.distance.html>`_ and
the metrics listed in
:class:`~sklearn.metrics.pairwise.distance_metrics` for valid metric
values.

If metric is "precomputed", X is assumed to be a distance matrix and
must be square during fit.

If metric is a callable, it takes two arrays representing 1D
vectors as inputs and must return one value indicating the distance
between those vectors. This works for SciPy's metrics, but is less
efficient than passing the metric name as a string.

Attributes
----------
embedding_ : ndarray of shape (n_samples, n_components)
Stores the position of the dataset in the embedding space.

dissimilarity_matrix_ : ndarray of shape (n_samples, n_samples)
Pairwise dissimilarities between the points.

eigenvalues_ : ndarray of shape (n_components,)
Eigenvalues of the double-centered dissimilarity matrix, corresponding
to each of the selected components. They are equal to the squared 2-norms
of the `n_components` variables in the embedding space.

n_features_in_ : int
Number of features seen during :term:`fit`.

feature_names_in_ : ndarray of shape (`n_features_in_`,)
Names of features seen during :term:`fit`. Defined only when `X`
has feature names that are all strings.

See Also
--------
sklearn.decomposition.PCA : Principal component analysis.
MDS : Metric and non-metric MDS.

References
----------
.. [1] "Modern Multidimensional Scaling - Theory and Applications" Borg, I.;
Groenen P. Springer Series in Statistics (1997)

Examples
--------
>>> from sklearn.datasets import load_digits
>>> from sklearn.manifold import ClassicalMDS
>>> X, _ = load_digits(return_X_y=True)
>>> X.shape
(1797, 64)
>>> cmds = ClassicalMDS(n_components=2)
>>> X_emb = cmds.fit_transform(X[:100])
>>> X_emb.shape
(100, 2)
"""

_parameter_constraints: dict = {
"n_components": [Interval(Integral, 1, None, closed="left")],
"dissimilarity": [str, callable],
}

def __init__(
self,
n_components=2,
*,
dissimilarity="euclidean",
):
self.n_components = n_components
self.dissimilarity = dissimilarity

def __sklearn_tags__(self):
tags = super().__sklearn_tags__()
tags.input_tags.pairwise = self.dissimilarity == "precomputed"
return tags

def fit(self, X, y=None):
"""
Compute the embedding positions.

Parameters
----------
X : array-like of shape (n_samples, n_features) or \
(n_samples, n_samples)
Input data. If ``dissimilarity=='precomputed'``, the input should
be the dissimilarity matrix.

y : Ignored
Not used, present for API consistency by convention.

Returns
-------
self : object
Fitted estimator.
"""
self.fit_transform(X)
return self

@_fit_context(prefer_skip_nested_validation=True)
def fit_transform(self, X, y=None):
"""
Compute the embedding positions.

Parameters
----------
X : array-like of shape (n_samples, n_features) or \
(n_samples, n_samples)
Input data. If ``dissimilarity=='precomputed'``, the input should
be the dissimilarity matrix.

y : Ignored
Not used, present for API consistency by convention.

Returns
-------
X_new : ndarray of shape (n_samples, n_components)
The embedding coordinates.
"""

X = validate_data(self, X)

if self.dissimilarity == "precomputed":
self.dissimilarity_matrix_ = X
self.dissimilarity_matrix_ = check_symmetric(
self.dissimilarity_matrix_, raise_exception=True
)
else:
self.dissimilarity_matrix_ = pairwise_distances(
X, metric=self.dissimilarity
)

# Double centering
B = self.dissimilarity_matrix_**2
B = B.astype(np.float64)
B -= np.mean(B, axis=0)
B -= np.mean(B, axis=1, keepdims=True)
B *= -0.5

# Eigendecomposition
w, U = linalg.eigh(B)
w = w[::-1][: self.n_components]
U = U[:, ::-1][:, : self.n_components]

# Set the signs of eigenvectors to enforce deterministic output
U, _ = svd_flip(U, None)

# Clip eigenvalues at zero: non-Euclidean dissimilarities can yield
# small negative eigenvalues, and np.sqrt would produce NaNs
self.embedding_ = np.sqrt(np.clip(w, 0, None)) * U
self.eigenvalues_ = w

return self.embedding_
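
As a sanity check of the double-centering and eigendecomposition steps above, the following standalone sketch (NumPy/SciPy only; variable names are illustrative) confirms that for Euclidean input, with `n_components` equal to the data rank, the embedding reproduces the input distances:

```python
import numpy as np
from scipy import linalg
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
D = pairwise_distances(X)  # Euclidean dissimilarities

# The same double centering as in fit_transform (scaling and centering commute)
B = -0.5 * D**2
B -= B.mean(axis=0)
B -= B.mean(axis=1, keepdims=True)

# Keep the top 3 eigenpairs: X has 3 features, so B has rank at most 3
w, U = linalg.eigh(B)
w, U = w[::-1][:3], U[:, ::-1][:, :3]
emb = np.sqrt(np.clip(w, 0.0, None)) * U

# The embedding reproduces the original distances up to numerical precision
assert np.allclose(pairwise_distances(emb), D)
```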