FEA Implement classical MDS #31322
Status: Open
dkobak wants to merge 19 commits into scikit-learn:main from dkobak:classical-mds
+275 −0
Commits (19)

33b9600  Implement Classical MDS (dkobak)
5184552  Rename what's new file (dkobak)
c215d0a  validate input data (dkobak)
d6773c7  describe attributes (dkobak)
b4f49d5  Merge branch 'main' into classical-mds (dkobak)
f018564  Make classical MDS default init for MDS (dkobak)
99854db  Fix tests (dkobak)
8134b9f  Merge branch 'main' into classical-mds (dkobak)
bafac04  Take all changes to the MDS class out of this PR (dkobak)
c408bd3  Add metric_params (dkobak)
77732f6  Fix kwargs (dkobak)
804c524  Rename kwargs (dkobak)
0999113  Rename dissimilarity into metric (dkobak)
da28a2d  Add a test for metric_params (dkobak)
509c6d1  Apply suggestions from code review (dkobak)
dbaccd0  lint (dkobak)
2f66df3  Add new class to the API reference (dkobak)
737aaea  Improve tests (dkobak)
eb37e39  Merge branch 'main' into classical-mds (dkobak)
doc/whats_new/upcoming_changes/sklearn.manifold/31322.feature.rst (3 additions, 0 deletions)
Review comment: This needs to be a
@@ -0,0 +1,3 @@
- :class:`manifold.ClassicalMDS` was implemented to perform classical MDS
  (eigendecomposition of the double-centered distance matrix).
  By :user:`Dmitry Kobak <dkobak>` and :user:`Meekail Zain <Micky774>`
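The "double-centered distance matrix" in the changelog entry refers to a classical identity: for squared Euclidean distances D², the matrix -½·J·D²·J (with J the centering matrix) equals the Gram matrix of the column-centered data. A minimal NumPy check of this identity (illustrative only, not part of the PR; the data is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Xc = X - X.mean(axis=0)  # column-centered data

# Squared Euclidean distances and the centering matrix J
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
n = X.shape[0]
J = np.eye(n) - np.ones((n, n)) / n

# Double centering recovers the Gram matrix of the centered data
B = -0.5 * J @ D2 @ J
assert np.allclose(B, Xc @ Xc.T)
```

This is why the eigendecomposition of the double-centered matrix yields a Euclidean embedding.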
New file (+192 lines):

@@ -0,0 +1,192 @@
""" | ||
Classical multi-dimensional scaling (classical MDS). | ||
""" | ||
|
||
# Authors: The scikit-learn developers | ||
# SPDX-License-Identifier: BSD-3-Clause | ||
|
||
from numbers import Integral | ||
|
||
import numpy as np | ||
from scipy import linalg | ||
|
||
from ..base import BaseEstimator, _fit_context | ||
from ..metrics import pairwise_distances | ||
from ..utils import check_symmetric | ||
from ..utils._param_validation import Interval | ||
from ..utils.extmath import svd_flip | ||
from ..utils.validation import validate_data | ||
|
||
|
||
class ClassicalMDS(BaseEstimator): | ||
"""Classical multidimensional scaling. | ||
|
||
Read more in the :ref:`User Guide <multidimensional_scaling>`. | ||
|
||
Parameters | ||
---------- | ||
n_components : int, default=2 | ||
Number of embedding dimensions. | ||
|
||
metric : str or callable, default='euclidean' | ||
Metric to use for dissimilarity computation. Default is "euclidean". | ||
|
||
If metric is a string, it must be one of the options allowed by | ||
`scipy.spatial.distance.pdist` for its metric parameter, or a metric | ||
listed in :func:`sklearn.metrics.pairwise.distance_metrics` | ||
|
||
If metric is "precomputed", X is assumed to be a distance matrix and | ||
must be square during fit. | ||
|
||
If metric is a callable function, it takes two arrays representing 1D | ||
vectors as inputs and must return one value indicating the distance | ||
between those vectors. This works for Scipy's metrics, but is less | ||
efficient than passing the metric name as a string. | ||
|
||
metric_params : dict, default=None | ||
Additional keyword arguments for the dissimilarity computation. | ||
|
||
Attributes | ||
---------- | ||
embedding_ : ndarray of shape (n_samples, n_components) | ||
Stores the position of the dataset in the embedding space. | ||
|
||
dissimilarity_matrix_ : ndarray of shape (n_samples, n_samples) | ||
Pairwise dissimilarities between the points. | ||
|
||
eigenvalues_ : ndarray of shape (n_components,) | ||
Eigenvalues of the double-centered dissimilarity matrix, corresponding | ||
to each of the selected components. They are equal to the squared 2-norms | ||
of the `n_components` variables in the embedding space. | ||
|
||
n_features_in_ : int | ||
Number of features seen during :term:`fit`. | ||
|
||
feature_names_in_ : ndarray of shape (`n_features_in_`,) | ||
Names of features seen during :term:`fit`. Defined only when `X` | ||
has feature names that are all strings. | ||
|
||
See Also | ||
-------- | ||
sklearn.decomposition.PCA : Principal component analysis. | ||
MDS : Metric and non-metric MDS. | ||
|
||
References | ||
---------- | ||
.. [1] "Modern Multidimensional Scaling - Theory and Applications" Borg, I.; | ||
Groenen P. Springer Series in Statistics (1997) | ||
|
||
Examples | ||
-------- | ||
>>> from sklearn.datasets import load_digits | ||
>>> from sklearn.manifold import ClassicalMDS | ||
>>> X, _ = load_digits(return_X_y=True) | ||
>>> X.shape | ||
(1797, 64) | ||
>>> cmds = ClassicalMDS(n_components=2) | ||
>>> X_emb = cmds.fit_transform(X[:100]) | ||
>>> X_emb.shape | ||
(100, 2) | ||
""" | ||
|
||
_parameter_constraints: dict = { | ||
"n_components": [Interval(Integral, 1, None, closed="left")], | ||
"metric": [str, callable], | ||
"metric_params": [dict, None], | ||
} | ||
|
||
def __init__( | ||
self, | ||
n_components=2, | ||
*, | ||
metric="euclidean", | ||
metric_params=None, | ||
): | ||
self.n_components = n_components | ||
self.metric = metric | ||
self.metric_params = metric_params | ||
|
||
def __sklearn_tags__(self): | ||
tags = super().__sklearn_tags__() | ||
tags.input_tags.pairwise = self.metric == "precomputed" | ||
return tags | ||
|
||
def fit(self, X, y=None): | ||
""" | ||
Compute the embedding positions. | ||
|
||
Parameters | ||
---------- | ||
X : array-like of shape (n_samples, n_features) or \ | ||
(n_samples, n_samples) | ||
Input data. If ``metric=='precomputed'``, the input should | ||
be the dissimilarity matrix. | ||
|
||
y : Ignored | ||
Not used, present for API consistency by convention. | ||
|
||
Returns | ||
------- | ||
self : object | ||
Fitted estimator. | ||
""" | ||
self.fit_transform(X) | ||
return self | ||
|
||
@_fit_context(prefer_skip_nested_validation=True) | ||
def fit_transform(self, X, y=None): | ||
""" | ||
Compute and return the embedding positions. | ||
|
||
Parameters | ||
---------- | ||
X : array-like of shape (n_samples, n_features) or \ | ||
(n_samples, n_samples) | ||
Input data. If ``metric=='precomputed'``, the input should | ||
be the dissimilarity matrix. | ||
|
||
y : Ignored | ||
Not used, present for API consistency by convention. | ||
|
||
Returns | ||
------- | ||
X_new : ndarray of shape (n_samples, n_components) | ||
The embedding coordinates. | ||
""" | ||
|
||
X = validate_data(self, X) | ||
|
||
if self.metric == "precomputed": | ||
self.dissimilarity_matrix_ = X | ||
self.dissimilarity_matrix_ = check_symmetric( | ||
self.dissimilarity_matrix_, raise_exception=True | ||
) | ||
else: | ||
self.dissimilarity_matrix_ = pairwise_distances( | ||
X, | ||
metric=self.metric, | ||
**(self.metric_params if self.metric_params is not None else {}), | ||
) | ||
|
||
# Double centering | ||
B = self.dissimilarity_matrix_**2 | ||
B = B.astype(np.float64) | ||
B -= np.mean(B, axis=0) | ||
B -= np.mean(B, axis=1, keepdims=True) | ||
B *= -0.5 | ||
|
||
# Eigendecomposition | ||
w, U = linalg.eigh(B) | ||
|
||
# Reversing the order of the eigenvalues/eigenvectors to put | ||
# the eigenvalues in decreasing order | ||
w = w[::-1][: self.n_components] | ||
U = U[:, ::-1][:, : self.n_components] | ||
|
||
# Set the signs of eigenvectors to enforce deterministic output | ||
U, _ = svd_flip(U, None) | ||
|
||
self.embedding_ = np.sqrt(w) * U | ||
self.eigenvalues_ = w | ||
|
||
return self.embedding_ |
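For readers skimming the diff, the numerical core of `fit_transform` can be reproduced in a few lines of plain NumPy. This is a sketch under stated assumptions (Euclidean distances on synthetic full-rank data; `np.linalg.eigh` stands in for `scipy.linalg.eigh`, and the `svd_flip` sign convention is omitted since it does not affect distances):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))

# Pairwise Euclidean distances (stand-in for pairwise_distances)
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

# Double centering, mirroring fit_transform above
B = D ** 2
B = B - B.mean(axis=0)
B = B - B.mean(axis=1, keepdims=True)
B *= -0.5

# Eigendecomposition, largest eigenvalues first
w, U = np.linalg.eigh(B)
w, U = w[::-1][:3], U[:, ::-1][:, :3]
Z = np.sqrt(w) * U

# With n_components equal to the data rank, the embedding
# reproduces the original pairwise distances.
D_emb = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
assert np.allclose(D, D_emb)
```

Subtracting the column means and then the row means is equivalent to multiplying by the centering matrix on both sides, which is why the in-place version in the estimator avoids materializing J.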
New file (+68 lines):

@@ -0,0 +1,68 @@
import numpy as np
import pytest
from numpy.testing import assert_allclose

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import ClassicalMDS
from sklearn.metrics import euclidean_distances


def test_classical_mds_equivalent_to_pca():
    X, _ = load_iris(return_X_y=True)

    cmds = ClassicalMDS(n_components=2, metric="euclidean")
    pca = PCA(n_components=2)

    Z1 = cmds.fit_transform(X)
    Z2 = pca.fit_transform(X)

    # Swap the signs if necessary
    for comp in range(2):
        if Z1[0, comp] < 0 and Z2[0, comp] > 0:
            Z2[:, comp] *= -1

    assert_allclose(Z1, Z2)

    assert_allclose(np.sqrt(cmds.eigenvalues_), pca.singular_values_)


def test_classical_mds_equivalent_on_data_and_distances():
    X, _ = load_iris(return_X_y=True)

    cmds = ClassicalMDS(n_components=2, metric="euclidean")
    Z1 = cmds.fit_transform(X)

    cmds = ClassicalMDS(n_components=2, metric="precomputed")
    Z2 = cmds.fit_transform(euclidean_distances(X))

    assert_allclose(Z1, Z2)


def test_classical_mds_wrong_inputs():
    # Non-symmetric input
    dissim = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
    with pytest.raises(ValueError, match="Array must be symmetric"):
        ClassicalMDS(metric="precomputed").fit(dissim)

    # Non-square input
    dissim = np.array([[0, 1, 2], [3, 4, 5]])
    with pytest.raises(ValueError, match="array must be 2-dimensional and square"):
        ClassicalMDS(metric="precomputed").fit(dissim)


def test_classical_mds_metric_params():
    X, _ = load_iris(return_X_y=True)

    cmds = ClassicalMDS(n_components=2, metric="euclidean")
    Z1 = cmds.fit_transform(X)

    cmds = ClassicalMDS(n_components=2, metric="minkowski", metric_params={"p": 2})
    Z2 = cmds.fit_transform(X)

    assert_allclose(Z1, Z2)

    cmds = ClassicalMDS(n_components=2, metric="minkowski", metric_params={"p": 1})
    Z3 = cmds.fit_transform(X)

    assert not np.allclose(Z1, Z3)
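The first test asserts that classical MDS with Euclidean distances matches PCA scores up to sign. The reason is the identity noted earlier: the double-centered squared-distance matrix equals X_c·X_cᵀ, whose leading eigenvectors scaled by the square roots of the eigenvalues are exactly the PCA scores. A self-contained NumPy illustration of this equivalence, on made-up data rather than iris, so it does not depend on the new estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Xc = X - X.mean(axis=0)

# PCA scores from the SVD of the centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z_pca = U[:, :2] * s[:2]

# Classical MDS from the eigendecomposition of the centered Gram matrix
w, V = np.linalg.eigh(Xc @ Xc.T)
w, V = w[::-1][:2], V[:, ::-1][:, :2]
Z_mds = np.sqrt(w) * V

# Columns agree up to a sign flip per component
for k in range(2):
    if np.sign(Z_mds[0, k]) != np.sign(Z_pca[0, k]):
        Z_mds[:, k] *= -1
assert np.allclose(Z_mds, Z_pca)

# Eigenvalues of the Gram matrix are squared singular values,
# matching the eigenvalues_/singular_values_ assertion in the test.
assert np.allclose(np.sqrt(w), s[:2])
```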
Review comment: alphabetical order