sklearn MDS vs skbio PCoA #15272

Open
maxibor opened this issue Oct 16, 2019 · 23 comments · May be fixed by #22330 or #31322
Labels
Enhancement, Moderate, module:manifold

Comments

@maxibor

maxibor commented Oct 16, 2019

Multidimensional scaling (MDS) and principal coordinate analysis (PCoA) are two names for the same dimensionality reduction technique*.

In scikit-learn, MDS is implemented with the SMACOF algorithm while in other Python libraries (such as scikit-bio) and most R packages offering it, it is implemented using singular value decomposition.
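For readers unfamiliar with SMACOF, here is a minimal NumPy sketch of the iterative idea, assuming unit weights and a random start. It is not scikit-learn's code; the names `pairwise_distances`, `raw_stress`, and `smacof_sketch` are hypothetical and exist only for this illustration.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def raw_stress(D, X):
    """Sum over i < j of (target distance - embedded distance) squared."""
    iu = np.triu_indices_from(D, k=1)
    return ((D[iu] - pairwise_distances(X)[iu]) ** 2).sum()


def smacof_sketch(D, n_components=2, n_iter=100, seed=0):
    """Reduce raw stress with repeated Guttman transforms (unit weights)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(size=(n, n_components))  # random starting configuration
    history = [raw_stress(D, X)]
    for _ in range(n_iter):
        dist = pairwise_distances(X)
        # ratio_ij = D_ij / dist_ij, set to 0 where embedded points coincide
        ratio = np.divide(D, dist, out=np.zeros_like(D), where=dist > 0)
        B = -ratio
        B[np.diag_indices(n)] = ratio.sum(axis=1)
        X = B @ X / n  # Guttman transform
        history.append(raw_stress(D, X))
    return X, history


# Toy input: Euclidean distances between 20 random 5-dimensional points.
rng = np.random.default_rng(42)
points = rng.normal(size=(20, 5))
D = pairwise_distances(points)
embedding, history = smacof_sketch(D)
```

Each iteration cannot increase the stress, but the result depends on the starting configuration, which is one reason the output differs from a decomposition-based implementation.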

This often confuses people who compare the sklearn implementation of MDS with other MDS implementations, because sklearn is the one that differs.

How could one add another implementation of MDS in sklearn? Or maybe create a new PCoA method?

cc @adrinjalali

@adrinjalali
Member

I'd be happy to have the other implementation as an alternative, and a parameter to set it as the algorithm.

@adrinjalali added the Enhancement, help wanted, and Moderate labels Oct 16, 2019
@amueller
Member

Related to #4485?

@amueller
Member

What's the relation between PCA and PCoA? They seem quite similar, but I'm not really familiar with MDS and have never heard the term PCoA.

@maxibor
Author

maxibor commented Oct 18, 2019

> What's the relation between PCA and PCoA? They seem quite similar but I'm not really familiar with MDS and have never heard the term PCoA

PCoA is essentially a PCA, but starting from a distance matrix (which doesn't need to be Euclidean).
Indeed, that seems to be mostly what #4485 is implementing.

The scikit-bio implementation is quite nice: _principal_coordinate_analysis.py
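To illustrate the "PCA starting from a distance matrix" point, here is a minimal NumPy sketch, not the scikit-bio or scikit-learn code. The function name `pcoa_sketch` is hypothetical. It double-centers the squared distances, takes the leading eigenvectors, and compares the result with PCA scores for the Euclidean case.

```python
import numpy as np


def pcoa_sketch(D, n_components=2):
    """Classical MDS / PCoA from a square distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)         # B is symmetric
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Non-Euclidean input can give negative eigenvalues; this sketch clips them.
    return eigvecs * np.sqrt(np.clip(eigvals, 0, None))


# Euclidean distances between random points.
rng = np.random.default_rng(0)
points = rng.normal(size=(30, 4))
diff = points[:, None, :] - points[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

coords = pcoa_sketch(D, n_components=2)

# PCA scores of the same points via SVD of the centered data.
centered = points - points.mean(axis=0)
U, S, _ = np.linalg.svd(centered, full_matrices=False)
pca_scores = U[:, :2] * S[:2]

# The two agree up to the sign of each axis.
same_up_to_sign = np.allclose(np.abs(coords), np.abs(pca_scores), atol=1e-6)
```

For a non-Euclidean distance matrix there is no underlying data matrix to run PCA on, so only the distance-based route applies.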

@rth
Member

rth commented Oct 18, 2019

> It seems indeed, that is mostly what #4485 is implementing.

@maxibor You would be very welcome to continue that PR.

@panpiort8
Contributor

Is anyone working on that? If not, I can take it.

@adrinjalali
Member

Seems like it's available, feel free to take it @panpiort8.

@panpiort8
Contributor

Ok, I'll take it (so I suppose the 'help wanted' label is now misleading).

@panpiort8
Contributor

Could someone review the above PR?

@adrinjalali
Member

We're mostly busy with the release at the moment. We should be able to focus on this more if you ping in 3 weeks ;)

@panpiort8
Contributor

Ok, I fully understand. I'll ping soon :)

@panpiort8
Contributor

gentle ping (;

@panpiort8
Contributor

Could someone review the above PR? (gentle ping no. 2)

@earmingol

> Could someone review the above PR? (gentle ping no. 2)

Has this finally been implemented? I read the documentation, but there is nothing about SVD yet :(

@panpiort8
Contributor

It's almost implemented (PR #16067), but it still needs quite a lot of work before it can be merged. Feel free to continue this PR.

@Micky774 linked a pull request on Jan 29, 2022 that will close this issue
@jolespin

jolespin commented Feb 3, 2022

Has there been any progress on this recently? Looking forward to decreasing my reliance on scikit-bio for PCoA.

@Micky774
Contributor

Micky774 commented Feb 8, 2022

@jolespin Current progress is in PR #22330, which provides an SVD implementation in the metric MDS context.

@dkobak
Contributor

dkobak commented Oct 23, 2024

There is some confusion in the discussion above.

PCoA is another name for "classical MDS". It takes a matrix of pairwise distances, performs some linear algebra and eigendecomposition, and gets the embedding. The loss function that is minimized is sometimes called "strain". If the input pairwise distances are Euclidean distances between some vectors, then this procedure is equivalent to PCA of those vectors. In contrast, "metric MDS" optimizes a DIFFERENT loss function called "stress", and can only be solved iteratively. These are two different optimization problems with two different solutions. There is also "non-metric MDS", which is yet another different thing.
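To make the distinction concrete, one common way to write the two objectives is sketched below. The symbols ($D^{(2)}$ for the matrix of squared input distances, $J$, $B$, $X$ for the centered $n \times k$ embedding, $d_{ij}$ for the input distances) are notation chosen here for illustration, and normalization conventions vary between references.

```latex
% Classical MDS (PCoA): double-center the squared distances, then fit inner products.
J = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top, \qquad
B = -\tfrac{1}{2}\, J D^{(2)} J, \qquad
\mathrm{Strain}(X) = \lVert B - X X^\top \rVert_F^2

% Metric MDS: fit the distances themselves.
\mathrm{Stress}(X) = \sum_{i<j} \bigl( d_{ij} - \lVert x_i - x_j \rVert \bigr)^2
```

When the leading eigenvalues of $B$ are non-negative, strain is minimized in closed form by the top $k$ eigenvectors of $B$ scaled by the square roots of their eigenvalues. Stress has no such closed-form minimizer, which is why it is handled iteratively.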

I fully agree that classical MDS aka PCoA should be implemented in sklearn. My question is: what is the best API for that?

Ideally I would prefer algorithm={"metric", "classical", "non-metric"}. But currently there is a metric={True, False} switch. So the least invasive option (that would not require a deprecation cycle) would be to add a separate parameter classical={True, False}.
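As a rough illustration of what the deprecation-cycle route could look like, here is a hypothetical sketch of mapping the legacy boolean onto a string-valued parameter. The helper `resolve_algorithm`, the `"deprecated"` sentinel, and the warning text are assumptions made for this example and are not part of scikit-learn.

```python
import warnings

# Hypothetical set of accepted values, taken from the proposal in this comment.
_VALID_ALGORITHMS = ("metric", "classical", "non-metric")


def resolve_algorithm(algorithm="metric", metric="deprecated"):
    """Hypothetical helper: translate a legacy boolean `metric` into `algorithm`."""
    if metric != "deprecated":
        # The caller still passes the old boolean switch: warn and translate it.
        warnings.warn(
            "`metric` is deprecated; use `algorithm` instead.",
            FutureWarning,
        )
        algorithm = "metric" if metric else "non-metric"
    if algorithm not in _VALID_ALGORITHMS:
        raise ValueError(
            f"algorithm must be one of {_VALID_ALGORITHMS}, got {algorithm!r}"
        )
    return algorithm


# Usage: new-style call, and a legacy call that triggers the warning.
new_style = resolve_algorithm(algorithm="classical")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = resolve_algorithm(metric=False)
```

The separate `classical={True, False}` flag avoids this translation step, at the cost of two booleans whose combinations have to be validated.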

What do people think?

@dkobak
Contributor

dkobak commented Nov 4, 2024

@amueller @adrinjalali What do you think about the optimal API (see my comment above)? Does classical={True, False} look like a reasonable solution? Or should we go with algorithm={"metric", "classical", "non-metric"} and gradually deprecate metric={True, False}?

@adrinjalali
Member

I think the algorithm path is the nicest API AFAICT. To move this forward we need reviewers for #22330.

@dd1735

dd1735 commented Nov 17, 2024

I would like to understand the specific stress function currently implemented in scikit-learn for both metric and non-metric MDS. Does the implementation follow the Kruskal stress function as described in the referenced papers in the documentation? If not, could you kindly clarify the formula used for each and any key differences?
I previously raised this in #30240 but was advised to add my thoughts here. Thank you for your patience!
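For context on the question, the two textbook quantities usually meant here can be sketched as follows. This is only an illustration of the standard definitions (raw stress and Kruskal's Stress-1), not a statement of what scikit-learn computes; the function names are hypothetical, and both take flat arrays of pairwise values for i < j.

```python
import numpy as np


def raw_stress(disparities, embedded):
    """Sum of squared differences between disparities and embedded distances."""
    disparities = np.asarray(disparities, dtype=float)
    embedded = np.asarray(embedded, dtype=float)
    return ((disparities - embedded) ** 2).sum()


def kruskal_stress1(disparities, embedded):
    """Kruskal's Stress-1: raw stress normalized by the squared embedded distances."""
    embedded = np.asarray(embedded, dtype=float)
    return np.sqrt(raw_stress(disparities, embedded) / (embedded ** 2).sum())


# Tiny worked example with three pairwise values.
target = [3.0, 4.0, 5.0]      # disparities (equal to the input distances in metric MDS)
embedded = [3.0, 4.0, 6.0]    # distances measured in the embedding
example_raw = raw_stress(target, embedded)
example_s1 = kruskal_stress1(target, embedded)
```

In metric MDS the disparities are the input dissimilarities themselves, while in non-metric MDS they come from a monotone regression of the embedded distances on the dissimilarities.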

@dkobak
Contributor

dkobak commented Nov 18, 2024

> I would like to understand the specific stress function currently implemented in scikit-learn for both metric and non-metric MDS.

I replied in more detail in your closed issue, but I think Wikipedia has a good overview:

@dkobak linked a pull request on May 6, 2025 that will close this issue
@dkobak
Contributor

dkobak commented May 6, 2025

I made a new PR to implement classical MDS aka PCoA: #31322.
