Skip to content

Agglomerative clustering training error for seuclidean/mahalanobis affinity and single linkage #26961

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
albin02t opened this issue Aug 1, 2023 · 6 comments · May be fixed by #30528
Open

Comments

@albin02t
Copy link

albin02t commented Aug 1, 2023

Describe the bug

When trying Agglomerative clustering model training with the affinity as 'seuclidean' or 'mahalanobis' and the linkage as 'single' the training fails. The same affinity values along with other linkage such as 'average' options executes for model training. There's no specification given for this issue. Also, in the code I can see the handling for the single linkage is different and there is some cython code which is not accessible.

Steps/Code to Reproduce


from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
model = AgglomerativeClustering(affinity='mahalanobis', linkage='average')
data = load_iris(as_frame=True)['data']
model.fit(data)

Expected Results

No error should be thrown sig

Actual Results

image

Versions

System:
    python: 3.8.17 (default, Jul  5 2023, 21:04:15)  [GCC 11.2.0]
executable: /home/albint/miniconda3/envs/myenv/bin/python
   machine: Linux-5.15.0-78-generic-x86_64-with-glibc2.17

Python dependencies:
      sklearn: 1.3.0
          pip: 23.1.2
   setuptools: 67.8.0
        numpy: 1.24.4
        scipy: 1.10.1
       Cython: 3.0.0
       pandas: 2.0.3
   matplotlib: 3.7.2
       joblib: 1.3.1
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /home/albint/miniconda3/envs/myenv/lib/python3.8/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /home/albint/miniconda3/envs/myenv/lib/python3.8/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libgomp
       filepath: /home/albint/miniconda3/envs/myenv/lib/python3.8/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
@albin02t albin02t added Bug Needs Triage Issue requires triage labels Aug 1, 2023
@lmcinnes
Copy link
Contributor

lmcinnes commented Aug 3, 2023

The other linkage implementations use scipy.cluster.hierarchy.linkage which calls pdist which has default values (in the validator) for the extra args required for seuclidean and mahalanobis. For consistency the right thing to do would be to catch these special metrics and provide the required keywords. In the longer term it may be worth looking into the get_metric function in DistanceMetric to make this more consistent.

@Micky774
Copy link
Contributor

Micky774 commented Aug 11, 2023

So a few things in play here.

First, the documentation for affinity is outdated and incorrect (note that the keyword is also deprecated), whereas the documentation for metric more accurate but still not complete. What's missing is that when connectivity=None and linkage="single" (and affinity!="precomputed"), any valid pairwise distance metric (i.e. metric in sklearn.metrics.pairwise._VALID_METRICS) works.

Second, the support for general pairwise distance metrics can actually be extended to even when connectivity!=None and linkage!="single" if we ditch the use of paired_distances and use the underlying distance metric directly, though it may a be a bit tricky to ensure no performance regression. This indeed must happen once we deprecate paired_distances (#26982).

Third, we do not have any mechanism to forward metric keyword arguments to the underlying metric, so those like "mahalanobis" and "seuclidean" will not be usable until we fix this.

Each of these could probably use their own PR 😅

@Micky774 Micky774 removed the Needs Triage Issue requires triage label Aug 11, 2023
@zeny07spartan
Copy link

Hello ! I am new on GitHub and I would like to contribute.
I'll try to make a PR for the third problem pointed out by @Micky774

@Higgs32584
Copy link
Contributor

So I am just checking here, do we want to depreciate all affinity parameters in sklearn/cluster/_agglomerative.py? There are around 28 instances both in documentation and code of affinity, but only 3 mentions in the AgglomerativeClustering class.

there are also numerous test cases mentioning affinity in some private classes as well as public there

@Sean-Jay-M
Copy link
Contributor

Sean-Jay-M commented Sep 26, 2024

First, the documentation for affinity is outdated and incorrect (note that the keyword is also deprecated), whereas the documentation for metric more accurate but still not complete. What's missing is that when connectivity=None and linkage="single" (and affinity!="precomputed"), any valid pairwise distance metric (i.e. metric in sklearn.metrics.pairwise._VALID_METRICS) works.

It is my understanding that @Micky774 has split this problem into three distinct categories. I claim this. View my PR here:

I will continue with the other categories sequentially.

@SuccessMoses
Copy link
Contributor

Second, the support for general pairwise distance metrics can actually be extended to even when connectivity!=None and linkage!="single" if we ditch the use of paired_distances and use the underlying distance metric directly, though it may a be a bit tricky to ensure no performance regression. This indeed must happen once we deprecate paired_distances (#26982).

@Micky774 this sounds like it would required a PR of its own. Should you create an issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Moderate
Development

Successfully merging a pull request may close this issue.

7 participants