Partial Fix for Agglomerative clustering training error for seuclidean/mahalanobis affinity and single linkage #26961 #28076

zeny07spartan · 2024-01-07T18:15:18Z

#26961 Original issue https://github.com/scikit-learn/scikit-learn/issues/26961

What does this implement/fix? Explain your changes.

Following Mickey774 comments, there are now arguments V or VI in the AgglomerativeClustering class in order to forward a Variance matrix or a Covariance matrix used by Seuclidean or Mahalanobis metrics. Documentation was updated to deprecate the term "affinity" and only use the term "metric". If V=None or VI=None or nothing is passed as argument and metric='seuclidean' or metric='mahalanobis', there will be an auto computation of the Variance or Covariance matrix.

Any other comments?

Tested on Jupyterlab, PyTest passed.

Any comments or suggestions welcome :)

For Seuclidean and Mahalanobis metrics

github-actions · 2024-01-07T18:16:32Z

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here

`black`

black detected issues. Please run black . locally and push the changes. Here you can see the detected issues. Note that running black might also fix some of the issues which might be detected by ruff. Note that the installed black version is black=23.3.0.


--- /home/runner/work/scikit-learn/scikit-learn/sklearn/cluster/_agglomerative.py	2024-01-07 18:15:33.268071 +0000
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/cluster/_agglomerative.py	2024-01-07 18:15:43.395611 +0000
@@ -42,11 +42,11 @@
 
 ###############################################################################
 # For non fully-connected graphs
 
 
-def _fix_connectivity(X, connectivity, metric='euclidean'):
+def _fix_connectivity(X, connectivity, metric="euclidean"):
     """
     Fixes the connectivity matrix.
 
     The different steps are:
 
@@ -414,11 +414,11 @@
     n_clusters=None,
     linkage="complete",
     metric="euclidean",
     return_distance=False,
     V=None,
-    VI=None
+    VI=None,
 ):
     """Linkage agglomerative clustering based on a Feature matrix.
 
     The inertia matrix uses a Heapq-based representation.
 
@@ -475,11 +475,11 @@
 
     VI : array-like, default="None"
         Inverse of the symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
 
         .. versionadded:: 1.4
-        
+
     Returns
     -------
     children : ndarray of shape (n_nodes-1, 2)
         The children of each non-leaf node. Values less than `n_samples`
         correspond to leaves of the tree which are the original samples.
@@ -569,24 +569,24 @@
             and metric != "precomputed"
             and not callable(metric)
             and metric in METRIC_MAPPING64
         ):
             if metric == "seuclidean":
-                if type(V)!='numpy.ndarray':
-                    V = np.var(X,axis=0)
-                dist_metric = DistanceMetric.get_metric(metric,V=V)
+                if type(V) != "numpy.ndarray":
+                    V = np.var(X, axis=0)
+                dist_metric = DistanceMetric.get_metric(metric, V=V)
             elif metric == "mahalanobis":
-                if type(V)=='numpy.ndarray':
-                    dist_metric = DistanceMetric.get_metric(metric,V=V)
-                elif type(VI)=='numpy.ndarray':
-                    dist_metric = DistanceMetric.get_metric(metric,VI=VI)
+                if type(V) == "numpy.ndarray":
+                    dist_metric = DistanceMetric.get_metric(metric, V=V)
+                elif type(VI) == "numpy.ndarray":
+                    dist_metric = DistanceMetric.get_metric(metric, VI=VI)
                 else:
                     V = np.cov(X)
-                    dist_metric = DistanceMetric.get_metric(metric,V=V)
-                
+                    dist_metric = DistanceMetric.get_metric(metric, V=V)
+
             else:
-            # We need the fast cythonized metric from neighbors
+                # We need the fast cythonized metric from neighbors
                 dist_metric = DistanceMetric.get_metric(metric)
 
             # The Cython routines used require contiguous arrays
             X = np.ascontiguousarray(X, dtype=np.double)
 
@@ -952,12 +952,12 @@
         "connectivity": ["array-like", callable, None],
         "compute_full_tree": [StrOptions({"auto"}), "boolean"],
         "linkage": [StrOptions(set(_TREE_BUILDERS.keys()))],
         "distance_threshold": [Interval(Real, 0, None, closed="left"), None],
         "compute_distances": ["boolean"],
-        "V":["array-like",None],
-        "VI":["array-like",None],
+        "V": ["array-like", None],
+        "VI": ["array-like", None],
     }
 
     def __init__(
         self,
         n_clusters=2,
@@ -969,12 +969,10 @@
         connectivity=None,
         compute_full_tree="auto",
         linkage="ward",
         distance_threshold=None,
         compute_distances=False,
-        
-
     ):
         self.n_clusters = n_clusters
         self.distance_threshold = distance_threshold
         self.memory = memory
         self.connectivity = connectivity
@@ -1088,14 +1086,13 @@
             kwargs["metric"] = self._metric
 
         if self.metric == "mahalanobis":
             kwargs["V"] = self.V
             kwargs["VI"] = self.VI
-            
+
         if self.metric == "seuclidean":
             kwargs["V"] = self.V
-            
 
         distance_threshold = self.distance_threshold
 
         return_distance = (distance_threshold is not None) or self.compute_distances
 
@@ -1280,11 +1277,11 @@
 
     V : array-like, default="None"
         Variance matrix for Euclidean distance or symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
         .. versionadded:: 1.4
 
-    VI : array-like, default="None" 
+    VI : array-like, default="None"
         Inverse of the symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
         .. versionadded:: 1.4
 
     See Also
     --------
@@ -1319,12 +1316,12 @@
         "compute_full_tree": [StrOptions({"auto"}), "boolean"],
         "linkage": [StrOptions(set(_TREE_BUILDERS.keys()))],
         "pooling_func": [callable],
         "distance_threshold": [Interval(Real, 0, None, closed="left"), None],
         "compute_distances": ["boolean"],
-        "V":["array-like",None],
-        "VI":["array-like",None],
+        "V": ["array-like", None],
+        "VI": ["array-like", None],
     }
 
     def __init__(
         self,
         n_clusters=2,
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/cluster/_agglomerative.py

Oh no! 💥 💔 💥
1 file would be reformatted, 906 files would be left unchanged.

`ruff`

ruff detected issues. Please run ruff --fix --show-source . locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.1.11.


sklearn/cluster/_agglomerative.py:472:89: E501 Line too long (133 > 88)
    |
471 |     V : array-like, default="None"
472 |         Variance matrix for Euclidean distance or symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
473 | 
474 |         .. versionadded:: 1.4
    |

sklearn/cluster/_agglomerative.py:477:89: E501 Line too long (106 > 88)
    |
476 |     VI : array-like, default="None"
477 |         Inverse of the symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
    |                                                                                         ^^^^^^^^^^^^^^^^^^ E501
478 | 
479 |         .. versionadded:: 1.4
    |

sklearn/cluster/_agglomerative.py:480:1: W293 [*] Blank line contains whitespace
    |
479 |         .. versionadded:: 1.4
480 |         
    | ^^^^^^^^ W293
481 |     Returns
482 |     -------
    |
    = help: Remove whitespace from blank line

sklearn/cluster/_agglomerative.py:585:1: W293 [*] Blank line contains whitespace
    |
583 |                     V = np.cov(X)
584 |                     dist_metric = DistanceMetric.get_metric(metric,V=V)
585 |                 
    | ^^^^^^^^^^^^^^^^ W293
586 |             else:
587 |             # We need the fast cythonized metric from neighbors
    |
    = help: Remove whitespace from blank line

sklearn/cluster/_agglomerative.py:811:89: E501 Line too long (133 > 88)
    |
810 |     V : array-like, default="None"
811 |         Variance matrix for Euclidean distance or symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
812 |         .. versionadded:: 1.4
    |

sklearn/cluster/_agglomerative.py:815:89: E501 Line too long (106 > 88)
    |
814 |     VI : array-like, default="None"
815 |         Inverse of the symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
    |                                                                                         ^^^^^^^^^^^^^^^^^^ E501
816 |         .. versionadded:: 1.4
    |

sklearn/cluster/_agglomerative.py:974:1: W293 [*] Blank line contains whitespace
    |
972 |         distance_threshold=None,
973 |         compute_distances=False,
974 |         
    | ^^^^^^^^ W293
975 | 
976 |     ):
    |
    = help: Remove whitespace from blank line

sklearn/cluster/_agglomerative.py:1093:1: W293 [*] Blank line contains whitespace
     |
1091 |             kwargs["V"] = self.V
1092 |             kwargs["VI"] = self.VI
1093 |             
     | ^^^^^^^^^^^^ W293
1094 |         if self.metric == "seuclidean":
1095 |             kwargs["V"] = self.V
     |
     = help: Remove whitespace from blank line

sklearn/cluster/_agglomerative.py:1096:1: W293 [*] Blank line contains whitespace
     |
1094 |         if self.metric == "seuclidean":
1095 |             kwargs["V"] = self.V
1096 |             
     | ^^^^^^^^^^^^ W293
1097 | 
1098 |         distance_threshold = self.distance_threshold
     |
     = help: Remove whitespace from blank line

sklearn/cluster/_agglomerative.py:1282:89: E501 Line too long (133 > 88)
     |
1281 |     V : array-like, default="None"
1282 |         Variance matrix for Euclidean distance or symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
     |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
1283 |         .. versionadded:: 1.4
     |

sklearn/cluster/_agglomerative.py:1285:36: W291 [*] Trailing whitespace
     |
1283 |         .. versionadded:: 1.4
1284 | 
1285 |     VI : array-like, default="None" 
     |                                    ^ W291
1286 |         Inverse of the symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
1287 |         .. versionadded:: 1.4
     |
     = help: Remove trailing whitespace

sklearn/cluster/_agglomerative.py:1286:89: E501 Line too long (106 > 88)
     |
1285 |     VI : array-like, default="None" 
1286 |         Inverse of the symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
     |                                                                                         ^^^^^^^^^^^^^^^^^^ E501
1287 |         .. versionadded:: 1.4
     |

Found 12 errors.
[*] 6 fixable with the `--fix` option.

_{Generated for commit: 3434826. Link to the linter CI: here}

adrinjalali · 2024-04-23T14:40:24Z

@zeny07spartan are you still working on this? You've got some linter issues.

zeny07spartan · 2024-04-24T17:40:11Z

Hello @adrinjalali, unfortunately I am not working on this issue anymore as it was part of a school project that ended.

zeny07spartan added 2 commits January 7, 2024 19:03

Documentation and Variance / Covariance arguments

f5ec5a4

For Seuclidean and Mahalanobis metrics

Merge branch 'scikit-learn:main' into main

3434826

github-actions bot added the module:cluster label Jan 7, 2024

adrinjalali closed this Apr 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Partial Fix for Agglomerative clustering training error for seuclidean/mahalanobis affinity and single linkage #26961 #28076

Partial Fix for Agglomerative clustering training error for seuclidean/mahalanobis affinity and single linkage #26961 #28076

zeny07spartan commented Jan 7, 2024 •

edited

Loading

github-actions bot commented Jan 7, 2024

adrinjalali commented Apr 23, 2024

zeny07spartan commented Apr 24, 2024

Partial Fix for Agglomerative clustering training error for seuclidean/mahalanobis affinity and single linkage #26961 #28076

Partial Fix for Agglomerative clustering training error for seuclidean/mahalanobis affinity and single linkage #26961 #28076

Conversation

zeny07spartan commented Jan 7, 2024 • edited Loading

What does this implement/fix? Explain your changes.

Any other comments?

github-actions bot commented Jan 7, 2024

❌ Linting issues

black

ruff

adrinjalali commented Apr 23, 2024

zeny07spartan commented Apr 24, 2024

zeny07spartan commented Jan 7, 2024 •

edited

Loading

`black`

`ruff`