Skip to content

Partial Fix for Agglomerative clustering training error for seuclidean/mahalanobis affinity and single linkage #26961 #28076

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

zeny07spartan
Copy link

@zeny07spartan zeny07spartan commented Jan 7, 2024

#26961 Original issue https://github.com/scikit-learn/scikit-learn/issues/26961

What does this implement/fix? Explain your changes.

Following Mickey774 comments, there are now arguments V or VI in the AgglomerativeClustering class in order to forward a Variance matrix or a Covariance matrix used by Seuclidean or Mahalanobis metrics. Documentation was updated to deprecate the term "affinity" and only use the term "metric". If V=None or VI=None or nothing is passed as argument and metric='seuclidean' or metric='mahalanobis', there will be an auto computation of the Variance or Covariance matrix.

Any other comments?

Tested on Jupyterlab, PyTest passed.

Any comments or suggestions welcome :)

Copy link

github-actions bot commented Jan 7, 2024

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


black

black detected issues. Please run black . locally and push the changes. Here you can see the detected issues. Note that running black might also fix some of the issues which might be detected by ruff. Note that the installed black version is black=23.3.0.


--- /home/runner/work/scikit-learn/scikit-learn/sklearn/cluster/_agglomerative.py	2024-01-07 18:15:33.268071 +0000
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/cluster/_agglomerative.py	2024-01-07 18:15:43.395611 +0000
@@ -42,11 +42,11 @@
 
 ###############################################################################
 # For non fully-connected graphs
 
 
-def _fix_connectivity(X, connectivity, metric='euclidean'):
+def _fix_connectivity(X, connectivity, metric="euclidean"):
     """
     Fixes the connectivity matrix.
 
     The different steps are:
 
@@ -414,11 +414,11 @@
     n_clusters=None,
     linkage="complete",
     metric="euclidean",
     return_distance=False,
     V=None,
-    VI=None
+    VI=None,
 ):
     """Linkage agglomerative clustering based on a Feature matrix.
 
     The inertia matrix uses a Heapq-based representation.
 
@@ -475,11 +475,11 @@
 
     VI : array-like, default="None"
         Inverse of the symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
 
         .. versionadded:: 1.4
-        
+
     Returns
     -------
     children : ndarray of shape (n_nodes-1, 2)
         The children of each non-leaf node. Values less than `n_samples`
         correspond to leaves of the tree which are the original samples.
@@ -569,24 +569,24 @@
             and metric != "precomputed"
             and not callable(metric)
             and metric in METRIC_MAPPING64
         ):
             if metric == "seuclidean":
-                if type(V)!='numpy.ndarray':
-                    V = np.var(X,axis=0)
-                dist_metric = DistanceMetric.get_metric(metric,V=V)
+                if type(V) != "numpy.ndarray":
+                    V = np.var(X, axis=0)
+                dist_metric = DistanceMetric.get_metric(metric, V=V)
             elif metric == "mahalanobis":
-                if type(V)=='numpy.ndarray':
-                    dist_metric = DistanceMetric.get_metric(metric,V=V)
-                elif type(VI)=='numpy.ndarray':
-                    dist_metric = DistanceMetric.get_metric(metric,VI=VI)
+                if type(V) == "numpy.ndarray":
+                    dist_metric = DistanceMetric.get_metric(metric, V=V)
+                elif type(VI) == "numpy.ndarray":
+                    dist_metric = DistanceMetric.get_metric(metric, VI=VI)
                 else:
                     V = np.cov(X)
-                    dist_metric = DistanceMetric.get_metric(metric,V=V)
-                
+                    dist_metric = DistanceMetric.get_metric(metric, V=V)
+
             else:
-            # We need the fast cythonized metric from neighbors
+                # We need the fast cythonized metric from neighbors
                 dist_metric = DistanceMetric.get_metric(metric)
 
             # The Cython routines used require contiguous arrays
             X = np.ascontiguousarray(X, dtype=np.double)
 
@@ -952,12 +952,12 @@
         "connectivity": ["array-like", callable, None],
         "compute_full_tree": [StrOptions({"auto"}), "boolean"],
         "linkage": [StrOptions(set(_TREE_BUILDERS.keys()))],
         "distance_threshold": [Interval(Real, 0, None, closed="left"), None],
         "compute_distances": ["boolean"],
-        "V":["array-like",None],
-        "VI":["array-like",None],
+        "V": ["array-like", None],
+        "VI": ["array-like", None],
     }
 
     def __init__(
         self,
         n_clusters=2,
@@ -969,12 +969,10 @@
         connectivity=None,
         compute_full_tree="auto",
         linkage="ward",
         distance_threshold=None,
         compute_distances=False,
-        
-
     ):
         self.n_clusters = n_clusters
         self.distance_threshold = distance_threshold
         self.memory = memory
         self.connectivity = connectivity
@@ -1088,14 +1086,13 @@
             kwargs["metric"] = self._metric
 
         if self.metric == "mahalanobis":
             kwargs["V"] = self.V
             kwargs["VI"] = self.VI
-            
+
         if self.metric == "seuclidean":
             kwargs["V"] = self.V
-            
 
         distance_threshold = self.distance_threshold
 
         return_distance = (distance_threshold is not None) or self.compute_distances
 
@@ -1280,11 +1277,11 @@
 
     V : array-like, default="None"
         Variance matrix for Euclidean distance or symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
         .. versionadded:: 1.4
 
-    VI : array-like, default="None" 
+    VI : array-like, default="None"
         Inverse of the symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
         .. versionadded:: 1.4
 
     See Also
     --------
@@ -1319,12 +1316,12 @@
         "compute_full_tree": [StrOptions({"auto"}), "boolean"],
         "linkage": [StrOptions(set(_TREE_BUILDERS.keys()))],
         "pooling_func": [callable],
         "distance_threshold": [Interval(Real, 0, None, closed="left"), None],
         "compute_distances": ["boolean"],
-        "V":["array-like",None],
-        "VI":["array-like",None],
+        "V": ["array-like", None],
+        "VI": ["array-like", None],
     }
 
     def __init__(
         self,
         n_clusters=2,
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/cluster/_agglomerative.py

Oh no! 💥 💔 💥
1 file would be reformatted, 906 files would be left unchanged.

ruff

ruff detected issues. Please run ruff --fix --show-source . locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.1.11.


sklearn/cluster/_agglomerative.py:472:89: E501 Line too long (133 > 88)
    |
471 |     V : array-like, default="None"
472 |         Variance matrix for Euclidean distance or symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
473 | 
474 |         .. versionadded:: 1.4
    |

sklearn/cluster/_agglomerative.py:477:89: E501 Line too long (106 > 88)
    |
476 |     VI : array-like, default="None"
477 |         Inverse of the symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
    |                                                                                         ^^^^^^^^^^^^^^^^^^ E501
478 | 
479 |         .. versionadded:: 1.4
    |

sklearn/cluster/_agglomerative.py:480:1: W293 [*] Blank line contains whitespace
    |
479 |         .. versionadded:: 1.4
480 |         
    | ^^^^^^^^ W293
481 |     Returns
482 |     -------
    |
    = help: Remove whitespace from blank line

sklearn/cluster/_agglomerative.py:585:1: W293 [*] Blank line contains whitespace
    |
583 |                     V = np.cov(X)
584 |                     dist_metric = DistanceMetric.get_metric(metric,V=V)
585 |                 
    | ^^^^^^^^^^^^^^^^ W293
586 |             else:
587 |             # We need the fast cythonized metric from neighbors
    |
    = help: Remove whitespace from blank line

sklearn/cluster/_agglomerative.py:811:89: E501 Line too long (133 > 88)
    |
810 |     V : array-like, default="None"
811 |         Variance matrix for Euclidean distance or symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
812 |         .. versionadded:: 1.4
    |

sklearn/cluster/_agglomerative.py:815:89: E501 Line too long (106 > 88)
    |
814 |     VI : array-like, default="None"
815 |         Inverse of the symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
    |                                                                                         ^^^^^^^^^^^^^^^^^^ E501
816 |         .. versionadded:: 1.4
    |

sklearn/cluster/_agglomerative.py:974:1: W293 [*] Blank line contains whitespace
    |
972 |         distance_threshold=None,
973 |         compute_distances=False,
974 |         
    | ^^^^^^^^ W293
975 | 
976 |     ):
    |
    = help: Remove whitespace from blank line

sklearn/cluster/_agglomerative.py:1093:1: W293 [*] Blank line contains whitespace
     |
1091 |             kwargs["V"] = self.V
1092 |             kwargs["VI"] = self.VI
1093 |             
     | ^^^^^^^^^^^^ W293
1094 |         if self.metric == "seuclidean":
1095 |             kwargs["V"] = self.V
     |
     = help: Remove whitespace from blank line

sklearn/cluster/_agglomerative.py:1096:1: W293 [*] Blank line contains whitespace
     |
1094 |         if self.metric == "seuclidean":
1095 |             kwargs["V"] = self.V
1096 |             
     | ^^^^^^^^^^^^ W293
1097 | 
1098 |         distance_threshold = self.distance_threshold
     |
     = help: Remove whitespace from blank line

sklearn/cluster/_agglomerative.py:1282:89: E501 Line too long (133 > 88)
     |
1281 |     V : array-like, default="None"
1282 |         Variance matrix for Euclidean distance or symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
     |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
1283 |         .. versionadded:: 1.4
     |

sklearn/cluster/_agglomerative.py:1285:36: W291 [*] Trailing whitespace
     |
1283 |         .. versionadded:: 1.4
1284 | 
1285 |     VI : array-like, default="None" 
     |                                    ^ W291
1286 |         Inverse of the symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
1287 |         .. versionadded:: 1.4
     |
     = help: Remove trailing whitespace

sklearn/cluster/_agglomerative.py:1286:89: E501 Line too long (106 > 88)
     |
1285 |     VI : array-like, default="None" 
1286 |         Inverse of the symmetric positive-definite covariance matrix of the data for Mahalanobis distance.
     |                                                                                         ^^^^^^^^^^^^^^^^^^ E501
1287 |         .. versionadded:: 1.4
     |

Found 12 errors.
[*] 6 fixable with the `--fix` option.

Generated for commit: 3434826. Link to the linter CI: here

@adrinjalali
Copy link
Member

@zeny07spartan are you still working on this? You've got some linter issues.

@zeny07spartan
Copy link
Author

Hello @adrinjalali, unfortunately I am not working on this issue anymore as it was part of a school project that ended.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants