Skip to content

FEAT Support precomputed distance matrix for PairwiseDistancesReductions #29483

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 80 commits into
base: main
Choose a base branch
from

Conversation

kyrajeep
Copy link

@kyrajeep kyrajeep commented Jul 15, 2024

Towards #25888

  1. What does this implement/fix? Explain your changes.

(1) To take in precomputed distance directly, uses 'compute' in {ArgKmin, RadiusNeighbors}{ClassMode} {32,64}.
(2) When this is received after a XOR check (X, Y or precomputed) initializes an instance of class PrecomputedDistance.
(3) This data is formatted (wrapped?) by datasets_pair.py
(4) It is passed onto the backend by using the get_for() factory method.

  1. Any other comments?
    Thank you to @jjerphan for the reviews!

Co-authors: @jjerphan @adam2392

Copy link

github-actions bot commented Jul 15, 2024

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


black

black detected issues. Please run black . locally and push the changes. Here you can see the detected issues. Note that running black might also fix some of the issues which might be detected by ruff. Note that the installed black version is black=24.3.0.


--- /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/tests/test_pairwise_distances_reduction.py	2025-03-10 15:07:57.108550+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/tests/test_pairwise_distances_reduction.py	2025-03-10 15:08:15.517138+00:00
@@ -37,10 +37,11 @@
     "euclidean",
     "minkowski",
     "seuclidean",
 ]
 
+
 def _get_metric_params_list(metric: str, n_features: int, seed: int = 1):
     """Return list of dummy DistanceMetric kwargs for tests."""
 
     # Distinguishing on cases not to compute unneeded datastructures.
     rng = np.random.RandomState(seed)
@@ -60,10 +61,11 @@
         return [dict(V=rng.rand(n_features))]
 
     # Case of: "euclidean", "manhattan", "chebyshev", "haversine" or any other metric.
     # In those cases, no kwargs is needed.
     return [{}]
+
 
 def assert_same_distances_for_common_neighbors(
     query_idx,
     dist_row_a,
     dist_row_b,
@@ -98,10 +100,11 @@
                 f" for common neighbor with index {idx}:"
                 f" dist_a={dist_a} vs dist_b={dist_b} (with atol={atol} and"
                 f" rtol={rtol})"
             ) from e
 
+
 def assert_precomputed(precomputed, n_samples_X, n_samples_Y):
     """
     Validates a precomputed matrix for compatibility.
 
     Parameters:
@@ -116,11 +119,13 @@
     if not isinstance(precomputed, np.ndarray):
         raise AssertionError("Input must be a numpy array.")
 
     # Check if the array has the correct data type
     if precomputed.dtype not in [np.float32, np.float64]:
-        raise AssertionError("Precomputed matrix must be of type float (float32 or float64).")
+        raise AssertionError(
+            "Precomputed matrix must be of type float (float32 or float64)."
+        )
 
     # Check if the array is empty
     if precomputed.size == 0:
         raise AssertionError("Precomputed matrix should not be empty.")
 
@@ -132,16 +137,12 @@
             f"Expected: {expected_shape}, Got: {precomputed.shape}."
         )
 
 
 def assert_no_missing_neighbors(
-    query_idx,
-    dist_row_a,
-    dist_row_b,
-    indices_row_a,
-    indices_row_b,
-    threshold):
+    query_idx, dist_row_a, dist_row_b, indices_row_a, indices_row_b, threshold
+):
     """Compare the indices of neighbors in two results sets.
 
     Any neighbor index with a distance below the precision threshold should
     match one in the other result set. We ignore the last few neighbors beyond
     the threshold as those can typically be missing due to rounding errors.
@@ -267,10 +268,11 @@
         sampled_dists = pairwise_distances(X, Y, metric=metric, **metric_kwargs)
     else:
         sampled_dists = precomputed_dists[:n_subsampled_queries].copy()
     sampled_dists.sort(axis=1)
     return sampled_dists[:, expected_n_neighbors].mean()
+
 
 def assert_compatible_radius_results(
     neighbors_dists_a,
     neighbors_dists_b,
     neighbors_indices_a,
@@ -373,70 +375,85 @@
         RadiusNeighbors,
         np.float32,
     ): partial(assert_compatible_radius_results, **FLOAT32_TOLS),
 }
 
+
 @pytest.mark.parametrize("cls", [ArgKmin, RadiusNeighbors])
 def test_precompute_all_inputs_none(cls):
     """Test that ValueError is raised when all inputs are None."""
-    with pytest.raises(ValueError, match="Either X and Y or precomputed_matrix must be provided."):
+    with pytest.raises(
+        ValueError, match="Either X and Y or precomputed_matrix must be provided."
+    ):
         cls.compute(X=None, Y=None, precomputed_matrix=None)
+
 
 @pytest.mark.parametrize("cls", [ArgKmin, RadiusNeighbors])
 def test_precompute_all_inputs_provided(cls):
     """Test that ValueError is raised when both X/Y and precomputed_matrix are provided."""
     X = np.random.rand(10, 5)
     Y = np.random.rand(10, 5)
     precomputed_matrix = np.random.rand(10, 10)
-    with pytest.raises(ValueError, match="Only one of X and Y or precomputed_matrix must be provided."):
+    with pytest.raises(
+        ValueError, match="Only one of X and Y or precomputed_matrix must be provided."
+    ):
         cls.compute(X=X, Y=Y, precomputed_matrix=precomputed_matrix)
+
 
 @pytest.mark.parametrize("cls", [ArgKmin, RadiusNeighbors])
 def test_precompute_only_y(cls):
     """Test that ValueError is raised when only Y is provided."""
     Y = np.random.rand(10, 5)
     with pytest.raises(ValueError, match="Y should not be provided without X."):
         cls.compute(X=None, Y=Y)
 
+
 @pytest.mark.parametrize("cls", [ArgKmin, RadiusNeighbors])
 def test_precompute_only_x(cls):
     """Test that ValueError is raised when only X is provided."""
     X = np.random.rand(10, 5)
     with pytest.raises(ValueError, match="X should not be provided without Y."):
         cls.compute(X=X, Y=None)
 
+
 def test_assert_precomputed():
     # Success Case: Valid precomputed matrix
     n_samples_X, n_samples_Y = 5, 5
-    
+
     # Failure Case: Not a numpy array
     with pytest.raises(AssertionError, match="Input must be a numpy array"):
         assert_precomputed([[1, 2], [3, 4]], n_samples_X, n_samples_Y)
 
     # Failure Case: Incorrect dtype
     invalid_dtype = np.random.randint(0, 10, (n_samples_X, n_samples_Y))
-    with pytest.raises(AssertionError, match="Precomputed matrix must be of type float"):
+    with pytest.raises(
+        AssertionError, match="Precomputed matrix must be of type float"
+    ):
         assert_precomputed(invalid_dtype, n_samples_X, n_samples_Y)
 
     # Failure Case: Empty array
     with pytest.raises(AssertionError, match="Precomputed matrix should not be empty"):
         assert_precomputed(np.array([]), n_samples_X, n_samples_Y)
 
     # Failure Case: Incorrect dimensions
     incorrect_shape = np.random.rand(n_samples_X, n_samples_X).astype(np.float32)
-    with pytest.raises(AssertionError, match="Incorrect dimensions for precomputed matrix"):
+    with pytest.raises(
+        AssertionError, match="Incorrect dimensions for precomputed matrix"
+    ):
         assert_precomputed(incorrect_shape, n_samples_X, n_samples_Y)
+
 
 def test_my_function_precomputed():
     X = np.array([[1, 2], [3, 4], [5, 6]])  # Small sample data
     Y = np.array([[7, 8], [9, 10]])
     D = pairwise_distances(X, Y)  # Compute distances ONCE
 
-    result_precomputed = pairwise_distances(D, metric='precomputed')
+    result_precomputed = pairwise_distances(D, metric="precomputed")
     result_computed = pairwise_distances(X, Y)
 
     np.testing.assert_allclose(result_precomputed, result_computed)
+
 
 def test_assert_compatible_argkmin_results():
     atol = 1e-7
     rtol = 0.0
     tols = dict(atol=atol, rtol=rtol)
@@ -571,11 +588,12 @@
             np.array([[2.5, 1.2, _6_1m, 6.1, _6_1p]]),
             np.array([[1, 2, 3, 4, 5]]),
             np.array([[2, 1, 4, 5, 3]]),
             **tols,
         )
-        
+
+
 @pytest.mark.parametrize("check_sorted", [True, False])
 def test_assert_compatible_radius_results(check_sorted):
     atol = 1e-7
     rtol = 0.0
     tols = dict(atol=atol, rtol=rtol)
@@ -1707,11 +1725,11 @@
     unique_Y_labels = np.unique(Y_labels)
     results_X = RadiusNeighborsClassMode.compute(
         X=X,
         Y=Y,
         radius=radius,
-        metric=metric,     
+        metric=metric,
         weights=weights,
         Y_labels=Y_labels,
         unique_Y_labels=unique_Y_labels,
         outlier_label=outlier_label,
         strategy="parallel_on_X",
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/tests/test_pairwise_distances_reduction.py

Oh no! 💥 💔 💥
1 file would be reformatted, 920 files would be left unchanged.

ruff

ruff detected issues. Please run ruff check --fix --output-format=full . locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.5.1.


sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py:119:12: E712 Avoid equality comparisons to `False`; use `if not is_usable:` for false checks
    |
118 |         # is_usable = (X is not None and Y is not None) ^ bool(precomputed)
119 |         if is_usable == False:
    |            ^^^^^^^^^^^^^^^^^^ E712
120 |             return is_usable
121 |         # FIXME: the current Cython implementation is too slow for a large number of
    |
    = help: Replace with `not is_usable`

sklearn/metrics/tests/test_pairwise_distances_reduction.py:121:89: E501 Line too long (94 > 88)
    |
119 |     # Check if the array has the correct data type
120 |     if precomputed.dtype not in [np.float32, np.float64]:
121 |         raise AssertionError("Precomputed matrix must be of type float (float32 or float64).")
    |                                                                                         ^^^^^^ E501
122 | 
123 |     # Check if the array is empty
    |

sklearn/metrics/tests/test_pairwise_distances_reduction.py:381:89: E501 Line too long (99 > 88)
    |
379 | def test_precompute_all_inputs_none(cls):
380 |     """Test that ValueError is raised when all inputs are None."""
381 |     with pytest.raises(ValueError, match="Either X and Y or precomputed_matrix must be provided."):
    |                                                                                         ^^^^^^^^^^^ E501
382 |         cls.compute(X=None, Y=None, precomputed_matrix=None)
    |

sklearn/metrics/tests/test_pairwise_distances_reduction.py:386:89: E501 Line too long (91 > 88)
    |
384 | @pytest.mark.parametrize("cls", [ArgKmin, RadiusNeighbors])
385 | def test_precompute_all_inputs_provided(cls):
386 |     """Test that ValueError is raised when both X/Y and precomputed_matrix are provided."""
    |                                                                                         ^^^ E501
387 |     X = np.random.rand(10, 5)
388 |     Y = np.random.rand(10, 5)
    |

sklearn/metrics/tests/test_pairwise_distances_reduction.py:390:89: E501 Line too long (104 > 88)
    |
388 |     Y = np.random.rand(10, 5)
389 |     precomputed_matrix = np.random.rand(10, 10)
390 |     with pytest.raises(ValueError, match="Only one of X and Y or precomputed_matrix must be provided."):
    |                                                                                         ^^^^^^^^^^^^^^^^ E501
391 |         cls.compute(X=X, Y=Y, precomputed_matrix=precomputed_matrix)
    |

sklearn/metrics/tests/test_pairwise_distances_reduction.py:410:1: W293 [*] Blank line contains whitespace
    |
408 |     # Success Case: Valid precomputed matrix
409 |     n_samples_X, n_samples_Y = 5, 5
410 |     
    | ^^^^ W293
411 |     # Failure Case: Not a numpy array
412 |     with pytest.raises(AssertionError, match="Input must be a numpy array"):
    |
    = help: Remove whitespace from blank line

sklearn/metrics/tests/test_pairwise_distances_reduction.py:417:89: E501 Line too long (89 > 88)
    |
415 |     # Failure Case: Incorrect dtype
416 |     invalid_dtype = np.random.randint(0, 10, (n_samples_X, n_samples_Y))
417 |     with pytest.raises(AssertionError, match="Precomputed matrix must be of type float"):
    |                                                                                         ^ E501
418 |         assert_precomputed(invalid_dtype, n_samples_X, n_samples_Y)
    |

sklearn/metrics/tests/test_pairwise_distances_reduction.py:426:89: E501 Line too long (92 > 88)
    |
424 |     # Failure Case: Incorrect dimensions
425 |     incorrect_shape = np.random.rand(n_samples_X, n_samples_X).astype(np.float32)
426 |     with pytest.raises(AssertionError, match="Incorrect dimensions for precomputed matrix"):
    |                                                                                         ^^^^ E501
427 |         assert_precomputed(incorrect_shape, n_samples_X, n_samples_Y)
    |

sklearn/metrics/tests/test_pairwise_distances_reduction.py:576:1: W293 [*] Blank line contains whitespace
    |
574 |             **tols,
575 |         )
576 |         
    | ^^^^^^^^ W293
577 | @pytest.mark.parametrize("check_sorted", [True, False])
578 | def test_assert_compatible_radius_results(check_sorted):
    |
    = help: Remove whitespace from blank line

sklearn/metrics/tests/test_pairwise_distances_reduction.py:1712:23: W291 [*] Trailing whitespace
     |
1710 |         Y=Y,
1711 |         radius=radius,
1712 |         metric=metric,     
     |                       ^^^^^ W291
1713 |         weights=weights,
1714 |         Y_labels=Y_labels,
     |
     = help: Remove trailing whitespace

Found 10 errors.
[*] 3 fixable with the `--fix` option (1 hidden fix can be enabled with the `--unsafe-fixes` option).

Generated for commit: 22b8c02. Link to the linter CI: here

@Micky774
Copy link
Contributor

Also, in the abstract methods dist and surrogate_dist of PrecomputedDistanceMatrix, the indices [i,j] are required. Does this mean that other indices in the same dataset need computation and are not precomputed?

No. Keep in mind that surrogate_dist and dist methods are for scalar distances, i.e. they take a pair of indices and produce the distance between exactly those two data vectors. Hence, for the precomputed distances, it's simply indexing into the precomputed distances array.

@kyrajeep
Copy link
Author

Thank you for the clarification

@kyrajeep
Copy link
Author

In class PrecomputedDistanceMatrix, I don't see the option to input the precomputed distance matrix. Should I make that option or is it stored somewhere else?

@jjerphan
Copy link
Member

jjerphan commented Jul 15, 2024

Hi! I don't have much time and energy to have a look at it in details but basically:

  • a reference to the distance matrix needs to be passed to the compute methods of the dispatchers (which subclass BaseDistancesReductionDispatcher)
  • this reference needs to be passed then to the constructors of the Cython implementations (which subclass BaseDistancesReduction{{name_suffix}}) similarly to X and Y and be stored in an instance of a potentially new DatasetsPair subclass.
  • the logic for computing chunks size and other pieces of information in BaseDistancesReduction{{name_suffix}}.__init__ also needs adaptation.
  • we could have another second class hierarchy for handling precomputed distance matrices, but let's start with this simpler adaptation for now.

Let me know if this is unclear. 🙂

@jjerphan jjerphan changed the title Add the precomputed option to bypass the computation and call the precomputed distance instead. FEAT Support precomputed distance matrix for PairwiseDistancesReductions Jul 23, 2024
@kyrajeep
Copy link
Author

kyrajeep commented Aug 3, 2024

Thank you for your comments. They make sense, are helpful, and I believe I am going in the right direction. So far I have made progress on all the points made above by @jjerphan except for creating an instance to store the precomputed matrix. I think even though there are things (referencing the precomputed matrix and syntax) that should be definitely changed a review will be helpful and probably worthwhile at this point :) Higher level review will be helpful enough as I am familiar with the class hierarchy and codebase.

I have a couple questions that are not syntax related. We have a class for precomputed that subclasses DatasetsPair. In that class we have surrogate_dist and dist functions. My current understanding is that they are both used to compute distance depending on metric, chunking strategy, etc.. I want to create a method that will take in a precomputed matrix, if provided, and another method to retrieve this matrix to be passed into the compute method. Does that sound about right?

As for creating an instance to store the precomputed matrix itself... which file should I include it in?

Feedback will be appreciated! Thank you!

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick review with more detailed explanations than my previous ones.

I pushed a commit (see 3a027c7) since your local venv_sklearn was staged and committed as part of e2ef246. I would recommend updating your global .gitignore with venv* to avoid this in the future, or to use a conda environment.

Signed-off-by: Julien Jerphanion <git@jjerphan.xyz>
kyrajeep and others added 4 commits December 6, 2024 11:45
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@kyrajeep
Copy link
Author

kyrajeep commented Dec 7, 2024

We are nearly there! :)

Thank you for the reviews!

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, run the linters and remove the unused fixture in the tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, revert changes to this file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need tests checking that the results of PairwiseDistancesReduction obtained respectively on (X, Y) or on their precomputed distance matrix are identical.

@Micky774
Copy link
Contributor

Catching up on this PR. I wanted to mention that I don't know if I like the current API with precomputed as a separate argument, since usually we include it as an option for metric. I think keeping with tradition leads to a more consistent experience since e.g. you don't have a situation where both precomputed and metric are defined (which would be entirely redundant) and instead succinctly have metric='precomputed'.

To enable this, we could simply enforce an API where metric='precomputed' expects X to be the precomputed distances, and raises an error if Y is provided (in the rare case that folks accidentally use precomputed when they didn't mean to).

What do you all think?

@jjerphan
Copy link
Member

Thank you for shiming in, @Micky774.

We can indeed revise the API to get them in line with the public ones.

@kyrajeep
Copy link
Author

Hi @Micky774,

I'm sorry to take a while to respond. That's a great point. I agree that including 'precomputed' in one of the options for metrics is clearer. How about making the change in a separate PR so that this PR does not get too lengthy? Thanks!

@Micky774
Copy link
Contributor

Hi @Micky774,

I'm sorry to take a while to respond. That's a great point. I agree that including 'precomputed' in one of the options for metrics is clearer. How about making the change in a separate PR so that this PR does not get too lengthy? Thanks!

Considering that it is a fairly key part of the API being affected by this PR, I think it is better to resolve it here. In my opinion it costs less to fix it now, than it would to start up a new PR for that change. And no worries on the delay -- I appreciate you continuing your work here!

@kyrajeep
Copy link
Author

Ok, thank you :)

@kyrajeep
Copy link
Author

Hi @jjerphan, @Micky774

While working on implementing the change in the API to enable 'precomputed' as one of the metrics, I realized that X = precomputed argument may lead to issues due to the dimensions being different from X. This is from the current DatasetsPair docstring:

X : {ndarray, sparse matrix} of shape (n_samples_X, n_features)
Input data.
If provided as a ndarray, it must be C-contiguous.
If provided as a sparse matrix, it must be in CSR format.
Y : {ndarray, sparse matrix} of shape (n_samples_Y, n_features)
Input data.
If provided as a ndarray, it must be C-contiguous.
If provided as a sparse matrix, it must be in CSR format.
precomputed : ndarray of shape (n_samples_X, n_samples_Y)
Precomputed distance if provided.

What do you all think?

@Micky774
Copy link
Contributor

Hi @jjerphan, @Micky774

While working on implementing the change in the API to enable 'precomputed' as one of the metrics, I realized that X = precomputed argument may lead to issues due to the dimensions being different from X. This is from the current DatasetsPair docstring:

X : {ndarray, sparse matrix} of shape (n_samples_X, n_features) Input data. If provided as a ndarray, it must be C-contiguous. If provided as a sparse matrix, it must be in CSR format. Y : {ndarray, sparse matrix} of shape (n_samples_Y, n_features) Input data. If provided as a ndarray, it must be C-contiguous. If provided as a sparse matrix, it must be in CSR format. precomputed : ndarray of shape (n_samples_X, n_samples_Y) Precomputed distance if provided.

What do you all think?

I'm not sure I see the problem here. If you're worried about the fact that an NxN pre-computed matrix may be a result of NxD data for arbitrary D, then I do not believe it is a problem since the responsibility lies on the user to ensure that they are entering the correct precomputed matrix for their data -- that is not something we could (or should) verify imo.

If that's not what your concern is, then I may need a bit more of an explanation :)

@kyrajeep
Copy link
Author

kyrajeep commented Feb 26, 2025

Hi @jjerphan, @Micky774
While working on implementing the change in the API to enable 'precomputed' as one of the metrics, I realized that X = precomputed argument may lead to issues due to the dimensions being different from X. This is from the current DatasetsPair docstring:
X : {ndarray, sparse matrix} of shape (n_samples_X, n_features) Input data. If provided as a ndarray, it must be C-contiguous. If provided as a sparse matrix, it must be in CSR format. Y : {ndarray, sparse matrix} of shape (n_samples_Y, n_features) Input data. If provided as a ndarray, it must be C-contiguous. If provided as a sparse matrix, it must be in CSR format. precomputed : ndarray of shape (n_samples_X, n_samples_Y) Precomputed distance if provided.
What do you all think?

I'm not sure I see the problem here. If you're worried about the fact that an NxN pre-computed matrix may be a result of NxD data for arbitrary D, then I do not believe it is a problem since the responsibility lies on the user to ensure that they are entering the correct precomputed matrix for their data -- that is not something we could (or should) verify imo.

If that's not what your concern is, then I may need a bit more of an explanation :)

Thank you for your input, @Micky774 ! That's exactly what I meant, but shouldn't we verify the shape of the precomputed matrix to throw an error if the dimension is not correct? What do you think, @jjerphan ? :)

@jjerphan
Copy link
Member

I would look at and implement the same logic as the one of estimator taking a data matrix or a pre-computed distance matrix for X.

@kyrajeep
Copy link
Author

Sounds good. Thank you.

is_usable = False

# is_usable = (X is not None and Y is not None) ^ bool(precomputed)
if is_usable == False:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't is_usable be undefined if metric != "precomputed"?

@@ -1622,7 +1709,7 @@ def test_radius_neighbors_classmode_strategy_consistent(outlier_label):
X=X,
Y=Y,
radius=radius,
metric=metric,
metric=metric,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
metric=metric,
metric=metric,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants