Skip to content

FEAT Support precomputed distance matrix for PairwiseDistancesReductions #29483

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 82 commits into
base: main
Choose a base branch
from

Conversation

kyrajeep
Copy link

@kyrajeep kyrajeep commented Jul 15, 2024

Towards #25888

  1. What does this implement/fix? Explain your changes.
    (1) To take in precomputed distance directly, uses 'compute' in {ArgKmin, RadiusNeighbors}{ClassMode} {32,64}.
    (2) When this is received as a parameter X, initializes an instance of class PrecomputedDistance.
    (3) This precomputed data is formatted by datasets_pair.py
    (4) The data is passed onto the backend by using the get_for() factory method.
    (5) Changes the API according to @Micky774 's suggestion, which is to include 'precomputed' as one of the metrics instead of as an additional parameter. This simplifies its use by taking a precomputed matrix as X and avoids confusion for the users who do not have a precomputed matrix.
    For example, a user with a precomputed matrix specifies metric = 'precomputed' and X = precomputed matrix. Previously, there was an extra 'precomputed' parameter that took a Boolean value.

  2. Any other comments?
    Thank you to @jjerphan for the reviews!

Co-authors: @jjerphan @adam2392

Copy link

github-actions bot commented Jul 15, 2024

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


ruff check

ruff detected issues. Please run ruff check --fix --output-format=full locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.11.7.


sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py:119:12: E712 Avoid equality comparisons to `False`; use `if not is_usable:` for false checks
    |
118 |         # is_usable = (X is not None and Y is not None) ^ bool(precomputed)
119 |         if is_usable == False:
    |            ^^^^^^^^^^^^^^^^^^ E712
120 |             return is_usable
121 |         # FIXME: the current Cython implementation is too slow for a large number of
    |
    = help: Replace with `not is_usable`

sklearn/metrics/tests/test_pairwise_distances_reduction.py:408:89: E501 Line too long (96 > 88)
    |
406 | @pytest.mark.parametrize("cls", [ArgKmin, RadiusNeighbors])
407 | def test_precompute_invalid_metric(cls):
408 |     """Test that ValueError is raised when metric is not 'precomputed' for precomputed input."""
    |                                                                                         ^^^^^^^^ E501
409 |     X = np.random.rand(10, 10)  # Precomputed matrix
410 |     with pytest.raises(
    |

sklearn/metrics/tests/test_pairwise_distances_reduction.py:471:5: F811 Redefinition of unused `assert_precomputed` from line 106
    |
471 | def assert_precomputed(precomputed, n_samples_X, n_samples_Y):
    |     ^^^^^^^^^^^^^^^^^^ F811
472 |     """
473 |     Validates a precomputed matrix for compatibility.
    |
    = help: Remove definition: `assert_precomputed`

sklearn/metrics/tests/test_pairwise_distances_reduction.py:499:89: E501 Line too long (112 > 88)
    |
497 |     if precomputed.shape != expected_shape:
498 |         raise AssertionError(
499 |             f"Incorrect dimensions for precomputed matrix. Expected: {expected_shape}, Got: {precomputed.shape}"
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^ E501
500 |         )
    |

Found 4 errors.
No fixes available (1 hidden fix can be enabled with the `--unsafe-fixes` option).

ruff format

ruff detected issues. Please run ruff format locally and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.11.7.


--- sklearn/metrics/tests/test_pairwise_distances_reduction.py
+++ sklearn/metrics/tests/test_pairwise_distances_reduction.py
@@ -378,11 +378,6 @@
 }
 
 
-
-
-
-
-
 @pytest.mark.parametrize("cls", [ArgKmin, RadiusNeighbors])
 def test_precompute_all_inputs_none(cls):
     """Test that ValueError is raised when all inputs are None."""

1 file would be reformatted, 917 files already formatted

mypy

mypy detected issues. Please fix them locally and push the changes. Here you can see the detected issues. Note that the installed mypy version is mypy=1.15.0.


sklearn/metrics/tests/test_pairwise_distances_reduction.py:471: error: Name "assert_precomputed" already defined on line 106  [no-redef]
Found 1 error in 1 file (checked 558 source files)

Generated for commit: 15319c9. Link to the linter CI: here

@Micky774
Copy link
Contributor

Also, in the abstract methods dist and surrogate_dist of PrecomputedDistanceMatrix, the indices [i,j] are required. Does this mean that other indices in the same dataset need computation and are not precomputed?

No. Keep in mind that surrogate_dist and dist methods are for scalar distances, i.e. they take a pair of indices and produce the distance between exactly those two data vectors. Hence, for the precomputed distances, it's simply indexing into the precomputed distances array.

@kyrajeep
Copy link
Author

Thank you for the clarification

@kyrajeep
Copy link
Author

In class PrecomputedDistanceMatrix, I don't see the option to input the precomputed distance matrix. Should I make that option or is it stored somewhere else?

@jjerphan
Copy link
Member

jjerphan commented Jul 15, 2024

Hi! I don't have much time and energy to have a look at it in details but basically:

  • a reference to the distance matrix needs to be passed to the compute methods of the dispatchers (which subclass BaseDistancesReductionDispatcher)
  • this reference needs to be passed then to the constructors of the Cython implementations (which subclass BaseDistancesReduction{{name_suffix}}) similarly to X and Y and be stored in an instance of a potentially new DatasetsPair subclass.
  • the logic for computing chunks size and other pieces of information in BaseDistancesReduction{{name_suffix}}.__init__ also needs adaptation.
  • we could have another second class hierarchy for handling precomputed distance matrices, but let's start with this simpler adaptation for now.

Let me know if this is unclear. 🙂

@jjerphan jjerphan changed the title Add the precomputed option to bypass the computation and call the precomputed distance instead. FEAT Support precomputed distance matrix for PairwiseDistancesReductions Jul 23, 2024
@kyrajeep
Copy link
Author

kyrajeep commented Aug 3, 2024

Thank you for your comments. They make sense, are helpful, and I believe I am going in the right direction. So far I have made progress on all the points made above by @jjerphan except for creating an instance to store the precomputed matrix. I think even though there are things (referencing the precomputed matrix and syntax) that should be definitely changed a review will be helpful and probably worthwhile at this point :) Higher level review will be helpful enough as I am familiar with the class hierarchy and codebase.

I have a couple questions that are not syntax related. We have a class for precomputed that subclasses DatasetsPair. In that class we have surrogate_dist and dist functions. My current understanding is that they are both used to compute distance depending on metric, chunking strategy, etc.. I want to create a method that will take in a precomputed matrix, if provided, and another method to retrieve this matrix to be passed into the compute method. Does that sound about right?

As for creating an instance to store the precomputed matrix itself... which file should I include it in?

Feedback will be appreciated! Thank you!

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick review with more detailed explanations than my previous ones.

I pushed a commit (see 3a027c7) since your local venv_sklearn was staged and committed as part of e2ef246. I would recommend updating your global .gitignore with venv* to avoid this in the future, or to use a conda environment.

Signed-off-by: Julien Jerphanion <git@jjerphan.xyz>
kyrajeep and others added 2 commits December 6, 2024 11:46
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@kyrajeep
Copy link
Author

kyrajeep commented Dec 7, 2024

We are nearly there! :)

Thank you for the reviews!

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, run the linters and remove the unused fixture in the tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, revert changes to this file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need tests checking that the results of PairwiseDistancesReduction obtained respectively on (X, Y) or on their precomputed distance matrix are identical.

@Micky774
Copy link
Contributor

Catching up on this PR. I wanted to mention that I don't know if I like the current API with precomputed as a separate argument, since usually we include it as an option for metric. I think keeping with tradition leads to a more consistent experience since e.g. you don't have a situation where both precomputed and metric are defined (which would be entirely redundant) and instead succinctly have metric='precomputed'.

To enable this, we could simply enforce an API where metric='precomputed' expects X to be the precomputed distances, and raises an error if Y is provided (in the rare case that folks accidentally use precomputed when they didn't mean to).

What do you all think?

@jjerphan
Copy link
Member

Thank you for shiming in, @Micky774.

We can indeed revise the API to get them in line with the public ones.

@kyrajeep
Copy link
Author

Hi @Micky774,

I'm sorry to take a while to respond. That's a great point. I agree that including 'precomputed' in one of the options for metrics is clearer. How about making the change in a separate PR so that this PR does not get too lengthy? Thanks!

@Micky774
Copy link
Contributor

Hi @Micky774,

I'm sorry to take a while to respond. That's a great point. I agree that including 'precomputed' in one of the options for metrics is clearer. How about making the change in a separate PR so that this PR does not get too lengthy? Thanks!

Considering that it is a fairly key part of the API being affected by this PR, I think it is better to resolve it here. In my opinion it costs less to fix it now, than it would to start up a new PR for that change. And no worries on the delay -- I appreciate you continuing your work here!

@kyrajeep
Copy link
Author

Ok, thank you :)

@kyrajeep
Copy link
Author

Hi @jjerphan, @Micky774

While working on implementing the change in the API to enable 'precomputed' as one of the metrics, I realized that X = precomputed argument may lead to issues due to the dimensions being different from X. This is from the current DatasetsPair docstring:

X : {ndarray, sparse matrix} of shape (n_samples_X, n_features)
Input data.
If provided as a ndarray, it must be C-contiguous.
If provided as a sparse matrix, it must be in CSR format.
Y : {ndarray, sparse matrix} of shape (n_samples_Y, n_features)
Input data.
If provided as a ndarray, it must be C-contiguous.
If provided as a sparse matrix, it must be in CSR format.
precomputed : ndarray of shape (n_samples_X, n_samples_Y)
Precomputed distance if provided.

What do you all think?

@Micky774
Copy link
Contributor

Hi @jjerphan, @Micky774

While working on implementing the change in the API to enable 'precomputed' as one of the metrics, I realized that X = precomputed argument may lead to issues due to the dimensions being different from X. This is from the current DatasetsPair docstring:

X : {ndarray, sparse matrix} of shape (n_samples_X, n_features) Input data. If provided as a ndarray, it must be C-contiguous. If provided as a sparse matrix, it must be in CSR format. Y : {ndarray, sparse matrix} of shape (n_samples_Y, n_features) Input data. If provided as a ndarray, it must be C-contiguous. If provided as a sparse matrix, it must be in CSR format. precomputed : ndarray of shape (n_samples_X, n_samples_Y) Precomputed distance if provided.

What do you all think?

I'm not sure I see the problem here. If you're worried about the fact that an NxN pre-computed matrix may be a result of NxD data for arbitrary D, then I do not believe it is a problem since the responsibility lies on the user to ensure that they are entering the correct precomputed matrix for their data -- that is not something we could (or should) verify imo.

If that's not what your concern is, then I may need a bit more of an explanation :)

@kyrajeep
Copy link
Author

kyrajeep commented Feb 26, 2025

Hi @jjerphan, @Micky774
While working on implementing the change in the API to enable 'precomputed' as one of the metrics, I realized that X = precomputed argument may lead to issues due to the dimensions being different from X. This is from the current DatasetsPair docstring:
X : {ndarray, sparse matrix} of shape (n_samples_X, n_features) Input data. If provided as a ndarray, it must be C-contiguous. If provided as a sparse matrix, it must be in CSR format. Y : {ndarray, sparse matrix} of shape (n_samples_Y, n_features) Input data. If provided as a ndarray, it must be C-contiguous. If provided as a sparse matrix, it must be in CSR format. precomputed : ndarray of shape (n_samples_X, n_samples_Y) Precomputed distance if provided.
What do you all think?

I'm not sure I see the problem here. If you're worried about the fact that an NxN pre-computed matrix may be a result of NxD data for arbitrary D, then I do not believe it is a problem since the responsibility lies on the user to ensure that they are entering the correct precomputed matrix for their data -- that is not something we could (or should) verify imo.

If that's not what your concern is, then I may need a bit more of an explanation :)

Thank you for your input, @Micky774 ! That's exactly what I meant, but shouldn't we verify the shape of the precomputed matrix to throw an error if the dimension is not correct? What do you think, @jjerphan ? :)

@jjerphan
Copy link
Member

I would look at and implement the same logic as the one of estimator taking a data matrix or a pre-computed distance matrix for X.

@kyrajeep
Copy link
Author

Sounds good. Thank you.

is_usable = False

# is_usable = (X is not None and Y is not None) ^ bool(precomputed)
if is_usable == False:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't is_usable be undefined if metric != "precomputed"?

@@ -1622,7 +1709,7 @@ def test_radius_neighbors_classmode_strategy_consistent(outlier_label):
X=X,
Y=Y,
radius=radius,
metric=metric,
metric=metric,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
metric=metric,
metric=metric,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants