Skip to content

Inconsistent check_pairwise_arrays in pairwise_distances #31162

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
lucyleeow opened this issue Apr 9, 2025 · 6 comments
Closed

Inconsistent check_pairwise_arrays in pairwise_distances #31162

lucyleeow opened this issue Apr 9, 2025 · 6 comments

Comments

@lucyleeow
Copy link
Member

lucyleeow commented Apr 9, 2025

Note: This is not necessarily a 'problem' to be fixed but it could be better documented at least.

In pairwise_distances, check_pairwise_arrays is not run if metric is a callable or a metric in PAIRWISE_DISTANCE_FUNCTIONS.
check_pairwise_arrays notably ensures the inputs are 2D and performs check_array, which does more than I fully understand, so I can't confidently comment on whether it is 'needed'.

Most PAIRWISE_DISTANCE_FUNCTIONS metrics run check_pairwise_arrays themselves but the following do not:

  • haversine_distances (this is implemented in Cython but does seem to check that X is 2D)
  • cosine_distances

For user provided metric callables, I think we leave it up to the user to ensure input is correct. I found this comment (#27456 (comment)) in and old PR, about leaving it to the user to ensure the input is 2D.

We also do have the note:

# Only check the number of features if 2d arrays are enforced. Otherwise,
# validation is left to the user for custom metrics.

I am not sure this comment is in the best place, but that is an easy fix. check_array is also not performed for user provided metric callables, but maybe this is not 'needed' ?

cc @jeremiedbb and maybe @adrinjalali ?

Context: noticed while working on #29822

@github-actions github-actions bot added the Needs Triage Issue requires triage label Apr 9, 2025
@jeremiedbb
Copy link
Member

In pairwise_distances, check_pairwise_arrays is not run if metric is a callable or a metric in PAIRWISE_DISTANCE_FUNCTIONS.

For metrics in PAIRWISE_DISTANCE_FUNCTIONS, it delegates the check to the metric functions since they already do it. Otherwise it would result in a double validation.

Most PAIRWISE_DISTANCE_FUNCTIONS metrics run check_pairwise_arrays themselves but the following do not:
haversine_distances (this is implemented in Cython but does seem to check that X is 2D)
cosine_distances

cosine_distances does because it calls cosine_similarity which calls check_pairwise_arrays.

haversine_distances doesn't indeed. The cython implementation calls check array separately. To me the python haversine_distances function should run the validation as well as all other public functions or classes directly calling DistanceMetrics, but that goes beyond just the haversine distances then and I don't know how many functions are concerned. The current state is probably good enough and don't seem to have been problematic so far.

check_array is also not performed for user provided metric callables, but maybe this is not 'needed' ?

Yes, in case of a callable, all validations are left to the callable. It's done on purpose to allow for use cases that we don't have within sklearn but may be useful more generally like distances between strings.

@jeremiedbb jeremiedbb removed the Needs Triage Issue requires triage label Apr 9, 2025
@jeremiedbb
Copy link
Member

To summarize, I think the status of check_pairwise_arrays in pairwise_distances is in accordance to our expectations and that we can close this.

@lucyleeow
Copy link
Member Author

Sure but we could at least leave a comment in the code about:

Yes, in case of a callable, all validations are left to the callable.

@jeremiedbb
Copy link
Member

Indeed adding a comment would not hurt :)

@jeremiedbb
Copy link
Member

Yes, in case of a callable, all validations are left to the callable. It's done on purpose to allow for use cases that we don't have within sklearn but may be useful more generally like distances between strings.

That was wrong. We still call check_pairwise_arrays (in _pairwise_callable). We just don't force 2D arrays here.

So in the end the use of check_pairwise_arrays is pretty consistent. There's only haversine that is special and implements its own validation but I think it's good enough.

@lucyleeow
Copy link
Member Author

lucyleeow commented Apr 10, 2025

Yeah I saw that later too. Agreed, it's enough

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants