Skip to content

RFC Should pairwise_distances preserve float32 ? #24502

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jeremiedbb opened this issue Sep 23, 2022 · 4 comments
Open

RFC Should pairwise_distances preserve float32 ? #24502

jeremiedbb opened this issue Sep 23, 2022 · 4 comments

Comments

@jeremiedbb
Copy link
Member

jeremiedbb commented Sep 23, 2022

Currently the dtype of the distance matrix returned by pairwise_distances is not very consistent, depending on the metric and on the value of n_jobs.

For float64 input, everything is consistent: the returned matrix is always in float64.
For mixed float64 X and float32 Y, the return matrix is also always in float64 and this is what should be expected imo.

The troubles come when both X and Y are float32.

  • for sklearn metrics:
    • euclidean and cosine: result is always float32
    • manhattan: result is float64 if n_jobs=1 and float32 otherwise
  • for scipy metrics: result is float64 if n_jobs=1 and float32 otherwise
    Note that scipy cdist/pdist always returns float64.

Hence the question: should pairwise_distances preserve float32 ?

My opinion is that it should since pairwise_distances can be used as an intermediate step during fit and since there's ongoing work towards preserving float32 in estimators (see #11000 for transfromers for instance).

An argument against that could be reducing the numerical instabilities. A potential solution could be to use float64 accumulators for the intermediate computations only, still returning a float32 dist matrix. Note that with #23958 we might not need to use the scipy metrics anymore, in favor of the ones defined in dist_metrics, and using float64 accumulators would be easier to implement generally.

Answering this question will help to not go in the wrong direction in #23958

@thomasjpfan
Copy link
Member

If possible, I think if X and Y are both float32 then the return should be float32.

A potential solution could be to use float64 accumulators for the intermediate computations only, still returning a float32 dist matrix.

For me, using a float64 accumulator is a requirement if we want to preserve float32.

@jjerphan
Copy link
Member

I agree with @thomasjpfan.

jjerphan added a commit to jjerphan/scikit-learn that referenced this issue Sep 27, 2022
@ogrisel
Copy link
Member

ogrisel commented Sep 27, 2022

If possible, I think if X and Y are both float32 then the return should be float32.

+1 as well. I have the feeling that internal float64 accumulator is unlikely to hold more meaningfully precise information than its float32-downcast version once the computation has of the distances has finished. But maybe I am wrong, if so I would be happy to see a counter example and I would update my vote accordingly.

@jjerphan
Copy link
Member

jjerphan commented Sep 27, 2022

So I guess everyone agrees on dtypes' preservation.

#23958 has been updated accordingly and would treat most cases once merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants