-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
RFC Should pairwise_distances preserve float32 ? #24502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
If possible, I think if
For me, using a float64 accumulator is a requirement if we want to preserve float32. |
I agree with @thomasjpfan. |
Following discussions in: scikit-learn#24502
+1 as well. I have the feeling that internal |
So I guess everyone agrees on dtypes' preservation. #23958 has been updated accordingly and would treat most cases once merged. |
Currently the dtype of the distance matrix returned by
pairwise_distances
is not very consistent, depending on the metric and on the value of n_jobs.For float64 input, everything is consistent: the returned matrix is always in float64.
For mixed float64 X and float32 Y, the return matrix is also always in float64 and this is what should be expected imo.
The troubles come when both X and Y are float32.
euclidean
andcosine
: result is always float32manhattan
: result is float64 if n_jobs=1 and float32 otherwiseNote that scipy cdist/pdist always returns float64.
Hence the question: should
pairwise_distances
preserve float32 ?My opinion is that it should since
pairwise_distances
can be used as an intermediate step during fit and since there's ongoing work towards preserving float32 in estimators (see #11000 for transfromers for instance).An argument against that could be reducing the numerical instabilities. A potential solution could be to use float64 accumulators for the intermediate computations only, still returning a float32 dist matrix. Note that with #23958 we might not need to use the scipy metrics anymore, in favor of the ones defined in
dist_metrics
, and using float64 accumulators would be easier to implement generally.Answering this question will help to not go in the wrong direction in #23958
The text was updated successfully, but these errors were encountered: