
ENH Euclidean specialization of DatasetsPair instead of ArgKmin and RadiusNeighbors #25170


Conversation


@Vincent-Maladiere Vincent-Maladiere commented Dec 12, 2022

Reference Issues/PRs

Follow up of #25044

> Redesign DatasetsPair w.r.t. the new MiddleTermComputer to remove duplicated logic and to clarify responsibilities (esp. squared Euclidean norm computations)

What does this implement/fix? Explain your changes.

This PR moves the Euclidean specialization logic from EuclideanArgKmin and EuclideanRadiusNeighbors into Euclidean{DenseDense, SparseSparse}DatasetsPair.

This is how EuclideanArgKmin and EuclideanRadiusNeighbors are currently tied to MiddleTermComputer:

[diagram: current design, with EuclideanArgKmin and EuclideanRadiusNeighbors tied directly to MiddleTermComputer]

This is how this PR suggests removing EuclideanArgKmin and EuclideanRadiusNeighbors and introducing Euclidean{DenseDense, SparseSparse}DatasetsPair instead:

[diagram: proposed design, with Euclidean{DenseDense, SparseSparse}DatasetsPair owning the MiddleTermComputer]

Any other comments?

Done

  • DatasetsPair is instantiated during the __init__ of BaseDistancesReduction because some parameters computed in the latter are needed by the former.
  • {DenseDense, SparseSparse}MiddleTermComputer are instantiated directly during the __init__ of Euclidean{DenseDense, SparseSparse}DatasetsPair, removing the need for a get_for classmethod to dispatch cases in MiddleTermComputer.
  • parallel_on_{X, Y}_pre_compute_and_reduce_distances_on_chunks() in {ArgKmin, RadiusNeighbors} computes and stores dist_middle_term within MiddleTermComputer.
  • Calling ArgKmin.surrogate_dist() or RadiusNeighbors.surrogate_dist() performs a call to DatasetsPair and then to MiddleTermComputer to get the dist_middle_term quantity.
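The ownership chain described in these points can be sketched in plain Python (a hedged illustration only: class names follow the PR description, but signatures and bodies are simplified stand-ins for the actual Cython implementation):

```python
import numpy as np

class MiddleTermComputer:
    """Computes the -2 * X @ Y.T "middle term" of the squared Euclidean
    expansion ||x - y||^2 = ||x||^2 - 2<x, y> + ||y||^2, chunk by chunk."""

    def compute_dist_middle_terms(self, X_chunk, Y_chunk):
        return -2.0 * X_chunk @ Y_chunk.T


class EuclideanDenseDenseDatasetsPair:
    """Owns its MiddleTermComputer directly, so no get_for dispatch is needed."""

    def __init__(self, X, Y):
        self.X, self.Y = X, Y
        self.middle_term_computer = MiddleTermComputer()
        # Squared norms of Y, precomputed once at construction.
        self.Y_sq_norms = np.einsum("ij,ij->i", Y, Y)

    def surrogate_dist(self, i, j, dist_middle_terms):
        # For a fixed query row i, ||x_i||^2 is a constant offset, so
        # ||y_j||^2 - 2<x_i, y_j> preserves the ordering of distances.
        return self.Y_sq_norms[j] + dist_middle_terms[i, j]
```

In the real code, dist_middle_terms would be filled chunk-wise by parallel_on_{X, Y}_pre_compute_and_reduce_distances_on_chunks() before surrogate_dist is consulted.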

TODO

cc @jjerphan (and @Arnaud15 who is interested in this PR).

  • See this note for more details on the Euclidean specialization (in French).


@jjerphan jjerphan left a comment


Thanks @Vincent-Maladiere for this significant rework! 💯

The design looks good and implementable at first sight.

I think we had better check for any performance regression now before proceeding. In this regard, I think scikit-learn/pairwise-distances-reductions-asv-suite can help you.

I recommend reducing the parametrisation of the benchmark to cover:

  • metric="euclidean"
  • strategy="auto"
  • dtype in [float32, float64]
  • (X_train, X_test) in (("dense", "dense"), ("csr", "csr"))
  • return_distance=True

and running them only for time_ArgKmin, which can be done using:

asv continuous --verbose --show-stderr --split \
    --bench "PairwiseDistancesReductionsBenchmark.time_ArgKmin" \
    main refs/pull/25170/head

On a machine with 128 physical cores, this might easily take 10 hours to run (because the default timeout has been changed from 500 to 1000).

You might want to start with quick runs to check for any problems or to get a rough assessment of performance. For this, I recommend:

  • running the benchmark on pairs of datasets with small to medium numbers of samples (i.e. 1000 up to 100_000)
  • adding the --quick option to the asv continuous command above so that each combination runs only once
  • decreasing the timeout down to a few hundred seconds

In the meantime, here are a few comments on this first iteration.

@@ -417,8 +350,12 @@ cdef class SparseSparseMiddleTermComputer{{name_suffix}}(MiddleTermComputer{{nam

def __init__(
self,
X,
Y,
const DTYPE_t[:] X_data,
Member

Suggested change
const DTYPE_t[:] X_data,
const {{INPUT_DTYPE_t}}[:] X_data,

const DTYPE_t[:] X_data,
const SPARSE_INDEX_TYPE_t[:] X_indices,
const SPARSE_INDEX_TYPE_t[:] X_indptr,
const DTYPE_t[:] Y_data,
Member

Suggested change
const DTYPE_t[:] Y_data,
const {{INPUT_DTYPE_t}}[:] Y_data,

Contributor Author

Thanks, this is now fixed.

This one is interesting, though, because in the main branch we currently cast X_data and Y_data to float64, and we call the routine sparse_sparse_middle_term_computation_64 for both SparseMiddleTermComputer{32, 64}.

I saw that you changed this behavior in your work on SparseDenseDatasetsPair to avoid casting. Could you provide more details about this decision?

Comment on lines 728 to 734
self.middle_term_computer._compute_dist_middle_terms(
X_start,
X_end,
Y_start,
Y_end,
thread_num,
)
Member

I think this should rather be moved into MiddleTermComputer._parallel_on_Y_pre_compute_and_reduce_distances_on_chunks.

Comment on lines 680 to 686
self.middle_term_computer._compute_dist_middle_terms(
X_start,
X_end,
Y_start,
Y_end,
thread_num,
)
Member

I think this should rather be moved into MiddleTermComputer._parallel_on_X_pre_compute_and_reduce_distances_on_chunks.

):
super().__init__(X, Y, distance_metric)

# Used to compute the surrogate distance.
Member

I think this can be read as:

It used to compute the surrogate distance.

rather than (what I think you meant):

It is used to compute the surrogate distance.

Moreover, can you explain how this is being used?

Comment on lines +137 to +138
# several orders of magnitude compared to the generic {ArgKmin, RadiusNeighbors}
# implementation.
Member

Suggested change
# several orders of magnitude compared to the generic {ArgKmin, RadiusNeighbors}
# implementation.
# several orders of magnitude compared to the generic
# {ArgKmin, RadiusNeighbors} implementations.

Comment on lines +126 to +127
if use_euclidean_specialization:
use_squared_distances = metric == "sqeuclidean"
Member

I think the right-hand-side expression can be inlined at the two call sites that use use_squared_distances hereinafter, and these two lines can be removed.

Comment on lines 71 to +72
dict metric_kwargs=None,
dict euclidean_kwargs=None,
Member

I would combine both extra kwargs dictionaries here into a single metric_kwargs.

What do you think?

Contributor Author

Yes, I think that would make sense and would be better than having two kwargs.
We need to extract the variables related to MiddleTermComputer from metric_kwargs before passing metric_kwargs to DistanceMetric during DatasetsPair.get_for().
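A minimal sketch of that extraction step (the helper name is hypothetical, and the key names X_norm_squared / Y_norm_squared are assumptions about which entries would belong to MiddleTermComputer):

```python
def split_metric_kwargs(metric_kwargs):
    """Pop MiddleTermComputer-related entries out of metric_kwargs so that
    only DistanceMetric-compatible entries are forwarded to get_for()."""
    remaining = dict(metric_kwargs or {})
    middle_term_kwargs = {
        key: remaining.pop(key)
        for key in ("X_norm_squared", "Y_norm_squared")
        if key in remaining
    }
    return middle_term_kwargs, remaining
```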

)


cdef class EuclideanRadiusNeighbors{{name_suffix}}(RadiusNeighbors{{name_suffix}}):
Member

Once again: So noice. 💯


@jjerphan jjerphan left a comment


A few other comments on top of the older ones.

Comment on lines 146 to 151
X_start,
Y_start,
X_start + i,
Y_start + j,
i,
j,
n_samples_Y,
thread_num,
Member

Is it possible to use keyword arguments, here?

This comment also applies to other places calling DatasetsPair.surrogate_dist.

@@ -41,6 +41,7 @@ cdef class DatasetsPair{{name_suffix}}:
ITYPE_t Y_start,
ITYPE_t i,
ITYPE_t j,
ITYPE_t n_Y,
ITYPE_t thread_num=*
Member

What is the reason to use * as a default here? What are the semantics?

Contributor Author

I added * so that surrogate_dist could take 0 as the default for thread_num. Since thread_num is a new argument of surrogate_dist, I've set a default parameter to avoid impacting other calls. WDYT?
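For readers unfamiliar with this Cython idiom (a generic sketch, not the actual declarations from this PR — file names are illustrative): in a .pxd declaration file, a default value cannot be spelled out, so `=*` only declares "this argument has a default", and the concrete value appears in the matching .pyx implementation.

```cython
# datasets_pair.pxd — the declaration: `=*` only signals "has a default"
cdef class DatasetsPair:
    cdef DTYPE_t surrogate_dist(
        self, ITYPE_t i, ITYPE_t j, ITYPE_t thread_num=*
    ) nogil

# datasets_pair.pyx — the implementation carries the actual default value
cdef class DatasetsPair:
    cdef DTYPE_t surrogate_dist(
        self, ITYPE_t i, ITYPE_t j, ITYPE_t thread_num=0
    ) nogil:
        ...
```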


jjerphan commented Dec 19, 2022

@Vincent-Maladiere: @Micky774 has been working on
Micky774#7 to meet similar needs for #24076.

As discussed with @Micky774 on a call, you both might be interested in syncing up, since this PR and the one linked above are highly similar in scope.

@Vincent-Maladiere
Contributor Author

I ran the ASV benchmark between main and this PR branch @63bba0cd. Using 128 cores for the dense-dense case of ArgKmin only, runtimes are significantly worse for 5 cases out of 18.

I will run profiling via py-spy / speedscope on this branch to compare it with main so that we can locate the bottleneck. @jjerphan WDYT?

before           after         ratio
     [ce89a4ff]       [63bba0cd]
     <main>           <pull/25170/head>
+         257±2ms         385±10ms     1.50  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_ArgKmin(10000, 100000, 100, 'euclidean', 'auto', <class 'numpy.float64'>, 'dense', 'dense')
+        283±10ms         396±10ms     1.40  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_ArgKmin(10000, 100000, 100, 'euclidean', 'auto', <class 'numpy.float32'>, 'dense', 'dense')
+         3.97±0m          5.24±0m     1.32  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_ArgKmin(10000000, 100000, 100, 'euclidean', 'auto', <class 'numpy.float64'>, 'dense', 'dense')
+         4.03±0m          4.96±0m     1.23  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_ArgKmin(10000000, 100000, 100, 'euclidean', 'auto', <class 'numpy.float32'>, 'dense', 'dense')
+         27.3±0s          32.9±0s     1.20  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_ArgKmin(10000000, 10000, 100, 'euclidean', 'auto', <class 'numpy.float64'>, 'dense', 'dense')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
Full ASV results
[ 0.00%] · For scikit-learn commit 63bba0cd <pull/25170/head> (round 1/1):
[50.00%] ··· ========== ======== ============ =========== ========== =============== ========= ======== ============
              n_train    n_test   n_features     metric    strategy       dtype       X_train   X_test
             ---------- -------- ------------ ----------- ---------- --------------- --------- -------- ------------
                1000      1000       100       euclidean     auto     numpy.float32    dense    dense     224±4ms
                1000      1000       100       euclidean     auto     numpy.float64    dense    dense     206±4ms
                1000     10000       100       euclidean     auto     numpy.float32    dense    dense     104±2ms
                1000     10000       100       euclidean     auto     numpy.float64    dense    dense     88.2±3ms
                1000     100000      100       euclidean     auto     numpy.float32    dense    dense     152±9ms
                1000     100000      100       euclidean     auto     numpy.float64    dense    dense     103±20ms
               10000      1000       100       euclidean     auto     numpy.float32    dense    dense     229±2ms
               10000      1000       100       euclidean     auto     numpy.float64    dense    dense     223±20ms
               10000     10000       100       euclidean     auto     numpy.float32    dense    dense    1.47±0.06s
               10000     10000       100       euclidean     auto     numpy.float64    dense    dense    1.31±0.06s
               10000     100000      100       euclidean     auto     numpy.float32    dense    dense     396±10ms
               10000     100000      100       euclidean     auto     numpy.float64    dense    dense     385±10ms
              10000000    1000       100       euclidean     auto     numpy.float32    dense    dense    3.19±0.1s
              10000000    1000       100       euclidean     auto     numpy.float64    dense    dense    3.49±0.01s
              10000000   10000       100       euclidean     auto     numpy.float32    dense    dense     30.0±0s
              10000000   10000       100       euclidean     auto     numpy.float64    dense    dense     32.9±0s
              10000000   100000      100       euclidean     auto     numpy.float32    dense    dense     4.96±0m
              10000000   100000      100       euclidean     auto     numpy.float64    dense    dense     5.24±0m
             ========== ======== ============ =========== ========== =============== ========= ======== ============

[50.00%] · For scikit-learn commit ce89a4ff <main> (round 1/1):
[50.00%] ·· Building for conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[100.00%] ··· ========== ======== ============ =========== ========== =============== ========= ======== ============
               n_train    n_test   n_features     metric    strategy       dtype       X_train   X_test
              ---------- -------- ------------ ----------- ---------- --------------- --------- -------- ------------
                 1000      1000       100       euclidean     auto     numpy.float32    dense    dense     222±3ms
                 1000      1000       100       euclidean     auto     numpy.float64    dense    dense     204±6ms
                 1000     10000       100       euclidean     auto     numpy.float32    dense    dense     98.3±1ms
                 1000     10000       100       euclidean     auto     numpy.float64    dense    dense     88.3±1ms
                 1000     100000      100       euclidean     auto     numpy.float32    dense    dense     130±20ms
                 1000     100000      100       euclidean     auto     numpy.float64    dense    dense    91.1±30ms
                10000      1000       100       euclidean     auto     numpy.float32    dense    dense     234±2ms
                10000      1000       100       euclidean     auto     numpy.float64    dense    dense     227±5ms
                10000     10000       100       euclidean     auto     numpy.float32    dense    dense    1.41±0.04s
                10000     10000       100       euclidean     auto     numpy.float64    dense    dense    1.22±0.05s
                10000     100000      100       euclidean     auto     numpy.float32    dense    dense     283±10ms
                10000     100000      100       euclidean     auto     numpy.float64    dense    dense     257±2ms
               10000000    1000       100       euclidean     auto     numpy.float32    dense    dense    2.93±0.01s
               10000000    1000       100       euclidean     auto     numpy.float64    dense    dense     2.93±0s
               10000000   10000       100       euclidean     auto     numpy.float32    dense    dense     27.6±0s
               10000000   10000       100       euclidean     auto     numpy.float64    dense    dense     27.3±0s
               10000000   100000      100       euclidean     auto     numpy.float32    dense    dense     4.03±0m
               10000000   100000      100       euclidean     auto     numpy.float64    dense    dense     3.97±0m
              ========== ======== ============ =========== ========== =============== ========= ======== ============


jjerphan commented Dec 20, 2022

@Vincent-Maladiere: thanks for reporting those results!

As you have proposed, I would profile the first reported case on the two commits (i.e. 63bba0cd for this branch, pull/25170/head, and main) using py-spy on your local machine, that is:

+         257±2ms         385±10ms     1.50  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_ArgKmin(10000, 100000, 100, 'euclidean', 'auto', <class 'numpy.float64'>, 'dense', 'dense')

Native code can be profiled and results exported for SpeedScope using:

py-spy record --native -o py-spy.profile -f speedscope -- python ./script.py

Profile files can be uploaded here so that they can be inspected by readers as well.


Vincent-Maladiere commented Dec 22, 2022

For future readers: the py-spy --native option, which collects traces in Cython, C, or C++ code, is unavailable on macOS.

I ran the following benchmark on a single thread to ease the understanding with py-spy:

export OMP_NUM_THREADS=1

import numpy as np
import threadpoolctl

from sklearn.metrics._pairwise_distances_reduction import ArgKmin

rng = np.random.RandomState(42)

dtype = np.float64
n_train, n_test, n_features = 10_000, 100_000, 100
X_train = rng.randn(n_train, n_features).astype(dtype)
X_test = rng.randn(n_test, n_features).astype(dtype)

controller = threadpoolctl.ThreadpoolController()
with controller.limit(limits=1, user_api=None):

    ArgKmin.compute(
        X=X_test,
        Y=X_train,
        k=10,
        metric="euclidean",
        return_distance=True,
        strategy="auto",
    )

GitHub can't handle py-spy SVG profiles, so I placed them on Dropbox. You can download them at:
py-spy-main
py-spy-revamp

and upload each profile on speedscope.app.

Using the left-heavy view of speedscope, one can see that EuclideanDenseDenseDatasetsPair.surrogate_dist() takes up to twice as long to execute as our current EuclideanArgKmin.compute_and_reduce_distances_on_chunks().

So, in an attempt to optimize EuclideanDenseDenseDatasetsPair.surrogate_dist(), I moved dist_middle_terms_chunks from MiddleTermComputer to EuclideanDenseDenseDatasetsPair so that surrogate_dist() can fetch dist_middle_terms_chunks from self directly, without having to access it through self.middle_term_computer.

However, after a new ASV benchmark on 64 cores, we still have a similar performance decrease.

Another possible explanation for this decrease is that we make a lot of calls to surrogate_dist, and the per-call overhead might be costly. I'm now trying to reduce the logic of this function to a simple lookup into a pre-computed vector.
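A NumPy sketch of that "pre-computed lookup" direction (hypothetical, just to illustrate the idea — the real implementation does this in Cython with a BLAS GEMM): precompute everything chunk-wise, so each later surrogate_dist(i, j) reduces to a plain array read.

```python
import numpy as np

def precompute_surrogates(X_chunk, Y_chunk):
    # ||x - y||^2 = ||x||^2 - 2<x, y> + ||y||^2; since ||x_i||^2 is a
    # constant offset per query row, ||y_j||^2 - 2<x_i, y_j> preserves the
    # ordering of distances. One GEMM plus one vector of squared norms per
    # chunk pair; surrogate_dist(i, j) then becomes surrogates[i, j].
    Y_sq = np.einsum("ij,ij->i", Y_chunk, Y_chunk)
    return -2.0 * X_chunk @ Y_chunk.T + Y_sq[None, :]
```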

WDYT @jjerphan?


jjerphan commented Jan 2, 2023

Thanks for reporting this, @Vincent-Maladiere.

Another possible explanation for this decrease is that we make a lot of function calls to surrogate_dist and that might be inefficient (due to some overhead). I'm now trying to reduce the logic of this function as a simple lookup from a pre-computed vector.

I think this might be a reason for the performance regression. Still, when looking at the two py-spy profiles you report with SpeedScope, it seems that the GEMM kernel (dgemm_kernel_PRESCOTT) also takes more time to run. Potentially, the latest changes you made altered the behavior of the implementation with respect to data and cache locality: this can go as far as flushing the L1 instruction cache, which previously might have held the kernel's instructions but might no longer.

I think we need to spend some time inspecting it with other tools like perf(1) or Intel VTune.

@jjerphan

Thank you for exploring this, @Vincent-Maladiere.

This allowed us to understand the tradeoff between a flexible design and the performance impact of method dispatch, making us converge on #25044.

I am closing this PR for now; we could probably reopen it if dispatching methods becomes less costly in the future.
