ENH Private Cython Submodule for Reduction over Pairwise Distances #20254

jjerphan · 2021-06-11T17:07:05Z

⚠ This PR has been superseded by #21462

What does this implement/fix? Explain your changes.

Aggregations over pairwise distances (argmin, argkmin, threshold filtering, cumulative sum, etc.) are one of the foundations used by many algorithms within scikit-learn.

Various strategies exist to compute them, including some that work on chunks (see sklearn/metrics/pairwise.py). However, those are still mainly implemented at the Python level, which leaves some room for manœuvre, especially for optimizations that can be made at a lower level.

This PR explores new implementations for such aggregations using Cython to have a finer control over computations (similarly to what has been explored and implemented for K-means).

Importantly, the core of the computation is chunked so that each chunk of the pairwise distance matrix stays in the CPU cache until the final reduction step has been applied. This alone can make the computation several times faster. Furthermore, the chunking strategy is also used to leverage OpenMP-based parallelism (using Cython's prange loops), which gives another multiplicative speed-up in favorable cases on many-core machines.
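
The chunk-then-reduce idea can be sketched in plain Python. This is only an illustration under assumptions: the names `sq_euclidean` and `chunked_argkmin`, the `chunk_size` default and the use of `heapq` are made up for the example, and the implementation in this PR is Cython code parallelised with prange, not this loop.

```python
import heapq


def sq_euclidean(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def chunked_argkmin(X, Y, k, chunk_size=2):
    """For each row of X, return the indices of its k nearest rows of Y.

    Distances are computed one (X chunk, Y chunk) block at a time and each
    block is reduced right away, so the full len(X) x len(Y) distance
    matrix is never materialised.
    """
    # One bounded heap per row of X. Items are (-distance, -index), so the
    # worst candidate kept so far always sits at heap[0].
    heaps = [[] for _ in X]
    for x_start in range(0, len(X), chunk_size):
        x_end = min(x_start + chunk_size, len(X))
        for y_start in range(0, len(Y), chunk_size):
            y_chunk = Y[y_start:y_start + chunk_size]
            for i in range(x_start, x_end):
                for offset, y in enumerate(y_chunk):
                    item = (-sq_euclidean(X[i], y), -(y_start + offset))
                    if len(heaps[i]) < k:
                        heapq.heappush(heaps[i], item)
                    elif item > heaps[i][0]:
                        # Closer than the current worst candidate: swap it in.
                        heapq.heapreplace(heaps[i], item)
    # Order each row by increasing distance and keep the indices only.
    return [[-neg_j for _, neg_j in sorted(h, reverse=True)] for h in heaps]
```

The result does not depend on `chunk_size`; only the size of the intermediate block does, which is what makes it a tuning knob for cache residency.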

Interfaces benefiting from this PR

Core interfaces:

sklearn.neighbors.KNeighborsMixin.kneighbors
sklearn.neighbors.RadiusNeighborsMixin.radius_neighbors
sklearn.metrics.pairwise_distances_argmin
sklearn.metrics.pairwise_distances_argmin_min

User facing interfaces (might not be exhaustive):

sklearn.cluster.Birch
sklearn.cluster.DBSCAN
sklearn.cluster.MeanShift
sklearn.cluster.OPTICS
sklearn.cluster.SpectralClustering
sklearn.feature_selection.mutual_info_regression
sklearn.cluster.AffinityPropagation
sklearn.neighbors.KNeighborsClassifier
sklearn.neighbors.KNeighborsRegressor
sklearn.neighbors.LocalOutlierFactor
sklearn.neighbors.NearestNeighbors
sklearn.manifold.Isomap
sklearn.manifold.LocallyLinearEmbedding
sklearn.manifold.TSNE
sklearn.manifold.trustworthiness
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading

Last Benchmarks

Benchmarks are made via https://github.com/jjerphan/scikit-learn/tree/benchmarks/pairwise_aggregation_cython.

For kneighbors (1.2× to 20× speed-up): #20254 (comment)

fast_sqeuclidean metric

Ran with:

asv continuous -b BruteForceNearestNeighborsBenchmark -e -q  main benchmarks/pairwise_aggregation_cython

       before           after         ratio                                                                                                                                                        
     [df20e815]       [3adec08f]                                                                                                                                                                   
     <main>           <benchmarks/pairwise_aggregation_cython>                                          n_train, n_test, n_features, (k, radius)
-         3.51±0s          2.99±0s     0.85  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (1000, 1000))                                                       
-         1.73±0s          1.43±0s     0.83  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (100, 100))                                                   
-         1.72±0s          1.39±0s     0.81  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (1000, 1000))                                                 
-         1.77±0s          1.41±0s     0.80  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (10, 10))                                                     
-         1.12±0s          865±0ms     0.77  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (1, 1))                                                       
-         1.08±0s          760±0ms     0.70  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (100, 100))                                                    
-         1.13±0s          746±0ms     0.66  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (1, 1))                                                       
-         1.18±0s          776±0ms     0.66  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (10, 10))                                                     
-         1.05±0s          679±0ms     0.64  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (1000, 1000))                                                  
-         2.55±0s          1.14±0s     0.45  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (100, 100))                                                         
-         2.89±0s          1.22±0s     0.42  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (1000, 1000))                                                       
-         2.32±0s          792±0ms     0.34  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (1000, 1000))                                                 
-         2.64±0s          881±0ms     0.33  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (1, 1))                                                             
-         2.09±0s          666±0ms     0.32  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (10, 10))                                                      
-         2.30±0s          720±0ms     0.31  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (1, 1))                                                        
-         2.51±0s          763±0ms     0.30  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (100, 100))                                                   
-         4.37±0s          959±0ms     0.22  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (10, 10))                                                           
-         1.42±0s          226±0ms     0.16  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (1, 1))                                                             
-         6.31±0s          974±0ms     0.15  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (1000, 1000))                                                        
-         1.93±0s          226±0ms     0.12  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (10, 10))                                                           
-         2.23±0s          225±0ms     0.10  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (100, 100))                                                          
-         3.98±0s          327±0ms     0.08  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (100, 100))                                                         
-         2.93±0s          125±0ms     0.04  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (10, 10))                                                            
-         3.41±0s          141±0ms     0.04  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (1, 1))                                                              
                                                                                                                                                                                                   
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.  
PERFORMANCE INCREASED.
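
To read the table: the ratio column is the "after" timing divided by the "before" timing, so the speed-up factor is its inverse. A small sketch using the first and last rows above (the helper name `speed_up` is made up for the example):

```python
def speed_up(before_s, after_s):
    """Speed-up factor: how many times faster 'after' is than 'before'."""
    return before_s / after_s


# (before, after) timings in seconds, copied from the first and last rows above.
rows = {
    "time_kneighbors(10000, 10000, 500, (1000, 1000))": (3.51, 2.99),
    "time_kneighbors(10000, 10000, 50, (1, 1))": (3.41, 0.141),
}

for name, (before_s, after_s) in rows.items():
    ratio = after_s / before_s
    print(f"{name}: ratio={ratio:.2f}, speed-up={speed_up(before_s, after_s):.1f}x")
```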

Standard metric (here manhattan)

BruteForceNearestNeighborsBenchmark has been modified on top of 3adec08 to use metric='manhattan' and run with:

asv continuous -b BruteForceNearestNeighborsBenchmark -e -q  main benchmarks/pairwise_aggregation_cython

       before           after         ratio
     [df20e815]       [3adec08f]
     <main>           <benchmarks/pairwise_aggregation_cython>
-         2.19±0s          1.78±0s     0.81  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (10, 10))
-         3.35±0s          2.37±0s     0.71  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (100, 100))
-         2.45±0s          1.62±0s     0.66  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (1, 1))
-         2.94±0s          1.88±0s     0.64  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (1000, 1000))
-         6.97±0s          3.97±0s     0.57  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (1, 1))
-         8.15±0s          3.85±0s     0.47  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (10, 10))
-         5.92±0s          2.78±0s     0.47  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (100, 100))
-         7.46±0s          3.39±0s     0.45  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (1000, 1000))
-         2.62±0s          1.15±0s     0.44  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (1, 1))
-         4.21±0s          1.80±0s     0.43  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (100, 100))
-         6.16±0s          2.30±0s     0.37  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (10, 10))
-         7.97±0s          2.84±0s     0.36  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (1000, 1000))
-         45.1±0s          15.7±0s     0.35  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (1000, 1000))
-         43.6±0s          14.9±0s     0.34  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (100, 100))
-         41.4±0s          13.8±0s     0.33  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (10, 10))
-         43.4±0s          13.7±0s     0.32  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (1, 1))
-         8.19±0s          2.54±0s     0.31  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (1, 1))
-         8.39±0s          2.55±0s     0.30  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (100, 100))
-         44.4±0s          13.4±0s     0.30  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (1, 1))
-         42.7±0s          12.5±0s     0.29  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (10, 10))
-         45.5±0s          13.2±0s     0.29  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (100, 100))
-         11.2±0s          3.21±0s     0.29  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (1000, 1000))
-         44.3±0s          11.7±0s     0.26  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (1000, 1000))
-         8.88±0s          2.28±0s     0.26  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (10, 10))

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Any other comments?

The initial experiments have been made in scikit-learn-inria-fondation/pdist_aggregation.

Subsequent work includes:

- the adaptation for sparse datasets by @mbatoul, which is still WIP in "[WIP] ENH Implement sparse support" (jjerphan/scikit-learn#4)
- the adaptation of algorithm internals to use the optimized interfaces, such as:
  - cluster._optics.compute_core_distances
  - setting return_distance=False where applicable on sklearn.neighbors.KNeighborsMixin.kneighbors and sklearn.neighbors.RadiusNeighborsMixin.radius_neighbors calls
  - specializing ArgKmin(k=1) under ArgMin
  - defining PairwiseDistancesReduction for KMeans and adapting its implementation accordingly
  - computing the row norms of X_train once and attaching them to KNeighborsMixins
- the comparison of OpenMP scheduling strategies (especially 'static' vs 'guided')

Main changes:

Move the code of neighbors.NeighborsHeap and _typedefs under sklearn.utils, as cyclic imports currently happen between sklearn.neighbors and sklearn.metrics. Also, using integral in some cases gave unexpected results. Occurrences were changed to use np.intp, as exposed by utils._typedefs.ITYPE_t (we don't need signed integer).

jjerphan · 2021-06-23T16:18:56Z

(Small note: force-pushing allows me to keep a somewhat clean history during the adaptation, prior to a first review.)

So that the range corresponds to actual datasets and not to datasets whose marginal spreads are in [0, 1].

This test segfaults: test_neighbors.py::test_fast_sqeuclidean_correctness[1-10-5-1000]

The segfault was due to reallocation on the same pointers, causing multiple frees of the same reference and memory leaks. To resolve this, arrays of pointers for local data structures are allocated at the initialisation of the interface so that they can be handled separately in threads with proper allocation and deallocation. The memory management will be wrapped in subsequent private template methods for each type of reduction and parallelisation strategy. This is one of the next iterations.
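
The ownership rule described above (allocate every thread's private buffers once, up front, and let each thread touch only its own) can be illustrated with a rough Python analogue. Everything here is an assumption made for the example: the function name `argkmin_with_thread_local_buffers`, the round-robin split of Y and the use of `ThreadPoolExecutor` stand in for the Cython/OpenMP code, where the buffers are raw pointers.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def argkmin_with_thread_local_buffers(X, Y, k, n_threads=2):
    """k nearest rows of Y for each row of X, using per-thread buffers."""
    # Allocated once, before any thread starts: local_heaps[t][i] is owned
    # by thread t only, so no buffer is ever reallocated or freed twice.
    local_heaps = [[[] for _ in X] for _ in range(n_threads)]

    def work(t):
        mine = local_heaps[t]
        for j in range(t, len(Y), n_threads):  # rows of Y assigned to thread t
            for i, x in enumerate(X):
                d = sum((a - b) ** 2 for a, b in zip(x, Y[j]))
                mine[i] = heapq.nsmallest(k, mine[i] + [(d, j)])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(work, range(n_threads)))  # list() re-raises worker errors

    # Single-threaded merge of the per-thread candidates.
    merged = []
    for i in range(len(X)):
        candidates = [c for t in range(n_threads) for c in local_heaps[t][i]]
        merged.append([j for _, j in heapq.nsmallest(k, candidates)])
    return merged
```

The merge step is the price of this design: results are only combined after all workers have finished, which keeps the parallel section free of shared writes.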

Refactor ArgKmin._reduce_on_chunks to pave the way for a general interface for reductions. Private data structures will have to be accessed via the implementation of this private method.

Introduce ParallelReduction as an abstract class, and extend it using ArgKmin. FastSquaredEuclideanArgKmin extends ArgKmin for the "fast_sqeuclidean" strategy.
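
The shape of that hierarchy might look like the following Python sketch. The three class names and `_reduce_on_chunks` come from the commit messages; `compute`, `_finalize`, `_dist` and all signatures are hypothetical, and the real classes are Cython extension types.

```python
import heapq
from abc import ABC, abstractmethod


class ParallelReduction(ABC):
    """Owns the chunked traversal; subclasses decide how a chunk is reduced."""

    def __init__(self, X, Y, chunk_size=2):
        self.X, self.Y, self.chunk_size = X, Y, chunk_size

    def compute(self):  # hypothetical entry point
        step = self.chunk_size
        for x_start in range(0, len(self.X), step):
            x_end = min(x_start + step, len(self.X))
            for y_start in range(0, len(self.Y), step):
                y_end = min(y_start + step, len(self.Y))
                self._reduce_on_chunks(x_start, x_end, y_start, y_end)
        return self._finalize()

    @abstractmethod
    def _reduce_on_chunks(self, x_start, x_end, y_start, y_end):
        """Fold one block of pairwise distances into the private state."""

    @abstractmethod
    def _finalize(self):  # hypothetical hook
        """Turn the private state into the public result."""


class ArgKmin(ParallelReduction):
    """Indices of the k nearest rows of Y for each row of X."""

    def __init__(self, X, Y, k, chunk_size=2):
        super().__init__(X, Y, chunk_size)
        self.k = k
        self._candidates = [[] for _ in X]  # private (distance, index) lists

    def _dist(self, i, j):  # hypothetical hook: generic metric
        return sum((a - b) ** 2 for a, b in zip(self.X[i], self.Y[j]))

    def _reduce_on_chunks(self, x_start, x_end, y_start, y_end):
        for i in range(x_start, x_end):
            block = [(self._dist(i, j), j) for j in range(y_start, y_end)]
            self._candidates[i] = heapq.nsmallest(self.k, self._candidates[i] + block)

    def _finalize(self):
        return [[j for _, j in row] for row in self._candidates]


class FastSquaredEuclideanArgKmin(ArgKmin):
    """Same reduction, with distances from ||x||^2 - 2 x.y + ||y||^2."""

    def __init__(self, X, Y, k, chunk_size=2):
        super().__init__(X, Y, k, chunk_size)
        self._X_sq = [sum(a * a for a in x) for x in X]  # computed once
        self._Y_sq = [sum(b * b for b in y) for y in Y]

    def _dist(self, i, j):
        dot = sum(a * b for a, b in zip(self.X[i], self.Y[j]))
        return self._X_sq[i] - 2.0 * dot + self._Y_sq[j]
```

Both concrete classes share the traversal and the reduction, and differ only in how a distance is obtained, which is the flexibility the abstract base is meant to provide.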

The associated _typedefs.pyx file has been moved to utils to avoid circular dependencies, as it is being used in neighbors.

jeremiedbb

the rest of the comments. I really like test_pairwise_distances_reduction

sklearn/utils/__init__.py

sklearn/utils/_heap.pyx

sklearn/utils/_testing.py

sklearn/neighbors/tests/test_neighbors.py

sklearn/metrics/tests/test_pairwise_distances_reduction.py

Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>

jjerphan · 2021-10-18T16:37:44Z

To re-answer @thomasjpfan's question (still present in a thread but hardly accessible because of new activity and impossibility to access it via a link):

Does self.distance_metric.rdist use a v-table look up? (I am curious. This may not be actionable)

Yes, it does. I do not think we can have design flexibility without v-tables.

V-table implementation details

a struct is defined for the base class (here DistanceMetric):

struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_DistanceMetric {
  __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*dist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_ITYPE_t);
  __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*rdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_ITYPE_t);
  __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*csr_dist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice);
  __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*csr_rdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice);
  int (*pdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __Pyx_memviewslice, __Pyx_memviewslice);
  int (*cdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice);
  __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*_rdist_to_dist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t);
  __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*_dist_to_rdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t);
};

a struct is created for each subclass (for instance here EuclideanDistance) and wraps the original struct as __pyx_base.

struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_EuclideanDistance {
  struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_DistanceMetric __pyx_base;
};
static struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_EuclideanDistance *__pyx_vtabptr_7sklearn_7metrics_13_dist_metrics_EuclideanDistance;

subclasses' methods are converted to functions (not shown below), __pyx_base gets set to the base class table and the subclasses' methods get bound to the table:

  __pyx_vtabptr_7sklearn_7metrics_13_dist_metrics_EuclideanDistance = &__pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance;
  __pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base = *__pyx_vtabptr_7sklearn_7metrics_13_dist_metrics_DistanceMetric;
  __pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base.dist = (__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_ITYPE_t))__pyx_f_7sklearn_7metrics_13_dist_metrics_17EuclideanDistance_dist;
  __pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base.rdist = (__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_ITYPE_t))__pyx_f_7sklearn_7metrics_13_dist_metrics_17EuclideanDistance_rdist;
  __pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base._rdist_to_dist = (__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t))__pyx_f_7sklearn_7metrics_13_dist_metrics_17EuclideanDistance__rdist_to_dist;
  __pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base._dist_to_rdist = (__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t))__pyx_f_7sklearn_7metrics_13_dist_metrics_17EuclideanDistance__dist_to_rdist;

when a subclass object is created, its __pyx_base v-tab is set to the one of its class.

static PyObject *__pyx_tp_new_7sklearn_7metrics_13_dist_metrics_EuclideanDistance(PyTypeObject *t, PyObject *a, PyObject *k) {
  struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_EuclideanDistance *p;
  #if CYTHON_COMPILING_IN_LIMITED_API
  newfunc new_func = (newfunc)PyType_GetSlot(__pyx_ptype_7sklearn_7metrics_13_dist_metrics_DistanceMetric, Py_tp_new);
  PyObject *o = new_func(t, a, k);
  #else
  PyObject *o = __pyx_tp_new_7sklearn_7metrics_13_dist_metrics_DistanceMetric(t, a, k);
  #endif
  if (unlikely(!o)) return 0;
  p = ((struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_EuclideanDistance *)o);
  p->__pyx_base.__pyx_vtab = (struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_DistanceMetric*)__pyx_vtabptr_7sklearn_7metrics_13_dist_metrics_EuclideanDistance;
  return o;
}

on call sites, the dispatch is done via look-ups, e.g. on the code you commented:

  /* "sklearn/metrics/_dist_metrics.pyx":1334
 *     @final
 *     cdef DTYPE_t proxy_dist(self, ITYPE_t i, ITYPE_t j) nogil:
 *         return self.distance_metric.rdist(&self.X[i, 0],             # <<<<<<<<<<<<<<
 *                                           &self.Y[j, 0],
 *                                           self.d)
 */
  __pyx_t_5 = ((struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_DistanceMetric *)__pyx_v_self->__pyx_base.distance_metric->__pyx_vtab)->rdist(__pyx_v_self->__pyx_base.distance_metric, (&(*((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const  *) ( /* dim=1 */ ((char *) (((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const  *) ( /* dim=0 */ (__pyx_v_self->X.data + __pyx_t_1 * __pyx_v_self->X.strides[0]) )) + __pyx_t_2)) )))), (&(*((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const  *) ( /* dim=1 */ ((char *) (((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const  *) ( /* dim=0 */ (__pyx_v_self->Y.data + __pyx_t_3 * __pyx_v_self->Y.strides[0]) )) + __pyx_t_4)) )))), __pyx_v_self->d); if (unlikely(__pyx_t_5 == ((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t)-1.0))) __PYX_ERR(1, 1334, __pyx_L1_error)
  __pyx_r = __pyx_t_5;
  goto __pyx_L0;
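
The same four steps can be mimicked in a few lines of Python, as a toy model only. The dictionaries, the `Obj` class and every function name below are invented for the illustration; Cython emits C structs of function pointers, and Python's own method dispatch works differently.

```python
import math


# 1. Base-class table: one slot per cdef method.
def distance_metric_rdist(self, x, y):
    raise NotImplementedError("abstract slot")


def distance_metric_rdist_to_dist(self, rdist):
    return rdist


DISTANCE_METRIC_VTABLE = {
    "rdist": distance_metric_rdist,
    "_rdist_to_dist": distance_metric_rdist_to_dist,
}


# 2. Subclass methods become plain functions...
def euclidean_rdist(self, x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean_rdist_to_dist(self, rdist):
    return math.sqrt(rdist)


# ...and the subclass table starts as a copy of the base table whose
# slots are then overwritten with the subclass functions.
EUCLIDEAN_VTABLE = dict(DISTANCE_METRIC_VTABLE)
EUCLIDEAN_VTABLE["rdist"] = euclidean_rdist
EUCLIDEAN_VTABLE["_rdist_to_dist"] = euclidean_rdist_to_dist


class Obj:
    """Stand-in for the object struct: it only carries a table pointer."""

    def __init__(self, vtab):
        self.vtab = vtab


# 3. Object creation sets the table pointer to the one of its class.
def new_euclidean_distance():
    return Obj(EUCLIDEAN_VTABLE)


# 4. Call site: an indirect call through the object's table.
def proxy_dist(metric, x, y):
    return metric.vtab["rdist"](metric, x, y)
```

Because `proxy_dist` only knows the table at run time, the callee cannot be resolved (or inlined) where the call is written, which is the cost being discussed here.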

I do not know much about C compilers, but do static-qualified definitions help with optimizations?

thomasjpfan

I am not done yet, there are so many things to think about with this PR.

Do you think we can split the PairwiseDistancesRadiusNeighborhood out into a follow up PR? This way we can focus on PairwiseDistancesReduction and reduce the diff in this PR.

sklearn/cluster/_birch.py

sklearn/metrics/_pairwise_distances_reduction.pyx

thomasjpfan · 2021-10-18T21:52:49Z

sklearn/metrics/_dist_metrics.pyx

+
+    @final
+    cdef DTYPE_t proxy_dist(self, ITYPE_t i, ITYPE_t j) nogil:
+        return self.distance_metric.rdist(&self.X[i, 0],


It does. My concern is that _compute_and_reduce_distances_on_chunks is going to call into .surrogate_dist in a nested loop where each

I do not think the static definition helps to optimize this away, although I would need to look into it with https://compiler-explorer.com/.

Anyway, I do not think this is actionable for now. In a follow-up PR, it would be interesting to investigate whether hard-coding an inline function and avoiding the vtable lookup changes our benchmarks.

sklearn/metrics/_pairwise_distances_reduction.pyx

jjerphan · 2021-10-19T06:34:26Z

Answering @thomasjpfan's question:

Do you think we can split the PairwiseDistancesRadiusNeighborhood out into a follow up PR? This way we can focus on PairwiseDistancesReduction and reduce the diff in this PR.

Yes, I also would have preferred this PR to have a smaller granularity.

We can extract the patches to solely introduce PairwiseDistancesArgKmin and the other necessary abstractions (DatasetsPair). WDYT?

--

PS: @rth and I were/are in favor of splitting this PR into atomic ones, @ogrisel prefers to keep it as is.

jjerphan · 2021-10-20T13:31:44Z

A simple setup with perf(1) with:

X.shape=(100000, 100)
Y.shape=(10000, 100)

shows that most cycles are spent in GEMM kernels for FastEuclideanPairwiseDistancesArgKmin.

When PRESCOTT kernels are used

-  100.00%        python
   -   87.78%        libopenblasp-r0-085ca80a.3.9.so
          84.30%        [.] dgemm_kernel_PRESCOTT
           2.49%        [.] dgemm_oncopy_PRESCOTT
           0.74%        [.] dgemm_beta_PRESCOTT
           0.18%        [.] blas_thread_server
           0.02%        [.] dgemm_tn
           0.01%        [.] ddot_k_PRESCOTT
           0.01%        [.] sched_yield@plt
           0.01%        [.] pthread_mutex_lock@plt
           0.01%        [.] blas_memory_free
           0.01%        [.] pthread_mutex_unlock@plt
           0.00%        [.] dgemm_
           0.00%        [.] blas_memory_alloc
           0.00%        [.] ddot_
   -    5.91%        _pairwise_distances_reduction.cpython-39-x86_64-linux-gnu.so
           5.91%        [.] __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_37FastEuclideanPairwiseDistancesArgKmin__compute_and_reduce_distances_on_chunks
           0.00%        [.] __pyx_memoryview_slice_memviewslice
           0.00%        [.] __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_24PairwiseDistancesArgKmin__parallel_on_X_prange_iter_finalize
           0.00%        [.] __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_26PairwiseDistancesReduction__parallel_on_X
   -    3.23%        _heap.cpython-39-x86_64-linux-gnu.so
           3.19%        [.] __pyx_fuse_1__pyx_f_7sklearn_5utils_5_heap_heap_push
           0.04%        [.] __pyx_fuse_1__pyx_f_7sklearn_5utils_5_heap_simultaneous_sort
   +    1.12%        python3.9
   -    0.94%        libgomp.so.1.0.0
           0.94%        [.] do_wait
           0.00%        [.] gomp_thread_start
           0.00%        [.] gomp_team_barrier_wait_end
   +    0.29%        [unknown]
   +    0.28%        libpthread-2.33.so
   +    0.23%        libc-2.33.so
   +    0.17%        libopenblasp-r0-5bebc122.3.13.dev.so
   +    0.04%        ld-2.33.so
   +    0.00%        cython_blas.cpython-39-x86_64-linux-gnu.so
   +    0.00%        _multiarray_umath.cpython-39-x86_64-linux-gnu.so
   +    0.00%        _cython_blas.cpython-39-x86_64-linux-gnu.so
   +    0.00%        _ctypes.cpython-39-x86_64-linux-gnu.so
   +    0.00%        _queue.cpython-39-x86_64-linux-gnu.so

When SKYLAKEX kernels are used

-  100.00%        python
   -   57.57%        libopenblasp-r0.3.17.so
          49.84%        [.] dgemm_kernel_SKYLAKEX
           4.44%        [.] dgemm_incopy_SKYLAKEX
           2.34%        [.] dgemm_oncopy_SKYLAKEX
           0.73%        [.] blas_thread_server
           0.12%        [.] dgemm_tn
           0.08%        [.] dot_compute
           0.01%        [.] dgemm_
           0.01%        [.] blas_memory_free
           0.01%        [.] blas_memory_alloc
           0.00%        [.] dgemm_beta_SKYLAKEX
   -   21.29%        _pairwise_distances_reduction.cpython-39-x86_64-linux-gnu.so
          21.28%        [.] __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_37FastEuclideanPairwiseDistancesArgKmin__compute_and_reduce_distances_on_chunks
           0.01%        [.] __pyx_memoryview_slice_memviewslice
           0.00%        [.] __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_24PairwiseDistancesArgKmin__parallel_on_X_prange_iter_finalize
   -   11.71%        _heap.cpython-39-x86_64-linux-gnu.so
          11.58%        [.] __pyx_fuse_1__pyx_f_7sklearn_5utils_5_heap_heap_push
           0.12%        [.] __pyx_fuse_1__pyx_f_7sklearn_5utils_5_heap_simultaneous_sort
   -    3.43%        libgomp.so.1.0.0
           3.43%        [.] do_wait
   +    2.94%        python3.9
   +    2.29%        libc-2.33.so
   +    0.62%        [unknown]
   +    0.09%        ld-2.33.so
   +    0.02%        _multiarray_umath.cpython-39-x86_64-linux-gnu.so
   +    0.02%        libpthread-2.33.so
   +    0.01%        _dist_metrics.cpython-39-x86_64-linux-gnu.so
   +    0.01%        _reordering.cpython-39-x86_64-linux-gnu.so
   +    0.00%        _cython_blas.cpython-39-x86_64-linux-gnu.so
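
The weight of the dgemm kernels in both profiles is consistent with computing squared Euclidean distances through the expansion ||x - y||² = ||x||² - 2 x·y + ||y||², where the cross terms of a whole chunk form one matrix product. A minimal sketch, assuming that expansion; the function names are hypothetical and the pure-Python product stands in for the BLAS GEMM call:

```python
def matmul_transposed(X, Y):
    """Return X @ Y.T as nested lists (the part a BLAS GEMM would compute)."""
    return [[sum(a * b for a, b in zip(x, y)) for y in Y] for x in X]


def sq_euclidean_via_gemm(X, Y):
    """Pairwise squared Euclidean distances between rows of X and rows of Y."""
    X_sq = [sum(a * a for a in x) for x in X]  # ||x||^2, one value per row of X
    Y_sq = [sum(b * b for b in y) for y in Y]  # ||y||^2, one value per row of Y
    G = matmul_transposed(X, Y)                # all x.y cross terms at once
    return [
        [X_sq[i] - 2.0 * G[i][j] + Y_sq[j] for j in range(len(Y))]
        for i in range(len(X))
    ]
```

With this formulation almost all floating-point work sits in the matrix product, so whatever remains (the reduction loop and the heap pushes) takes a larger share of the cycles once the GEMM kernel itself gets faster.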

Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>

ogrisel · 2021-10-20T16:33:18Z

Interesting, on a machine with AVX512 and the proper version of OpenBLAS, the function __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_37FastEuclideanPairwiseDistancesArgKmin__compute_and_reduce_distances_on_chunks starts to be significant (more than 20% of the time) which means that optimizing it (e.g. maybe loop unrolling, vectorization...) could start to make sense at some point. But not as part of this PR ;)

ogrisel · 2021-10-20T16:34:01Z

PS: @rth and I were/are in favor of splitting this PR into atomic ones, @ogrisel prefers to keep it as is.

I am fine with moving the radius neighbors code in a follow-up PR on top of this ArgKMin + infra PR.

sklearn/metrics/_pairwise_distances_reduction.pyx

Also reorder instructions to have X's before Y's. Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

So as to make scikit-learn#20254 smaller. The removed hunks will be reintroduced in a subsequent PR.

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Taken and adapted from the description of scikit-learn#20254 written by Olivier. Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

jjerphan · 2021-10-26T08:16:59Z

This PR has been superseded by #21462.

lorentzenchr · 2022-02-17T09:52:59Z

#22134 is merged.

github-actions bot added module:metrics cython labels Jun 11, 2021

ogrisel added the Performance label Jun 11, 2021

ogrisel reviewed Jun 14, 2021

sklearn/neighbors/tests/test_neighbors.py (outdated)

jjerphan force-pushed the pairwise_aggregation_cython branch from 4d7500f to 7aa2c3c Compare June 14, 2021 13:09

jjerphan mentioned this pull request Jun 16, 2021

KNN-Search is not translation invariant #20278

Closed

jjerphan force-pushed the pairwise_aggregation_cython branch from ce73f0c to cb85791 Compare June 23, 2021 16:16

jjerphan added No Changelog Needed module:neighbors module:utils labels Jun 23, 2021

jjerphan added 6 commits June 23, 2021 18:24

Reintroduce deleted test_neighbors_heap

5abda94

Lint

bc8925e

Minify utils._heap definition file

e2bb562

Merge branch 'main' into pairwise_aggregation_cython

9e9065d

Post-merge black code formatting

ac76852

Spread datasets for the tests of the fast_sqeuclidean strategy

cac7313

jjerphan changed the title ~~EHN Optimise aggregations over pairwise distances~~ [WIP] ENH Optimise aggregations over pairwise distances Jun 24, 2021

jjerphan mentioned this pull request Jun 24, 2021

ParallelReduction class hierarchy (POC for scikit-learn#20254) jjerphan/scikit-learn#2

Merged

jjerphan added 10 commits June 30, 2021 09:34

Rectify test

8a06c3f

[WIP] Adapting to use class hierarchy

41bd644

[WIP] Adapting to use class hierarchy

568ed2a

[WIP] Adapting to use class hierarchy

2bf34aa

[WIP] Adapting to use class hierarchy

c1415d6

This test segfaults: test_neighbors.py::test_fast_sqeuclidean_correctness[1-10-5-1000]

[WIP] Adapting to use class hierarchy

49e247d

fixup! [WIP] Adapting to use class hierarchy

25c9a2c

[WIP] Adapting to use class hierarchy

eb8b931

Move neighbors.DistanceMetric to metrics

e0d1c99

jeremiedbb reviewed Oct 18, 2021

Address review comments

7fa4a40

Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>

thomasjpfan reviewed Oct 18, 2021

ogrisel mentioned this pull request Oct 20, 2021

Performance of the RBF kernel in epsilon-SVR (SVM) #21312

Open

jjerphan and others added 3 commits October 20, 2021 16:02

Address review comments

4a89d7f

Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>

Fix config for 'pairwise_dist_chunk_size'

8f63e01

Delay and better scope arrays C ordering

2ad33ec

Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>

thomasjpfan reviewed Oct 22, 2021

sklearn/metrics/_pairwise_distances_reduction.pyx (outdated)

Simplify counting for remainder chunks

eba6f03

Also reorder instructions to have X's before Y's. Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
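As a rough illustration of what counting remainder chunks involves, the sketch below splits `n_samples` rows into full chunks plus at most one shorter trailing chunk. The helper name `chunk_bounds` and its signature are hypothetical and are not taken from the commit.

```python
def chunk_bounds(n_samples, chunk_size):
    """Return (start, end) row bounds for each chunk, the last possibly shorter."""
    if n_samples < 0 or chunk_size <= 0:
        raise ValueError("n_samples must be >= 0 and chunk_size must be > 0")
    n_full, remainder = divmod(n_samples, chunk_size)
    # One extra chunk if and only if some rows are left over.
    n_chunks = n_full + (remainder != 0)
    return [
        (i * chunk_size, min((i + 1) * chunk_size, n_samples))
        for i in range(n_chunks)
    ]
```

With 10 rows and a chunk size of 4, this yields two full chunks and one remainder chunk of 2 rows.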

jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Oct 25, 2021

Remove PairwiseDistancesRadiusNeighborhood

a0def25

Done to make scikit-learn#20254 smaller. The removed hunks will be re-introduced in a subsequent PR.

jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Oct 25, 2021

Remove DatasetsPair used for sparse datasets

ecf5a9e

Done to make scikit-learn#20254 smaller. The removed hunks will be re-introduced in a subsequent PR.

jjerphan and others added 3 commits October 25, 2021 14:50

Better motivate heaps' parallel allocation

34468ad

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Remove PairwiseDistancesRadiusNeighborhood

fa424a4

Done to make scikit-learn#20254 smaller. The removed hunks will be re-introduced in a subsequent PR.

Remove DatasetsPair used for sparse datasets

5678666

Done to make scikit-learn#20254 smaller. The removed hunks will be re-introduced in a subsequent PR.

jjerphan changed the title ~~ENH Private Cython submodule for aggregations over pairwise distances~~ ENH Private Cython Submodule for Reduction over Pairwise Distances Oct 25, 2021

jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Oct 26, 2021

Add some general notes about the implementations

45c7f6e

Taken and adapted from the description of scikit-learn#20254 written by Olivier. Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

jjerphan mentioned this pull request Oct 26, 2021

ENH Pairwise Distances ArgKmin #21462

Closed

jjerphan added the Superseded (PR has been replaced by a newer PR) label and removed the Waiting for Reviewer label Oct 26, 2021

jjerphan mentioned this pull request Dec 2, 2021

[ENH] std::vector to np.ndarray coercion cython/cython#4487

Open

jjerphan mentioned this pull request Dec 23, 2021

ENH Rework PairwiseDistancesArgKmin to use a class method .compute jjerphan/scikit-learn#5

Merged

lorentzenchr closed this Feb 17, 2022

jjerphan deleted the pairwise_aggregation_cython branch October 21, 2022 14:04

