-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH Private Cython Submodule for Reduction over Pairwise Distances #20254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Private Cython Submodule for Reduction over Pairwise Distances #20254
Conversation
4d7500f
to
7aa2c3c
Compare
Move neighbors.NeighborsHeap's code and _typedefs under sklearn.utils as cyclic imports are currently happening between sklearn.neighbors and sklearn.metrics. Also, using integral in some cases gave unexpected results. Occurences were changed to use np.int_p, as exposed by utils._typedefs.ITYPE_t (we don't need signed integer)
ce73f0c
to
cb85791
Compare
(Small note: force-pushing allow me here to have a somewhat clean history during adaptation prior to a first review). |
So that the range correspond to actual datasets and not to datasets whose marginal spreads are in [0, 1].
This test segfaults: test_neighbors.py::test_fast_sqeuclidean_correctness[1-10-5-1000]
The segfaults was due to reallocation of on the same pointers, causing multiple freeing on the same reference and memory leaks. To resolve this, arrays of pointers for local datastructures are allocated at the initialisation of the interface so that they can be handled separately in threads with proper allocation and deallocation. The memory management will be wrapped in subsequent private template method for each types of reduction and parallelisation strategy. This is one of the next iteration.
Refactor ArgKmin._reduce_on_chunks to pave the way to general interface for reductions. Private datastructures will have to be accessed via the implementation of this private method.
Introduce ParallelReduction as a abstract class, and extend it using ArgKmin. FastSquaredEuclideanArgKmin extends ArgKmin for the "fast_sqeuclidean" strategy.
The associated _typedefs.pyx file has been moved to utils to avoid circular dependencies has it is being used in neighbors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the rest of the comments. I really like test_pairwise_distances_reduction
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
To re-answer @thomasjpfan's question (still present in a thread but hardly accessible because of new activity and impossibility to access it via a link):
Yes, it does. I do not think we can both have design flexibility without v-tables. V-table implementation details
struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_DistanceMetric {
__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*dist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_ITYPE_t);
__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*rdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_ITYPE_t);
__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*csr_dist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice);
__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*csr_rdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice);
int (*pdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __Pyx_memviewslice, __Pyx_memviewslice);
int (*cdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice);
__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*_rdist_to_dist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t);
__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*_dist_to_rdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t);
};
struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_EuclideanDistance {
struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_DistanceMetric __pyx_base;
};
static struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_EuclideanDistance *__pyx_vtabptr_7sklearn_7metrics_13_dist_metrics_EuclideanDistance;
__pyx_vtabptr_7sklearn_7metrics_13_dist_metrics_EuclideanDistance = &__pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance;
__pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base = *__pyx_vtabptr_7sklearn_7metrics_13_dist_metrics_DistanceMetric;
__pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base.dist = (__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_ITYPE_t))__pyx_f_7sklearn_7metrics_13_dist_metrics_17EuclideanDistance_dist;
__pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base.rdist = (__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_ITYPE_t))__pyx_f_7sklearn_7metrics_13_dist_metrics_17EuclideanDistance_rdist;
__pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base._rdist_to_dist = (__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t))__pyx_f_7sklearn_7metrics_13_dist_metrics_17EuclideanDistance__rdist_to_dist;
__pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base._dist_to_rdist = (__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t))__pyx_f_7sklearn_7metrics_13_dist_metrics_17EuclideanDistance__dist_to_rdist;
static PyObject *__pyx_tp_new_7sklearn_7metrics_13_dist_metrics_EuclideanDistance(PyTypeObject *t, PyObject *a, PyObject *k) {
struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_EuclideanDistance *p;
#if CYTHON_COMPILING_IN_LIMITED_API
newfunc new_func = (newfunc)PyType_GetSlot(__pyx_ptype_7sklearn_7metrics_13_dist_metrics_DistanceMetric, Py_tp_new);
PyObject *o = new_func(t, a, k);
#else
PyObject *o = __pyx_tp_new_7sklearn_7metrics_13_dist_metrics_DistanceMetric(t, a, k);
#endif
if (unlikely(!o)) return 0;
p = ((struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_EuclideanDistance *)o);
p->__pyx_base.__pyx_vtab = (struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_DistanceMetric*)__pyx_vtabptr_7sklearn_7metrics_13_dist_metrics_EuclideanDistance;
return o;
}
/* "sklearn/metrics/_dist_metrics.pyx":1334
* @final
* cdef DTYPE_t proxy_dist(self, ITYPE_t i, ITYPE_t j) nogil:
* return self.distance_metric.rdist(&self.X[i, 0], # <<<<<<<<<<<<<<
* &self.Y[j, 0],
* self.d)
*/
__pyx_t_5 = ((struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_DistanceMetric *)__pyx_v_self->__pyx_base.distance_metric->__pyx_vtab)->rdist(__pyx_v_self->__pyx_base.distance_metric, (&(*((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *) ( /* dim=1 */ ((char *) (((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *) ( /* dim=0 */ (__pyx_v_self->X.data + __pyx_t_1 * __pyx_v_self->X.strides[0]) )) + __pyx_t_2)) )))), (&(*((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *) ( /* dim=1 */ ((char *) (((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *) ( /* dim=0 */ (__pyx_v_self->Y.data + __pyx_t_3 * __pyx_v_self->Y.strides[0]) )) + __pyx_t_4)) )))), __pyx_v_self->d); if (unlikely(__pyx_t_5 == ((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t)-1.0))) __PYX_ERR(1, 1334, __pyx_L1_error)
__pyx_r = __pyx_t_5;
goto __pyx_L0; I do not know much about C compilers, but are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not done yet, there are so many things to think about with this PR.
Do you think we can split the PairwiseDistancesRadiusNeighborhood
out into a follow up PR? This way we can focus on PairwiseDistancesReduction
and reduce the diff in this PR.
|
||
@final | ||
cdef DTYPE_t proxy_dist(self, ITYPE_t i, ITYPE_t j) nogil: | ||
return self.distance_metric.rdist(&self.X[i, 0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does. My concern is that _compute_and_reduce_distances_on_chunks
is going to call into .surrogate_dist
in a nested loop where each
I do no not think the static definition helps to optimize this away, although I would need to look into it with https://compiler-explorer.com/.
Anyways, I do not think this is actionable for now. In a follow up PR, it would be interesting to investigate if hard coding a inline function and avoiding the vtable look changes our benchmarks.
Answering @thomasjpfan's question:
Yes, I also would have preferred this PR to have a smaller granularity. We can extract the patches to solely introduce -- PS: @rth and I were/are in favor in splitting this PR is atomic ones, @ogrisel prefers to keep it as is. |
A simple setup with
shows that most of cycles are spent in GEMM kernels for When PRESCOTT kernels are used
When SKYLAKEX kernels are used
|
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
Interesting, on a machine with AVX512 and the proper version of OpenBLAS, the function |
Also reorder instructions to have X's before Y's. Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
As to make scikit-learn#20254 smaller. The removed hunks will be re-introduced in a subsequent PR.
As to make scikit-learn#20254 smaller. The removed hunks will be re-introduced in a subsequent PR.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
As to make scikit-learn#20254 smaller. The removed hunks will be re-introduced in a subsequent PR.
As to make scikit-learn#20254 smaller. The removed hunks will be re-introduced in a subsequent PR.
Taken and adapted from the description of scikit-learn#20254 written by Olivier. Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
This PR has been superseded by #21462. |
#22134 is merged. |
⚠ This PR has been superseded by #21462
What does this implement/fix? Explain your changes.
Aggregations over pairwise distances (argmin, argkmin, threshold filtering, cumulative sum, etc.) are one of the foundations used by many algorithms within scikit-learn.
Various strategies exist to compute them different, including some using chunks (see
sklearn/metrics/pairwise.py
). However, those are still mainly implemented at the level of Python leaving some rooms for manœuvre, especially for potential optimizations which can be made at a lower level.This PR explores new implementations for such aggregations using Cython to have a finer control over computations (similarly to what has been explored and implemented for K-means).
Importantly, the core of the computation is chunked to make sure that the pairwise distance chunk matrices stay in CPU cache before applying the final reduction step. This can lead to several x speed-ups alone. Furthermore, the chunking strategy is also used to leverage OpenMP-based paralellism (using Cython
prange
loop) which gives another multiplicative speed-up in favorable cases on many-core machines.Interfaces benefiting from this PR
Core interfaces:
sklearn.neighbors.KNeighborsMixin.kneighbors
sklearn.neighbors.RadiusNeighborsMixin.radius_neighbors
sklearn.metrics.pairwise_distances_argmin
sklearn.metrics.pairwise_distances_argmin_min
User facing interfaces (might not be exhaustive):
sklearn.cluster.Birch
sklearn.cluster.DBSCAN
sklearn.cluster.MeanShift
sklearn.cluster.OPTICS
sklearn.cluster.SpectralClustering
sklearn.feature_selection.mutual_info_regression
sklearn.neighbors.AffinityPropagation
sklearn.neighbors.KNeighborsClassifier
sklearn.neighbors.KNeighborsRegressor
sklearn.neighbors.LocalOutlierFactor
sklearn.neighbors.NearestNeighbors
sklearn.manifold.Isomap
sklearn.manifold.LocallyLinearEmbedding
sklearn.manifold.TSNE
sklearn.manifold.trustworthiness
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading
Last Benchmarks
Benchmarks are made via https://github.com/jjerphan/scikit-learn/tree/benchmarks/pairwise_aggregation_cython.
For
kneighbors
(1.2× to 20× speed-up): #20254 (comment)fast_sqeuclidean
metricRan with:
Standard metric (here
manhattan
)BruteForceNearestNeighborsBenchmark
as been modified on top of in 3adec08 to usemetric='manhattan'
and ran with:Any other comments?
The initial experiments have been in
scikit-learn-inria-fondation/pdist_aggregation
Subsequent work include:
return_distance=False
where applicable onsklearn.neighbors.KNeighborsMixin.kneighbors
andsklearn.neighbors.RadiusNeighborsMixin.radius_neighbors
calls.ArgKmin(k=1)
underArgMin
PairwiseDistancesReduction
forKMeans
and adapt its implementation accordinglyX_train
once and attach it toKNeighborsMixin
s'static'
vs'guided'
)Main changes:
NeighborsHeap
implementations in a private submodule and factor code fromneighbors._binary_tree.NeighborsHeap
as it is being reused hereDistanceMetrics
fromneighbors
tometrics
and unify typessklearn.neighbors.DistanceMetrics
because unfortunately it's part of the public API of scikit-learn.'fast_sqeuclidean'
strategy (the current one using the_gemm
trick) and document it appropriately (tl;dr fast, but unstable when far from the origin)pairwise_distances_argmin_min
RadiusNeighborsMixin._radius_neighbors_reduce_func
)metric="fast_sqeuclidean"
inKNeighborsClassifier
/Regressor
by default to unlock maximum prediction performance while give the users full control to use a more numerically stable alternative otherwise.DatasetsPair
)