Open
Labels: Meta-issue (general issue associated to an identified list of tasks), Performance, cython
Description
`PairwiseDistancesReductions` have been introduced as a hierarchy of Cython classes to implement back-ends of some scikit-learn algorithms.
Initial work was listed in #22587.
💡 See this presentation for a better understanding of the design of `PairwiseDistancesReductions`.
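To make the idea of a reduction over pairwise distances concrete, here is a minimal pure-Python sketch of a chunked arg-k-min. It is only an illustration of the concept, not the Cython implementation: the function name `argkmin_chunked` and its `chunk_size` parameter are hypothetical, and the Euclidean distance is assumed.

```python
import heapq
import math


def argkmin_chunked(X, Y, k, chunk_size=2):
    """For each row of X, return the indices of its k nearest rows of Y.

    Y is scanned chunk by chunk and only the k best candidates per row of X
    are kept, so the full len(X) x len(Y) distance matrix is never stored.
    """
    # One bounded heap per query row, holding (-distance, -index) pairs so
    # that the root of the min-heap is always the current worst candidate.
    heaps = [[] for _ in X]
    for start in range(0, len(Y), chunk_size):
        chunk = Y[start:start + chunk_size]
        for i, x in enumerate(X):
            for offset, y in enumerate(chunk):
                item = (-math.dist(x, y), -(start + offset))
                if len(heaps[i]) < k:
                    heapq.heappush(heaps[i], item)
                elif item > heaps[i][0]:
                    # Closer than the current worst candidate: replace it.
                    heapq.heapreplace(heaps[i], item)
    # Order each row's candidates by increasing distance, keep indices only.
    return [[-neg_idx for _, neg_idx in sorted(h, reverse=True)] for h in heaps]


X = [[0.0, 0.0], [5.0, 5.0]]
Y = [[1.0, 0.0], [4.0, 4.0], [0.0, 0.5], [6.0, 6.5], [10.0, 10.0]]
print(argkmin_chunked(X, Y, k=2))  # [[2, 0], [1, 3]]
```

The point of the sketch is the memory pattern: distances are computed per chunk and immediately reduced, which is what makes this kind of back-end attractive for large datasets.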
💡 Before working on improving performance, one must profile the current execution of the algorithms to check whether substantial gains are actually achievable.
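As a starting point for such profiling, the following sketch wraps a call in `cProfile` from the standard library. `brute_force_kneighbors` is a hypothetical pure-Python stand-in for whatever estimator call is under study, and `profile_call` is a helper introduced here for illustration. Note that `cProfile` by default only sees Python-level calls, so time spent inside compiled Cython loops is not broken down further; a native sampling profiler is usually needed for that.

```python
import cProfile
import io
import math
import pstats


def brute_force_kneighbors(X, Y, k):
    """Hypothetical stand-in for the call being profiled."""
    result = []
    for x in X:
        order = sorted(range(len(Y)), key=lambda j: math.dist(x, Y[j]))
        result.append(order[:k])
    return result


def profile_call(func, *args, top=5, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


X = [[float(i), float(i % 7)] for i in range(50)]
Y = [[float(j % 11), float(j)] for j in range(200)]
neighbors, report = profile_call(brute_force_kneighbors, X, Y, k=3)
print(report)
```

If the report shows that the neighbors search is not where the time goes, a dedicated back-end is unlikely to pay off.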
Subsequent possible work includes:
- FEA Introduce `PairwiseDistances`, a generic back-end for `pairwise_distances` #26983
- FEA `PairwiseDistancesReductions`: support for Boolean `DistanceMetrics` via stable simultaneous sort #25097
- Support `"precomputed"` distances
- Implement back-end for other `Estimators` or `Transformers`, in particular, introduce a specialised back-end which would remove the costly sequential portion around the current call to `kneighbors` or `radius_neighbors` for (non-complete list):
  - computing the MST from the data matrix (`mst_from_data_matrix`), see DOC Update `_hdbscan/_linkage.pyx` with new inline comments #25656 (review)
  - computing the MST linkage (`mst_linkage_core`)
  - `LocalOutlierFactor` (no further optimization possible)
  - `KNeighbors*.predict*`
    - PERF Implement `PairwiseDistancesReduction` backend for `KNeighbors.predict_proba` #24076 by @Micky774
    - PERF Euclidean Specialization for ArgKminClassMode #28219 by @OmarManzoor
    - Implement multi-output support (WIP here)
    - Support `"distance"` weighting
  - `RadiusNeighbors*.predict*`
    - PERF Implement `PairwiseDistancesReduction` backend for `RadiusNeighbors.predict_proba` #26828
    - Implement Euclidean specialization
    - Implement multi-output support
    - Support `"distance"` weighting
  - some part of `KMeans`
- Force the use of `PairwiseDistancesReductions` for F-contiguous arrays
  - use a scikit-learn configuration entry to accept and convert F-contiguous arrays to C-contiguous arrays
- Release the GIL before calling `parallel_on_{X|Y}` instead of releasing it in those methods
  - 💡 I tried this once and I don't think it is possible.
- Integrate `mimalloc` to have proper memory allocation for multi-threaded implementations:
  - 💡 Changing the default implementation of `malloc(1)` might come with unexpected changes and might break the whole ecosystem. Also maintaining it might be costly.
- Test improvement ideas:
  - TST Improve `assert_argkmin_results_quasi_equality` error message #27281 (comment)
  - parametrize old tests for public API backed by `PairwiseDistancesReductions`?
  - systematically test public API on combinations of sparse and dense datasets?
- Extend this design for a new back-end for `pairwise_kernels`-based interfaces
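For the `"distance"` weighting items listed under `KNeighbors*.predict*` and `RadiusNeighbors*.predict*`, the reduction a back-end would have to perform per query point can be sketched as follows. This is a pure-Python illustration under assumptions, not the existing Cython code: `neighbors_class_proba` is a hypothetical name, the neighbors and their distances are assumed to be already found (at least one per query), and the zero-distance handling (exact matches take all the weight) is an assumption made to avoid division by zero.

```python
def neighbors_class_proba(distances, labels, classes, weights="uniform"):
    """Class probabilities for one query point from its neighbors.

    distances: distance from the query to each neighbor
    labels:    class label of each neighbor
    classes:   ordered list of all class labels
    """
    if weights == "uniform":
        votes = [1.0] * len(distances)
    elif weights == "distance":
        if any(d == 0.0 for d in distances):
            # Assumption: exact matches take all the weight.
            votes = [1.0 if d == 0.0 else 0.0 for d in distances]
        else:
            # Closer neighbors get a larger vote (inverse distance).
            votes = [1.0 / d for d in distances]
    else:
        raise ValueError(f"unsupported weights: {weights!r}")

    # Weighted histogram of the neighbors' labels, then normalization.
    totals = {c: 0.0 for c in classes}
    for vote, label in zip(votes, labels):
        totals[label] += vote
    norm = sum(totals.values())
    return [totals[c] / norm for c in classes]


print(neighbors_class_proba([1.0, 2.0, 4.0], ["a", "b", "b"], ["a", "b"],
                            weights="distance"))
```

With uniform weights the same call reduces to a plain vote count, which is why both modes can share one accumulation loop.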