Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
46 commits
Select commit Hold shift + click to select a range
e4bf4e7
introduce PairwiseDistance
Vincent-Maladiere Feb 7, 2023
ddc464b
temporarily swap 'sqeuclidean' with 'euclidean' to fix tests
Vincent-Maladiere Feb 7, 2023
2a8b47a
Add nogil for benchmark purposes
Vincent-Maladiere Feb 8, 2023
fd4117b
update is_usable_for and docstrings
Vincent-Maladiere Feb 11, 2023
cfa7d8c
update whats_new
Vincent-Maladiere Feb 11, 2023
dd60dd5
Merge branch 'feature/PairwiseDistances' into feat/pairwise_distances…
Vincent-Maladiere Feb 11, 2023
5d9dcdb
remove chunksize and extend tests
Vincent-Maladiere Feb 14, 2023
8787747
Apply suggestions from code review
Vincent-Maladiere Feb 14, 2023
3fd55b9
add test_pairwise_distances_is_usable_for
Vincent-Maladiere Feb 14, 2023
e8dfef5
fix monkeypatch test and extend is_usable_for to single threaded manh…
Vincent-Maladiere Feb 14, 2023
77952d8
Merge branch 'main' into feat/pairwise_distances-pdr-backend
jjerphan Mar 8, 2023
9eed73f
DOC Add comments
jjerphan Mar 8, 2023
502725d
Convert sparse matrices to be CSR
jjerphan Mar 8, 2023
face5a9
Correct sqeucliden adaptation
jjerphan Mar 8, 2023
cdd2567
Correct condition for PairwiseDistances.is_usable_for
jjerphan Mar 9, 2023
32b4dad
TST Adapt test_pairwise_distances_is_usable_for
jjerphan Mar 9, 2023
9d6cd41
Use threadpool_limits over ThreadpoolController
jjerphan Mar 9, 2023
304d7d8
TST Increase atol for test_euclidean_distances_extreme_values
jjerphan Mar 9, 2023
99de991
DOC Add docstring for PairwiseDistances.is_usable_for
jjerphan Mar 9, 2023
4967009
FEA Introduce `PairwiseDistances`, a generic back-end for `pairwise_d…
jjerphan Mar 20, 2023
4f6b935
Merge branch 'main' into feature/PairwiseDistances
jjerphan Mar 20, 2023
40e4ff3
Merge branch 'main' into feat/pairwise_distances-pdr-backend
Vincent-Maladiere Jun 21, 2023
cb518ab
finalize merging with main by removing deprecated DTYPE and ITYPE
Vincent-Maladiere Jun 21, 2023
1754d18
Merge branch 'feature/PairwiseDistances' into feat/pairwise_distances…
Vincent-Maladiere Jun 21, 2023
0ba0ca6
Merge remote-tracking branch 'origin/main' into feature/PairwiseDista…
glemaitre Jun 21, 2023
8e94b7f
DOC Update link to best sphinx version for doc build (#26626)
lucyleeow Jun 21, 2023
53bdbe5
MNT add isort to ruff's rules (#26649)
adrinjalali Jun 21, 2023
e04160c
FIX use None as default in HDBSCAN (#26650)
glemaitre Jun 21, 2023
54600aa
Merge branch 'feature/PairwiseDistances' into feat/pairwise_distances…
Vincent-Maladiere Jun 22, 2023
b35d04c
Merge branch 'main' into pwd_generic
Micky774 Jul 31, 2023
5bdabfe
Merge branch 'main' into pwd_generic
Micky774 Aug 1, 2023
554294f
Update changelog
Micky774 Aug 1, 2023
8eff0da
Movex X_is_Y to PairwiseDistances
Micky774 Aug 1, 2023
e059b3b
Removed X_is_Y from datasets
Micky774 Aug 1, 2023
a70092c
Reverted dataset pair files
Micky774 Aug 1, 2023
e7842a9
Preserve `_sparse_manhattan` fast-path to avoid regression
Micky774 Aug 1, 2023
f4c1a7d
Relax usability constraints and un-special-case manhattan
Micky774 Aug 2, 2023
74c7085
Adjusted for `n_jobs` semantics and changed dispatch heuristic
Micky774 Aug 2, 2023
7966fa8
Removed old comments
Micky774 Aug 2, 2023
8ff2475
Remove now unnecessary test
Micky774 Aug 2, 2023
8a2a70b
Updated semantics to avoid accepting sqeuclidean
Micky774 Aug 2, 2023
816d665
Clean
Micky774 Aug 2, 2023
230d88d
Added heuristic for better backend dispatch
Micky774 Aug 2, 2023
2664b6c
Corrected use of n_jobs
Micky774 Aug 2, 2023
287e725
Revert changes to `manhattan_distances`
Micky774 Aug 2, 2023
7254cb0
Simplified tests
Micky774 Aug 2, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,8 @@ sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pxd
sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pyx
sklearn/metrics/_pairwise_distances_reduction/_middle_term_computer.pxd
sklearn/metrics/_pairwise_distances_reduction/_middle_term_computer.pyx
sklearn/metrics/_pairwise_distances_reduction/_pairwise_distances.pxd
sklearn/metrics/_pairwise_distances_reduction/_pairwise_distances.pyx
sklearn/metrics/_pairwise_distances_reduction/_radius_neighbors.pxd
sklearn/metrics/_pairwise_distances_reduction/_radius_neighbors.pyx
sklearn/neighbors/_ball_tree.pyx
Expand Down
12 changes: 12 additions & 0 deletions doc/whats_new/v1.4.rst
Original file line number Diff line number Diff line change
Expand Up @@ -199,6 +199,18 @@ Changelog
for CSR × CSR, Dense × CSR, and CSR × Dense datasets is now 1.5x faster.
:pr:`26765` by :user:`Meekail Zain <micky774>`

- |Efficiency| :func:`pairwise.pairwise_distances`' performance has been improved
when providing dense datasets.
:pr:`26983` by :user:`Meekail Zain <micky774>` and
:user:`Vincent Maladiere <Vincent-Maladiere>` and
:user:`Julien Jerphanion <jjerphan>`.

- |Feature| :func:`pairwise.pairwise_distances` now supports combination of
dense arrays and sparse CSR matrices datasets.
:pr:`26983` by :user:`Meekail Zain <micky774>` and
:user:`Vincent Maladiere <Vincent-Maladiere>` and
:user:`Julien Jerphanion <jjerphan>`.

:mod:`sklearn.utils`
....................

Expand Down
2 changes: 2 additions & 0 deletions setup.cfg
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,8 @@ ignore =
sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pyx
sklearn/metrics/_pairwise_distances_reduction/_middle_term_computer.pxd
sklearn/metrics/_pairwise_distances_reduction/_middle_term_computer.pyx
sklearn/metrics/_pairwise_distances_reduction/_pairwise_distances.pxd
sklearn/metrics/_pairwise_distances_reduction/_pairwise_distances.pyx
sklearn/metrics/_pairwise_distances_reduction/_radius_neighbors.pxd
sklearn/metrics/_pairwise_distances_reduction/_radius_neighbors.pyx
sklearn/neighbors/_ball_tree.pyx
Expand Down
6 changes: 6 additions & 0 deletions setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -277,6 +277,12 @@ def check_package_status(package, min_version):
"include_np": True,
"extra_compile_args": ["-std=c++11"],
},
{
"sources": ["_pairwise_distances.pyx.tp", "_pairwise_distances.pxd.tp"],
"language": "c++",
"include_np": True,
"extra_compile_args": ["-std=c++11"],
},
{
"sources": ["_argkmin.pyx.tp", "_argkmin.pxd.tp"],
"language": "c++",
Expand Down
2 changes: 2 additions & 0 deletions sklearn/metrics/_pairwise_distances_reduction/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -90,13 +90,15 @@
ArgKmin,
ArgKminClassMode,
BaseDistancesReductionDispatcher,
PairwiseDistances,
RadiusNeighbors,
sqeuclidean_row_norms,
)

__all__ = [
"BaseDistancesReductionDispatcher",
"ArgKmin",
"PairwiseDistances",
"RadiusNeighbors",
"ArgKminClassMode",
"sqeuclidean_row_norms",
Expand Down
148 changes: 148 additions & 0 deletions sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,10 @@
ArgKminClassMode64,
)
from ._base import _sqeuclidean_row_norms32, _sqeuclidean_row_norms64
from ._pairwise_distances import (
PairwiseDistances32,
PairwiseDistances64,
)
from ._radius_neighbors import (
RadiusNeighbors32,
RadiusNeighbors64,
Expand Down Expand Up @@ -153,6 +157,150 @@ def compute(
"""


class PairwiseDistances(BaseDistancesReductionDispatcher):
"""Compute the pairwise distances matrix for two sets of vectors.

The distance function `dist` depends on the values of the `metric`
and `metric_kwargs` parameters.

This class only computes the pairwise distances matrix without
applying any reduction on it. It shares most of the underlying
code infrastructure with reducing variants to leverage multi-thread
parallelism. However contrary to the reducing variants, no chunking
is applied to allow for contiguous write access to the final distance
array that is not expected to fit in the CPU cache in general.

This class is not meant to be instantiated, one should only use
its :meth:`compute` classmethod which handles allocation and
deallocation consistently.
"""

@classmethod
def valid_metrics(cls) -> List[str]:
excluded = {
# PyFunc cannot be supported because it necessitates interacting with
# the CPython interpreter to call user defined functions.
"pyfunc",
"mahalanobis", # is numerically unstable
# In order to support discrete distance metrics, we need to have a
# stable simultaneous sort which preserves the order of the indices
# because there generally is a lot of occurrences for a given values
# of distances in this case.
# TODO: implement a stable simultaneous_sort.
"hamming",
*BOOL_METRICS,
}
return sorted(set(METRIC_MAPPING64.keys()) - excluded)

@classmethod
def compute(
cls,
X,
Y,
metric="euclidean",
metric_kwargs=None,
strategy=None,
n_jobs=-1,
):
"""Return pairwise distances matrix for the given arguments.

Parameters
----------
X : ndarray or CSR matrix of shape (n_samples_X, n_features)
Input data.

Y : ndarray or CSR matrix of shape (n_samples_Y, n_features)
Input data.

metric : str, default='euclidean'
The distance metric to use.
For a list of available metrics, see the documentation of
:class:`~sklearn.metrics.DistanceMetric`.

metric_kwargs : dict, default=None
Keyword arguments to pass to specified metric function.

strategy : str, {'auto', 'parallel_on_X', 'parallel_on_Y'}, default=None
The strategy defining which dataset parallelization are made on.

For both strategies the computations happens with two nested loops,
respectively on rows of X and rows of Y.
Strategies differs on which loop (outer or inner) is made to run
in parallel with the Cython `prange` construct:

- 'parallel_on_X' dispatches rows of X uniformly on threads.
Each thread then iterates on all the rows of Y. This strategy is
embarrassingly parallel and comes with no datastructures
synchronisation.

- 'parallel_on_Y' dispatches rows of Y uniformly on threads.
Each thread processes all the rows of X in turn. This strategy is
a sequence of embarrassingly parallel subtasks (the inner loop on Y
chunks) with no intermediate datastructures synchronisation.

- 'auto' relies on a simple heuristic to choose between
'parallel_on_X' and 'parallel_on_Y': when `X.shape[0]` is large enough,
'parallel_on_X' is usually the most efficient strategy.
When `X.shape[0]` is small but `Y.shape[0]` is large, 'parallel_on_Y'
brings more opportunity for parallelism and is therefore more efficient.

- None (default) looks-up in scikit-learn configuration for
`pairwise_dist_parallel_strategy`, and use 'auto' if it is not set.

n_jobs : int, default=None
The number of jobs to use for the computation. This works by breaking
down the pairwise matrix into n_jobs even slices and computing them in
parallel.

``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
``-1`` means using all processors. See :term:`Glossary <n_jobs>`
for more details.

Returns
-------
pairwise_distances_matrix : ndarray of shape (n_samples_X, n_samples_Y)
The pairwise distances matrix.

Notes
-----
This public classmethod is responsible for introspecting the arguments
values to dispatch to the private dtype-specialized implementation of
:class:`PairwiseDistances`.

All temporarily allocated datastructures necessary for the concrete
implementations are therefore freed when this classmethod returns.

This allows entirely decoupling the API entirely from the
implementation details whilst maintaining RAII.
"""
n_jobs = 1 if n_jobs is None else n_jobs
Y = X if Y is None else Y
if X.dtype == Y.dtype == np.float64:
return PairwiseDistances64.compute(
X=X,
Y=Y,
metric=metric,
metric_kwargs=metric_kwargs,
strategy=strategy,
n_jobs=n_jobs,
)

if X.dtype == Y.dtype == np.float32:
return PairwiseDistances32.compute(
X=X,
Y=Y,
metric=metric,
metric_kwargs=metric_kwargs,
strategy=strategy,
n_jobs=n_jobs,
)

raise ValueError(
"Only float64 or float32 datasets pairs are supported, but "
f"got: X.dtype={X.dtype} and Y.dtype={Y.dtype}."
)


class ArgKmin(BaseDistancesReductionDispatcher):
"""Compute the argkmin of row vectors of X on the ones of Y.

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
{{py:

implementation_specific_values = [
# Values are the following ones:
#
# name_suffix, INPUT_DTYPE_t
#
#
('64', 'cnp.float64_t'),
('32', 'cnp.float32_t')
]

}}
cimport numpy as cnp

from ...utils._typedefs cimport intp_t
{{for name_suffix, INPUT_DTYPE_t in implementation_specific_values}}

from ._datasets_pair cimport DatasetsPair{{name_suffix}}


cdef class PairwiseDistances{{name_suffix}}:
"""float{{name_suffix}} implementation of PairwiseDistances."""

cdef:
readonly DatasetsPair{{name_suffix}} datasets_pair

intp_t n_samples_X, n_samples_Y
intp_t effective_n_threads
bint X_is_Y
bint execute_in_parallel_on_Y

{{INPUT_DTYPE_t}}[:, ::1] pairwise_distances_matrix

cdef void _parallel_on_X(self) nogil

cdef void _parallel_on_Y(self) nogil


{{endfor}}
Loading