FEAT Support precomputed distance matrix for PairwiseDistancesReductions #29483

Open
wants to merge 80 commits into
base: main
722033c
Co-authored-by: @jjerphan Add initial commit
adam2392 Mar 30, 2023
124a8dc
Merge branch 'main' into precomputed
adam2392 Feb 17, 2024
cd1e146
Merge branch 'main' into precomputed
kyrajeep Jul 8, 2024
046898b
Add precomputed option in dispatcher classes
kyrajeep Jul 12, 2024
02b5145
comments, questions, code to feature precomputed and maintain class h…
kyrajeep Jul 13, 2024
d32e593
Merge branch 'scikit-learn:main' into feat_precomputed
kyrajeep Jul 14, 2024
5351d1a
Add the precomputed option in argkmin_classmode
kyrajeep Jul 15, 2024
5c71f6d
Merge remote-tracking branch 'refs/remotes/origin/feat_precomputed' i…
kyrajeep Jul 15, 2024
50b95ea
Update is_usable_for to use the dispatcher for precomputed
kyrajeep Jul 15, 2024
096ab56
Changed the name from distance_matrix to precomputed_distance
kyrajeep Jul 15, 2024
194ac01
Start of changing BaseDistancesReduction superclass
kyrajeep Jul 22, 2024
de74336
Add the function to return the precomputed matrix
kyrajeep Jul 22, 2024
e7c1e52
Merge branch 'scikit-learn:main' into feat_precomputed
kyrajeep Jul 22, 2024
f2f31ea
Merge branch 'scikit-learn:main' into feat_precomputed
kyrajeep Jul 28, 2024
f239c24
Add to BaseDistanceReduction the function to take in and return a pre…
kyrajeep Jul 28, 2024
d961e3b
Add comments to work on the subclass of DatasetsPair
kyrajeep Jul 28, 2024
316c72a
Alter the compute method of ArgKminClassMode to take in precomputed m…
kyrajeep Aug 1, 2024
3a027c7
Delete `venv_sklearn`
jjerphan Aug 5, 2024
a9fc4a8
Delete a line regarding precomputed since this file is not to be changed
kyrajeep Aug 7, 2024
c57f2d5
Merge remote-tracking branch 'refs/remotes/origin/feat_precomputed' i…
kyrajeep Aug 7, 2024
a7d14d2
Merge branch 'scikit-learn:main' into feat_precomputed
kyrajeep Aug 7, 2024
08c62c4
Update sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.p…
kyrajeep Aug 7, 2024
43bb4fc
Add venv/ to ignore my local virtual environment
kyrajeep Aug 7, 2024
c435341
Merge branch 'feat_precomputed' of https://github.com/kyrajeep/scikit…
kyrajeep Aug 9, 2024
7126430
Write the methods to fulfill the abstract class requirements
kyrajeep Aug 9, 2024
56e09e9
Set default for input data to none and check if provided correctly fo…
kyrajeep Aug 15, 2024
490aa3c
Update the XOR check
kyrajeep Aug 15, 2024
ef8e9b5
Modify get_for classmethod
kyrajeep Aug 22, 2024
2d01d3e
Modify get_for classmethod
kyrajeep Aug 22, 2024
2bf45c4
Merge branch 'scikit-learn:main' into feat_precomputed
kyrajeep Aug 24, 2024
71106c6
Modify dispatcher's {ArgKmin, RadiusNeighbors}{ClassMode}{32,64} for …
kyrajeep Aug 25, 2024
1fe4007
fixed by the linter black
kyrajeep Aug 25, 2024
979358a
Update sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.p…
kyrajeep Aug 27, 2024
6728d12
Correct the syntax as requested
kyrajeep Aug 27, 2024
05766f9
Remove default value of none
kyrajeep Aug 27, 2024
0feb10c
Remove blank lines..
kyrajeep Aug 27, 2024
acadf77
Update pyproject.toml
kyrajeep Aug 29, 2024
74adecc
Update the doctring
kyrajeep Aug 29, 2024
9ffc78e
Remove blank lines
kyrajeep Aug 29, 2024
c950e07
Remove a questions in comment
kyrajeep Aug 29, 2024
fdbe05e
Update .gitignore to not commit a virtual env
kyrajeep Aug 29, 2024
d138e45
Remove unnecessary blanks
kyrajeep Aug 29, 2024
7825942
Pass the precomuted matrix as a DatasetsPair object directly
kyrajeep Oct 1, 2024
f3a016f
Initial commit to check the precomputed input array size
kyrajeep Oct 4, 2024
d0a9021
Added tests to check for NaNs, np.array, data types
kyrajeep Oct 4, 2024
3a3b2ea
Revert some unnecessary type casting
kyrajeep Oct 21, 2024
c5e4bfd
Fixing minor issues
kyrajeep Oct 28, 2024
bd2d051
Merge branch 'scikit-learn:main' into feat_precomputed
kyrajeep Nov 15, 2024
08f3d0a
Fix the tests
kyrajeep Nov 16, 2024
84a491a
bug fixes regarding data types, variable declaration with cython
kyrajeep Nov 18, 2024
23c5a66
Merge branch 'scikit-learn:main' into feat_precomputed
kyrajeep Nov 18, 2024
d3f556f
Delete all venv files
kyrajeep Nov 18, 2024
6dcce7a
Merge branch 'feat_precomputed' of https://github.com/kyrajeep/scikit…
kyrajeep Nov 18, 2024
d6670e9
Delete venv_sklearn directory
kyrajeep Nov 18, 2024
0e4eef3
Merge branch 'feat_precomputed' of https://github.com/kyrajeep/scikit…
kyrajeep Nov 25, 2024
7b4980a
Delete a blank line
kyrajeep Nov 25, 2024
18bc6d8
Fix formatting
kyrajeep Nov 25, 2024
d255bc5
Fix formatting
kyrajeep Nov 25, 2024
b65657c
Remove with gil for precomputed matrix
kyrajeep Nov 25, 2024
b26768f
Merge branch 'feat_precomputed' of https://github.com/kyrajeep/scikit…
kyrajeep Nov 25, 2024
45ea5f7
Fix formatting
kyrajeep Nov 25, 2024
f0efc37
Add tests for ValueErrors and fix the XOR for the method, is_usable.
kyrajeep Dec 4, 2024
7cbfd7a
Merge branch 'feat_precomputed' of https://github.com/kyrajeep/scikit…
kyrajeep Dec 4, 2024
8f15a44
syntax fix
kyrajeep Dec 6, 2024
38e5b84
syntax fix
kyrajeep Dec 6, 2024
a97e0c5
syntax fix
kyrajeep Dec 6, 2024
4d10786
syntax fix
kyrajeep Dec 6, 2024
305a90a
Remove type annotations
kyrajeep Dec 6, 2024
b7da97d
Merge branch 'scikit-learn:main' into feat_precomputed
kyrajeep Dec 7, 2024
4334074
Merge branch 'scikit-learn:main' into feat_precomputed
kyrajeep Dec 9, 2024
22c5d14
Revert change
kyrajeep Dec 13, 2024
8143dc5
Revert change
kyrajeep Dec 18, 2024
59eccd4
Merge branch 'feat_precomputed' of https://github.com/kyrajeep/scikit…
kyrajeep Dec 18, 2024
c13dc12
Merge branch 'main' into feat_precomputed
kyrajeep Jan 22, 2025
eb7bf54
Test the precomputed input against the actual computation for a sample
kyrajeep Feb 25, 2025
8d55978
Add "precomputed" as one of the options for "metric"
kyrajeep Feb 25, 2025
46ece7a
Change DatasetsPair to use the metric option to indicate precomputed
kyrajeep Mar 4, 2025
78244a0
API change to include precomputed as a metric in the base abstract class
kyrajeep Mar 4, 2025
6667775
API change to include precomputed as one of the metrics: check X and Y
kyrajeep Mar 10, 2025
22b8c02
Change API to enable precomputed as a metric and take in X=precompute…
kyrajeep Mar 10, 2025
17 changes: 11 additions & 6 deletions sklearn/metrics/_dist_metrics.pyx.tp
Original file line number Diff line number Diff line change
Expand Up @@ -232,6 +232,7 @@ cdef class DistanceMetric:
# metric mappings
# These map from metric id strings to class names
METRIC_MAPPING{{name_suffix}} = {
'precomputed': PrecomputedDistanceMatrix{{name_suffix}},
'euclidean': EuclideanDistance{{name_suffix}},
'l2': EuclideanDistance{{name_suffix}},
'minkowski': MinkowskiDistance{{name_suffix}},
Expand Down Expand Up @@ -359,13 +360,17 @@ cdef class DistanceMetric{{name_suffix}}(DistanceMetric):

**User-defined distance:**

=========== =============== =======
identifier class name args
----------- --------------- -------
"pyfunc" PyFuncDistance func
=========== =============== =======
=============  =========================  ===========
identifier     class name                 args
-------------  -------------------------  -----------
"precomputed"  PrecomputedDistanceMatrix  precomputed
"pyfunc"       PyFuncDistance             func
=============  =========================  ===========

Here ``func`` is a function which takes two one-dimensional numpy
"precomputed" indicates that the pairwise distances have already been
computed and are passed in directly as a distance matrix.

``func`` is a function which takes two one-dimensional numpy
arrays, and returns a distance. Note that in order to be used within
the BallTree, the distance must be a true metric:
i.e. it must satisfy the following properties
Expand Down
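To make the "precomputed" identifier concrete, here is a minimal pure-Python sketch of what such a matrix holds. The helper name `pairwise_euclidean` is hypothetical and not part of this PR; it only illustrates that entry `[i][j]` is the distance between row `i` of `X` and row `j` of `Y`, giving a shape of `(n_samples_X, n_samples_Y)`.

```python
import math


def pairwise_euclidean(X, Y):
    """Hypothetical helper: dense (len(X), len(Y)) Euclidean distance matrix."""
    return [[math.dist(x, y) for y in Y] for x in X]


X = [[0.0, 0.0], [3.0, 4.0]]
Y = [[0.0, 0.0], [6.0, 8.0], [3.0, 0.0]]

# D[i][j] is the distance between X[i] and Y[j]; D has 2 rows and 3 columns.
D = pairwise_euclidean(X, Y)
```

A caller who already holds `D` would pass it in place of `X` with `metric="precomputed"`, which is the usage the later commits in this PR appear to aim for.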
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ from ._classmode cimport WeightingStrategy
{{for name_suffix in ["32", "64"]}}
from ._argkmin cimport ArgKmin{{name_suffix}}
from ._datasets_pair cimport DatasetsPair{{name_suffix}}
from ._datasets_pair cimport PrecomputedDistanceMatrix{{name_suffix}}

cdef class ArgKminClassMode{{name_suffix}}(ArgKmin{{name_suffix}}):
"""
Expand Down
3 changes: 1 addition & 2 deletions sklearn/metrics/_pairwise_distances_reduction/_base.pxd.tp
Original file line number Diff line number Diff line change
Expand Up @@ -20,8 +20,7 @@ cdef class BaseDistancesReduction{{name_suffix}}:
Implementations inherit from this template and may override the several
defined hooks as needed in order to easily extend functionality with
minimal redundant code.
"""

"""
cdef:
readonly DatasetsPair{{name_suffix}} datasets_pair

Expand Down
4 changes: 3 additions & 1 deletion sklearn/metrics/_pairwise_distances_reduction/_base.pyx.tp
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,9 @@ cdef class BaseDistancesReduction{{name_suffix}}:
Implementations inherit from this template and may override the several
defined hooks as needed in order to easily extend functionality with
minimal redundant code.

If metric is 'precomputed' and the precomputed matrix is provided,
a subclass must be able to access it through the compute method.
"""

def __init__(
Expand All @@ -137,7 +140,6 @@ cdef class BaseDistancesReduction{{name_suffix}}:
):
cdef:
intp_t X_n_full_chunks, Y_n_full_chunks

if chunk_size is None:
chunk_size = get_config().get("pairwise_dist_chunk_size", 256)

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,11 @@ cdef class DatasetsPair{{name_suffix}}:
cdef float64_t surrogate_dist(self, intp_t i, intp_t j) noexcept nogil


cdef class PrecomputedDistanceMatrix{{name_suffix}}(DatasetsPair{{name_suffix}}):
cdef:
const {{INPUT_DTYPE_t}}[:, ::1] distance_matrix


cdef class DenseDenseDatasetsPair{{name_suffix}}(DatasetsPair{{name_suffix}}):
cdef:
const {{INPUT_DTYPE_t}}[:, ::1] X
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -53,8 +53,8 @@ cdef class DatasetsPair{{name_suffix}}:
@classmethod
def get_for(
cls,
X,
Y,
X = None,
Y = None,
metric="euclidean",
dict metric_kwargs=None,
) -> DatasetsPair{{name_suffix}}:
Expand Down Expand Up @@ -98,6 +98,9 @@ cdef class DatasetsPair{{name_suffix}}:
metric_kwargs = copy.copy(metric_kwargs)
metric_kwargs.pop("X_norm_squared", None)
metric_kwargs.pop("Y_norm_squared", None)
if metric == "precomputed":
return PrecomputedDistanceMatrix{{name_suffix}}(X)

cdef:
{{DistanceMetric}} distance_metric = DistanceMetric.get_metric(
metric,
Expand Down Expand Up @@ -158,6 +161,43 @@ cdef class DatasetsPair{{name_suffix}}:
# TODO: add "with gil: raise" here when supporting Cython 3.0
return -1


@final
cdef class PrecomputedDistanceMatrix{{name_suffix}}(DatasetsPair{{name_suffix}}):
"""A DatasetsPair backed by a precomputed distance matrix.

Parameters
----------
precomputed_distance : ndarray of shape (n_samples_X, n_samples_Y)
    Precomputed pairwise distances. Must be C-contiguous.
"""

def __init__(
self,
const {{INPUT_DTYPE_t}}[:, ::1] precomputed_distance,
):
super().__init__(
distance_metric=DistanceMetric{{name_suffix}}(),
n_features=0,
)
# This array has already been checked.
self.distance_matrix = precomputed_distance

@final
cdef intp_t n_samples_X(self) noexcept nogil:
return self.distance_matrix.shape[0]

@final
cdef intp_t n_samples_Y(self) noexcept nogil:
return self.distance_matrix.shape[1]

@final
cdef float64_t surrogate_dist(self, intp_t i, intp_t j) noexcept nogil:
return self.distance_matrix[i, j]

@final
cdef float64_t dist(self, intp_t i, intp_t j) noexcept nogil:
return self.distance_matrix[i, j]
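
The class above only has to answer three questions for a reduction: how many rows, how many columns, and what the distance at `(i, j)` is. The following is a pure-Python sketch of that idea, not the Cython implementation; `PrecomputedPair` and `argkmin` are hypothetical names used for illustration only.

```python
import heapq


class PrecomputedPair:
    """Hypothetical pure-Python stand-in for PrecomputedDistanceMatrix."""

    def __init__(self, distance_matrix):
        # Assumed to be a rectangular list of lists, already validated.
        self.distance_matrix = distance_matrix

    def n_samples_X(self):
        return len(self.distance_matrix)

    def n_samples_Y(self):
        return len(self.distance_matrix[0]) if self.distance_matrix else 0

    def dist(self, i, j):
        return self.distance_matrix[i][j]

    def surrogate_dist(self, i, j):
        # No cheaper rank-preserving surrogate exists for a lookup,
        # so the surrogate is the distance itself.
        return self.dist(i, j)


def argkmin(pair, k):
    """For each row i, return the indices of the k smallest distances."""
    result = []
    for i in range(pair.n_samples_X()):
        columns = range(pair.n_samples_Y())
        nearest = heapq.nsmallest(
            k, columns, key=lambda j: pair.surrogate_dist(i, j)
        )
        result.append(nearest)
    return result


pair = PrecomputedPair([[0.0, 10.0, 3.0], [5.0, 6.0, 4.0]])
neighbors = argkmin(pair, k=2)  # [[0, 2], [2, 0]]
```

Because `dist` and `surrogate_dist` are plain lookups, the reduction never needs the original feature vectors, which is consistent with `n_features=0` being passed to the base class.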

@final
cdef class DenseDenseDatasetsPair{{name_suffix}}(DatasetsPair{{name_suffix}}):
"""Compute distances between row vectors of two arrays.
Expand Down
67 changes: 58 additions & 9 deletions sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py
Original file line number Diff line number Diff line change
Expand Up @@ -81,10 +81,12 @@ def valid_metrics(cls) -> List[str]:
"hamming",
*BOOL_METRICS,
}
return sorted(({"sqeuclidean"} | set(METRIC_MAPPING64.keys())) - excluded)
return sorted(
({"sqeuclidean", "precomputed"} | set(METRIC_MAPPING64.keys())) - excluded
)

@classmethod
def is_usable_for(cls, X, Y, metric) -> bool:
def is_usable_for(cls, X=None, Y=None, metric="euclidean") -> bool:
"""Return True if the dispatcher can be used for the
given parameters.

Expand All @@ -96,6 +98,8 @@ def is_usable_for(cls, X, Y, metric) -> bool:
Y : {ndarray, sparse matrix} of shape (n_samples_Y, n_features)
Input data.

precomputed: ndarray of shape (n_samples_X, n_samples_Y)

metric : str, default='euclidean'
The distance metric to use.
For a list of available metrics, see the documentation of
Expand All @@ -105,7 +109,15 @@ def is_usable_for(cls, X, Y, metric) -> bool:
-------
True if the dispatcher can be used, else False.
"""

if metric == "precomputed":
if X is not None and Y is None:
is_usable = True
else:
is_usable = False

# is_usable = (X is not None and Y is not None) ^ bool(precomputed)
if is_usable == False:

Won't is_usable be undefined if metric != "precomputed"?

return is_usable
# FIXME: the current Cython implementation is too slow for a large number of
# features. We temporarily disable it to fallback on SciPy's implementation.
# See: https://github.com/scikit-learn/scikit-learn/issues/28191
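
One way to address the review comment above is to return a value on every path, so that no branch reads a variable that was never assigned. This is a sketch only: the `valid_metrics` parameter and the non-precomputed branch are placeholders standing in for the dispatcher's existing checks (dtype, sparsity, configuration), which are not reproduced here.

```python
def is_usable_for(X=None, Y=None, metric="euclidean",
                  valid_metrics=("euclidean", "precomputed")):
    """Sketch: every branch returns a bool, so nothing is left undefined."""
    if metric == "precomputed":
        # The precomputed matrix is passed as X, and Y must be omitted.
        return X is not None and Y is None

    # Placeholder for the existing checks on regular inputs.
    return X is not None and Y is not None and metric in valid_metrics
```

Returning early from the precomputed branch also removes the need for the intermediate `is_usable` flag and the `== False` comparison.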
Expand Down Expand Up @@ -188,9 +200,9 @@ class ArgKmin(BaseDistancesReductionDispatcher):
@classmethod
def compute(
cls,
X,
Y,
k,
X=None,
Y=None,
k=None,
metric="euclidean",
chunk_size=None,
metric_kwargs=None,
Expand Down Expand Up @@ -277,6 +289,25 @@ def compute(
for the concrete implementation are therefore freed when this classmethod
returns.
"""
"""
if X is None and Y is None and precomputed_matrix is None:
raise ValueError("Either X and Y or precomputed_matrix must be provided.")
elif X is not None and Y is not None and precomputed_matrix is not None:
raise ValueError(
"Only one of X and Y or precomputed_matrix must be provided."
)
elif X is None and Y is not None:
raise ValueError("Y should not be provided without X.")
elif X is not None and Y is None:
raise ValueError("X should not be provided without Y.")
"""

if metric == "precomputed":
if X is None:
raise ValueError("X should be provided as a precomputed value")
if Y is not None:
raise ValueError("Y should not be provided as a precomputed value")

if X.dtype == Y.dtype == np.float64:
return ArgKmin64.compute(
X=X,
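
The precomputed checks in `compute` could be gathered into one helper that both `ArgKmin` and `RadiusNeighbors` call. The sketch below is an assumption about how that might look, written in pure Python: `check_precomputed` is a hypothetical name, and the rectangularity, NaN, and non-negativity checks go beyond what the diff shows (the commit history only mentions tests for NaNs, array type, and data types).

```python
import math


def check_precomputed(X, Y=None):
    """Hypothetical validator for metric='precomputed' inputs.

    Returns (n_samples_X, n_samples_Y) when the input is acceptable.
    """
    if X is None:
        raise ValueError("X should be provided as a precomputed value")
    if Y is not None:
        raise ValueError("Y should not be provided as a precomputed value")

    n_cols = len(X[0]) if len(X) else 0
    for row in X:
        if len(row) != n_cols:
            raise ValueError("The precomputed matrix must be rectangular.")
        for d in row:
            # Assumption: distances must be finite numbers and non-negative.
            if math.isnan(d) or d < 0:
                raise ValueError(
                    "The precomputed matrix must not contain NaN "
                    "or negative values."
                )
    return len(X), n_cols
```

Note that once `Y` is required to be `None`, a later check such as `X.dtype == Y.dtype` can no longer be reached safely on this path, so the precomputed branch would need its own dtype dispatch.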
Expand Down Expand Up @@ -326,9 +357,9 @@ class RadiusNeighbors(BaseDistancesReductionDispatcher):
@classmethod
def compute(
cls,
X,
Y,
radius,
X=None,
Y=None,
radius=None,
metric="euclidean",
chunk_size=None,
metric_kwargs=None,
Expand Down Expand Up @@ -421,6 +452,24 @@ def compute(
for the concrete implementation are therefore freed when this classmethod
returns.
"""
"""
if X is None and Y is None and precomputed is None:
raise ValueError("Either X and Y or precomputed must be provided.")
elif X is not None and Y is not None and precomputed is not None:
raise ValueError("Only one of X and Y or precomputed must be provided.")
elif X is None and Y is not None:
raise ValueError("Y should not be provided without X.")
elif X is not None and Y is None:
raise ValueError("X should not be provided without Y.")
elif precomputed:
return precomputed
"""
if metric == "precomputed":
if X is None:
raise ValueError("X should be provided as a precomputed value")
if Y is not None:
raise ValueError("Y should not be provided as a precomputed value")

if X.dtype == Y.dtype == np.float64:
return RadiusNeighbors64.compute(
X=X,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -101,7 +101,7 @@ cdef class RadiusNeighbors{{name_suffix}}(BaseDistancesReduction{{name_suffix}})
# Fall back on a generic implementation that handles most scipy
# metrics by computing the distances between 2 vectors at a time.
pda = RadiusNeighbors{{name_suffix}}(
datasets_pair=DatasetsPair{{name_suffix}}.get_for(X, Y, metric, metric_kwargs),
datasets_pair=DatasetsPair{{name_suffix}}.get_for(X, Y, precomputed, metric, metric_kwargs),
radius=radius,
chunk_size=chunk_size,
strategy=strategy,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ cdef class RadiusNeighborsClassMode{{name_suffix}}(RadiusNeighbors{{name_suffix}
# Use a generic implementation that handles most scipy
# metrics by computing the distances between 2 vectors at a time.
pda = RadiusNeighborsClassMode{{name_suffix}}(
datasets_pair=DatasetsPair{{name_suffix}}.get_for(X, Y, metric, metric_kwargs),
datasets_pair=DatasetsPair{{name_suffix}}.get_for(X, Y, precomputed, metric, metric_kwargs),
radius=radius,
chunk_size=chunk_size,
strategy=strategy,
Expand Down