ENH Pairwise Distances ArgKmin #21462

Closed
wants to merge 322 commits into from
322 commits
40e369d
Remove duplicate code to vector-to-ndarray coercion
jjerphan Jul 9, 2021
a3de08a
Use consistent names
jjerphan Jul 9, 2021
81b1b1b
Plug RadiusNeighborhood in RadiusNeighborsMixin.radius_neighbors
jjerphan Jul 9, 2021
b8fe6e1
Correctly free vectors using del
jjerphan Jul 12, 2021
54fb2c5
Use a sentinel for managing vectors' memory
jjerphan Jul 12, 2021
9df4047
Remove temporary consistency test
jjerphan Jul 12, 2021
7c713a1
Add comments
jjerphan Jul 12, 2021
7ea6daa
Merge pull request #3 from jjerphan/pairwise_aggregation_cython-radius
jjerphan Jul 13, 2021
8e54572
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Jul 13, 2021
1d3336d
Revert to 'euclidean' when 'fast_sqeuclidean' can't be used
jjerphan Jul 13, 2021
d96b163
Use 'fast_sqeuclidean' for Birch internals
jjerphan Jul 13, 2021
f660dbf
Fix PairwiseDistancesReduction.is_usable
jjerphan Jul 13, 2021
bb24d95
Make array C-ordered for test
jjerphan Jul 13, 2021
6eea1aa
Annotate with cython.final when relevant
jjerphan Jul 13, 2021
15c4150
Black contains all the color that I like
jjerphan Jul 15, 2021
a9706d6
Use relative imports
jjerphan Jul 15, 2021
0935975
Use method for flatten
jjerphan Jul 15, 2021
e2b5398
Correct cross-referencing for metrics.DistanceMetric
jjerphan Jul 15, 2021
ce1ccdc
Precise that p is the parameter used by 'minkowski'
jjerphan Jul 15, 2021
a9fe71f
Prefer assert_allclose over assert_array_equal
jjerphan Jul 20, 2021
1593dab
Prefer csr_matrix.toarray over csr_matrix.A
jjerphan Jul 20, 2021
01c1294
Rework test for correct behavior regarding the radius
jjerphan Jul 20, 2021
4b6a041
Inline heap pushes
jjerphan Jul 20, 2021
c26b583
Mirror ValueError for incorrectly set sort_results and return_distances
jjerphan Jul 20, 2021
5bcea9f
Parametrise test_radius_neighbors_graph_sparse
jjerphan Jul 20, 2021
474804c
Lighten and correct test
jjerphan Jul 21, 2021
e84ff1b
Allow other dtypes than np.float64
jjerphan Jul 21, 2021
ac2ce70
Parametrise test_kneighbors_graph_sparse
jjerphan Jul 21, 2021
9f612a1
fixup! Remove uncalled snippet
jjerphan Jul 21, 2021
74240bd
Mark a test case as xfail for test_fast_sqeuclidean_correctness
jjerphan Jul 21, 2021
32b08af
fixup! Inline heap pushes
jjerphan Jul 21, 2021
8cfabc8
fixup! Adapt cython submodule for heaps
jjerphan Jul 21, 2021
dc1079f
Do not push if the element is identical to the largest
jjerphan Jul 21, 2021
f5308b0
Remove X and Y from _reduce_on_chunks signature
jjerphan Jul 12, 2021
19d461f
Introduce DistanceMetric.sparse_{rdist,dist}
jjerphan Jul 22, 2021
444c4bc
Introduce DistanceComputer
jjerphan Jul 21, 2021
cf24fd0
Adapt tests for behavior on duplicates
jjerphan Jul 21, 2021
e7aaa71
Free datastructure if present and do not raise otherwise
jjerphan Jul 22, 2021
718a4c2
Adapt doctest to the new behavior
jjerphan Jul 22, 2021
b70db13
Monkey-patch neighbors to alias metrics.DistanceMetric
jjerphan Jul 22, 2021
39e8189
Revert "Make array C-ordered for test"
jjerphan Jul 22, 2021
71f0130
Makes test parametrisation execution order fixed
jjerphan Jul 22, 2021
cd7e1a2
Introduce _openmp_thread_num in helpers
jjerphan Jul 22, 2021
29e5c7b
Reorder macros, cython and python imports
jjerphan Jul 22, 2021
dbbb8e9
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Jul 22, 2021
550bb76
Improve comments for DistanceComputer
jjerphan Jul 27, 2021
0d5480c
Use 'proxy' instead for 'approx' as a wording
jjerphan Jul 27, 2021
305f007
Follow PEP 257
jjerphan Jul 27, 2021
ba89532
Fix RadiusNeighborhood.__dealloc__
jjerphan Jul 27, 2021
3b1e98c
Rename DistanceComputer for DatasetPairs
jjerphan Jul 27, 2021
7ae5091
Minimalistically document template methods
jjerphan Jul 27, 2021
2f9dc02
Reallocate datastructures for results at each new call
jjerphan Jul 27, 2021
4f3bd4c
Avoid thread over-subscription for BLAS
jjerphan Jul 27, 2021
9d6b83b
Introduce FastSquaredEuclideanRadiusNeighborhood
jjerphan Jul 28, 2021
a319358
Adapt internal checks
jjerphan Jul 28, 2021
3ebb200
Add test suite for PairwiseDistancesReduction
jjerphan Jul 28, 2021
69b0ad9
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Jul 28, 2021
f63692a
Merge vectors at the really end using dynamic scheduling
jjerphan Jul 28, 2021
5f1d3a0
Pull the distance metric up and make it readonly
jjerphan Jul 28, 2021
bd81245
Use proxy distance for RadiusNeighborhood reduction
jjerphan Jul 28, 2021
2e48051
fixup! Correct cross-referencing for metrics.DistanceMetric
jjerphan Jul 29, 2021
53ba89d
Improve inputs checks
jjerphan Jul 29, 2021
1632e14
Rename submodule for pairwise distances reductions
jjerphan Jul 29, 2021
906b1e4
Move DatasetsPair closer to DistanceMetrics
jjerphan Jul 29, 2021
eb988ea
Add const qualifier on squared norms memoryviews
jjerphan Jul 29, 2021
9fc5e95
Improve style
jjerphan Jul 29, 2021
4f73848
Improve docstring notse for FastSquaredEuclidean alternatives
jjerphan Jul 29, 2021
ea3d791
Fix __cinit__
jjerphan Jul 29, 2021
445a22d
Skip for tests for 32 bits
jjerphan Jul 29, 2021
2647327
Adapt tests for better parametrisation variability
jjerphan Jul 29, 2021
420dbc8
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Aug 4, 2021
4b2c165
Adaptation for DistanceMetrics
jjerphan Aug 4, 2021
3c71fd6
Improve style
jjerphan Aug 4, 2021
eaa892f
Parametrise tests by seed last
jjerphan Aug 4, 2021
7d5f603
Lighten tests parametrization
jjerphan Aug 4, 2021
91e3b27
Use squared norms of X vectors in FastSquaredEuclideanArgKmin
jjerphan Aug 4, 2021
c4add77
fixup! Adapt tests for better parametrisation variability
jjerphan Aug 4, 2021
caa7faa
Simplify tests
jjerphan Aug 4, 2021
c48c886
Improve test parametrisation on DistanceMetric
jjerphan Aug 4, 2021
4f06c3a
Improve heap routines' interfaces
jjerphan Aug 4, 2021
962f535
fixup! Improve heap routines' interfaces
jjerphan Aug 4, 2021
1b71282
Fix docstring
jjerphan Aug 4, 2021
dc8ddf4
Add missing dtype for indices
jjerphan Aug 4, 2021
5505f4f
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Aug 4, 2021
bd1b0d9
Better deprecate neighbors.DistanceMetric
jjerphan Aug 4, 2021
c0dbc97
Add link to CPython docs regarding reference stealing
jjerphan Aug 5, 2021
2b0d3a6
Force the coretype to be armv8 on linux-arm64
jjerphan Aug 5, 2021
fb3866c
Revert "Force the coretype to be armv8 on linux-arm64"
jjerphan Aug 6, 2021
e55fd94
Use conda-forge to test arm64
jjerphan Aug 6, 2021
50d2669
Use Mambaforge instead
jjerphan Aug 6, 2021
84c4315
Install all dependencies in a row via mamba
jjerphan Aug 6, 2021
2411ffc
Mark tests as xfail when in unstable OpenBLAS configuration
jjerphan Aug 6, 2021
b7bbd06
Lighten tests' parametrizations
jjerphan Aug 6, 2021
042e228
Improve checks for unstable OpenBLAS configuration
jjerphan Aug 6, 2021
394d9dc
fixup! Use Mambaforge instead
jjerphan Aug 6, 2021
b2d80dc
Check against now made privated in_unstable_openblas_configuration
jjerphan Aug 6, 2021
ae097cf
fixup! Use conda-forge to test arm64
jjerphan Aug 6, 2021
1821aad
Remove SparseEfficiencyWarning
jjerphan Aug 11, 2021
ca942a5
Apply suggestions from reviews comments and discussions
jjerphan Aug 11, 2021
89f909c
Correct string alignement
jjerphan Aug 11, 2021
6887a37
Improve comment for 32 bits fallback
jjerphan Aug 11, 2021
0832dc4
Remove some checks and interactions with python
jjerphan Aug 12, 2021
4fba30a
CI Specify latest lib versions for linux-arm64
jjerphan Aug 31, 2021
2a8c026
DOC Fix glossary
jjerphan Aug 31, 2021
f4c5b64
Remove neighbors.DistanceMetric.__init__
jjerphan Aug 31, 2021
dfd9661
Format Parameters section
jjerphan Aug 31, 2021
afc8bf8
Improve some docstrings and comments
jjerphan Aug 31, 2021
de2dbf6
Use libc.float.DBL_MAX instead of constant defined via macro
jjerphan Aug 31, 2021
36f7b6e
Change default metric for 'fast_sqeuclidean'
jjerphan Aug 31, 2021
9e7c8a0
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Sep 1, 2021
16bf24a
Adapt error message in test
jjerphan Sep 1, 2021
5353794
Add guard against negative zeros when computing exact distances
jjerphan Sep 1, 2021
d7b984d
Introduce 'fast_euclidean' and adapt KNeighborsMixins accordingly
jjerphan Sep 1, 2021
6427883
[WIP] Use 'fast_sqeuclidean' instead when possible in KNeighborsMixins
jjerphan Sep 1, 2021
c865dc6
fixup! Introduce 'fast_euclidean' and adapt KNeighborsMixins accordingly
jjerphan Sep 2, 2021
be6741b
fixup! [WIP] Use 'fast_sqeuclidean' instead when possible in KNeighbo…
jjerphan Sep 2, 2021
e8664df
Fix NearestNeighbors docstring for Numpydoc
jjerphan Sep 2, 2021
339ab30
Use metric="fast_sqeuclidean" for pairwise_distances_argmin internal …
jjerphan Sep 2, 2021
f9e337c
Add n_threads on PairwiseDistancesReduction
jjerphan Sep 3, 2021
d14af8e
Pass n_threads on PairwiseDistancesReduction calls
jjerphan Sep 3, 2021
e8f0468
Factorise tests and add another for n_threads agnosticism
jjerphan Sep 3, 2021
7e5775c
Add docstring for RadiusNeighborhood.compute and improve others
jjerphan Sep 3, 2021
2d267b4
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Sep 3, 2021
e9803de
Fix conjugation
jjerphan Sep 6, 2021
fb05746
Use correct wording
jjerphan Sep 6, 2021
5f14488
Format code à la black
jjerphan Sep 6, 2021
b71811c
Format docstring for 'auto' strategy
jjerphan Sep 6, 2021
9c819c8
Reword 'reduced distance' for 'rank-preserving surrogate distance'
jjerphan Sep 6, 2021
1ff2433
Look-up for the strategy in scikit-learn's configuration if not speci…
jjerphan Sep 6, 2021
0170af3
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Sep 6, 2021
8530267
Clarify wording in comment regarding n_threads
jjerphan Sep 8, 2021
0365c34
Apply some small reviews suggestions
jjerphan Sep 9, 2021
e754b67
Remove redundant statement
jjerphan Sep 9, 2021
fe2f8de
Do not validate X and Y for same number of dimensions
jjerphan Sep 14, 2021
4c6253e
Do not validate X and Y for same number of dimensions
jjerphan Sep 14, 2021
7c86a39
Remove checks for CSR matrices in DatasetsPair
jjerphan Sep 14, 2021
b3efe85
Rename sparse_{dist,rdist} to csr_{dist,rdist}
jjerphan Sep 14, 2021
a81c2f8
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Sep 14, 2021
a6f0e4a
fixup! Do not validate X and Y for same number of dimensions
jjerphan Sep 14, 2021
7231535
Don't use f-strings for docstrings
jjerphan Sep 14, 2021
faed7cc
Remove some checks on arrays
jjerphan Sep 23, 2021
7ca3d5e
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Sep 23, 2021
66b60b8
Compute squared euclidean norm for rows in parallel
jjerphan Sep 24, 2021
90cb9fd
Validate arrays for C-contiguity where needed
jjerphan Sep 24, 2021
1ae16d7
Add tests for Neighbors-mixins subclasses
jjerphan Sep 27, 2021
e9dfc95
Xfail test on another numerical edge-case
jjerphan Sep 27, 2021
2dcac3f
Use PyArray_SetBaseObject via NumPy Cython API
jjerphan Sep 27, 2021
c5524f3
Apply review suggestions
jjerphan Sep 28, 2021
eda2b26
Revert whats_new entry.
jjerphan Sep 14, 2021
952d41a
Test the fast euclidean overriding
jjerphan Sep 29, 2021
86c0d6f
Mention distances computations and their reduction in dedicated method
jjerphan Sep 29, 2021
bd5a1db
fixup! Apply review suggestions
jjerphan Sep 29, 2021
1e2181d
Tight self.neigh_{indices,distances}'s lifetime to their composite
jjerphan Sep 29, 2021
3052db3
Adapt docstrings
jjerphan Sep 29, 2021
a758fc1
Factor compute in base class
jjerphan Sep 29, 2021
3dbe038
Do not use frenchism
jjerphan Sep 29, 2021
e8deb0f
Test fast metric alternatives fallbacks
jjerphan Sep 30, 2021
5911f1c
fixup! Test fast metric alternatives fallbacks
jjerphan Sep 30, 2021
9b9fb7c
Change warning message
jjerphan Sep 30, 2021
5fc91e1
Document and reword interfaces
jjerphan Oct 5, 2021
f75f08e
Better adapt the strategy for uniform weighting
jjerphan Oct 5, 2021
b484320
fixup! Better adapt the strategy for uniform weighting
jjerphan Oct 5, 2021
c134674
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Oct 5, 2021
79786d0
Make pytest happy with proper checks on array-likes
jjerphan Oct 5, 2021
cbf40ea
Optimize squared norms' computations
jjerphan Oct 7, 2021
d25021e
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Oct 7, 2021
55af187
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Oct 12, 2021
bf37c3d
Import backport to avoid runtime introspection
jjerphan Oct 12, 2021
5660c0e
Clarify comments for sorting arrays
jjerphan Oct 12, 2021
072de9e
Correct indentation
jjerphan Oct 12, 2021
e191be2
Prefer "surrogate distance" as a naming
jjerphan Oct 12, 2021
775a10d
Use better notations for maths
jjerphan Oct 12, 2021
dab0d4c
Clarify the fast specialized alternative internals in the docstring
jjerphan Oct 12, 2021
b423d6a
Some more doc-string
jjerphan Oct 15, 2021
f6f76ce
Clean post-merge Circle CI script
jjerphan Oct 15, 2021
4078b0d
Doc Add whats_new entry for #20254
jjerphan Oct 15, 2021
ceff923
Better explain PairwiseDistancesArgKmin datastructures usage
jjerphan Oct 18, 2021
51caae2
fixup! Doc Add whats_new entry for #20254
jjerphan Oct 18, 2021
8613fd6
Better explain PairwiseDistancesRadiusNeighborhood behavior
jjerphan Oct 18, 2021
8b5e0b6
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Oct 18, 2021
4bf7eee
Move changelog entry under the Miscellaneous section
jjerphan Oct 18, 2021
7fa4a40
Address review comments
jjerphan Oct 18, 2021
4a89d7f
Address review comments
jjerphan Oct 18, 2021
8f63e01
Fix config for 'pairwise_dist_chunk_size'
jjerphan Oct 20, 2021
2ad33ec
Delay and better scope arrays C ordering
jjerphan Oct 20, 2021
eba6f03
Simplify counting for remainder chunks
jjerphan Oct 23, 2021
34468ad
Better motivate heaps' parallel allocation
jjerphan Oct 25, 2021
fa424a4
Remove PairwiseDistancesRadiusNeighborhood
jjerphan Oct 25, 2021
5678666
Remove DatasetsPair used for sparse datasets
jjerphan Oct 25, 2021
45c7f6e
Add some general notes about the implementations
jjerphan Oct 26, 2021
0b85167
fixup! Remove PairwiseDistancesRadiusNeighborhood
jjerphan Oct 26, 2021
e7b0689
Turn off finitness checks in pairwise_distances_argmin{,_min}
jjerphan Oct 26, 2021
8d2a3d2
Improve test for pairwise_distances_argmin{,_min}
jjerphan Oct 26, 2021
843a894
Update whats_new entry
jjerphan Oct 26, 2021
effd897
Check for consistency when X_train is the query
jjerphan Oct 27, 2021
00577c5
Inject placeholder value for MeanShift.bandwidth
jjerphan Oct 27, 2021
b1338d5
Merge branch 'main' into pairwise-distances-argkmin
ogrisel Nov 2, 2021
4c3dd1f
Merge branch 'main' into pairwise-distances-argkmin
jjerphan Nov 22, 2021
eab07b5
Rename PairwiseDistancesReduction callbacks
jjerphan Nov 22, 2021
8a48ffd
Link back to _openmp_effective_n_threads for n_threads' description
jjerphan Nov 22, 2021
83854fa
Use self.k directly
jjerphan Nov 22, 2021
445c860
Remove unneeded csr_dist and csr_rdist interfaces
jjerphan Nov 22, 2021
6a4d7fe
Add pairwise_dist_chunk_size keyword argument to config_context
jjerphan Nov 22, 2021
0ba2e39
Merge branch 'main' into pairwise-distances-argkmin
jjerphan Nov 24, 2021
5399cc6
fixup! Rename PairwiseDistancesReduction callbacks
jjerphan Nov 24, 2021
d40b333
TST Refactor test and adapt checks and tolerances
jjerphan Nov 25, 2021
2f02350
TST Factorise fixtures for metric params
jjerphan Nov 25, 2021
19dd7ca
Change metric to fast_sqeuclidean for pairwise_distances_argmin*
jjerphan Nov 25, 2021
54b4b96
TST Remove spurious skip for tests
jjerphan Nov 25, 2021
56e86ef
TST Remove useless guard for haversine
jjerphan Nov 26, 2021
355cbe2
DOC Clarify docstrings and comments
jjerphan Nov 30, 2021
56151ab
MAINT Drop unneeded Cython directive
jjerphan Nov 30, 2021
96aaa0b
MAINT Better validate and use chunk_size
jjerphan Nov 30, 2021
9d5f7f7
MAINT Raise UserWarning when uneeded metric_params are specified
jjerphan Nov 30, 2021
5fa4cb1
Correctly fallback on standard metric
jjerphan Nov 30, 2021
7c36f11
TST Remove uneeded tests and use adapted version parsing
jjerphan Nov 30, 2021
d2396a7
Rename variable and fix docstring for the simultaneous swap
jjerphan Nov 30, 2021
02a0e92
fixup! MAINT Raise UserWarning when uneeded metric_params are specified
jjerphan Nov 30, 2021
e2e5282
fixup! DOC Clarify docstrings and comments
jjerphan Nov 30, 2021
1b77d2f
fixup! MAINT Drop unneeded Cython directive
jjerphan Nov 30, 2021
94facea
Merge branch 'main' into pairwise-distances-argkmin
jjerphan Nov 30, 2021
bba8976
Merge branch 'main' into pairwise-distances-argkmin
jjerphan Dec 1, 2021
f9037f0
[WIP] Rework PairwiseDistancesArgKmin.compute
jjerphan Dec 1, 2021
6983c32
DOC Add notes for `PairwiseDistancesArgKmin.compute`
jjerphan Dec 1, 2021
449545c
DOC Better word
jjerphan Dec 1, 2021
a6f9f9c
Merge branch 'main' into pairwise-distances-argkmin
jjerphan Dec 15, 2021
0915a36
Remove 'fast_sqeuclidean' and 'fast_euclidean'
jjerphan Dec 15, 2021
09873f1
Merge pull request #6 from jjerphan/remove-fast-alternatives
jjerphan Dec 15, 2021
048b958
Revert uneeded changes
jjerphan Dec 15, 2021
720b807
Merge branch 'pairwise-distances-argkmin' into pairwise-distances-arg…
jjerphan Dec 16, 2021
7485f44
DOC Correct typos
jjerphan Dec 16, 2021
0f7a4e7
Merge branch 'main' into pairwise-distances-argkmin
jjerphan Dec 21, 2021
7b6b399
Choose the strategy at initialisation
jjerphan Dec 21, 2021
5749c02
TST Adapt for the new `compute` interface
jjerphan Dec 21, 2021
2300c5e
Distinguish between effective and available threads
jjerphan Dec 21, 2021
a6ab85f
DOC Add and correct comments
jjerphan Dec 21, 2021
7435885
Update sklearn/metrics/_pairwise_distances_reduction.pyx
jjerphan Dec 21, 2021
27329c1
fixup! DOC Add and correct comments
jjerphan Dec 21, 2021
8754998
FIX Change strategy default value to None
jjerphan Dec 21, 2021
9cda95f
MAINT Rename n_threads variables and document them.
jjerphan Dec 22, 2021
5d77467
DOC Improve remark regarding support for discrete distance metrics
jjerphan Dec 22, 2021
af5a991
MAINT Remove useless private _compute method
jjerphan Dec 23, 2021
8a0771a
Merge pull request #5 from jjerphan/pairwise-distances-argkmin-raii
jjerphan Dec 23, 2021
dd1dde3
Merge branch 'main' into pairwise-distances-argkmin
jjerphan Dec 23, 2021
decb029
Remove duplicated submodules registration
jjerphan Dec 23, 2021
527555d
fixup! Remove duplicated submodules registration
jjerphan Dec 23, 2021
c265944
Merge branch 'main' into pairwise-distances-argkmin
jjerphan Jan 24, 2022
25446ef
Merge branch 'main' into pairwise-distances-argkmin
ogrisel Feb 11, 2022
b11109c
Move changelog entry at the top
ogrisel Feb 11, 2022
5d6c9c2
Merge branch 'main' into pairwise-distances-argkmin
ogrisel Feb 15, 2022
28 changes: 28 additions & 0 deletions doc/whats_new/v1.1.rst
Expand Up @@ -65,6 +65,34 @@ Changelog
:pr:`123456` by :user:`Joe Bloggs <joeongithub>`.
where 123456 is the *pull request* number, not the issue number.

- |Efficiency| Low-level routines for reductions on pairwise distances
for dense float64 datasets have been refactored. The following functions
and estimators now benefit from improved performance, in particular on
multi-core machines:
- :func:`sklearn.metrics.pairwise_distances_argmin`
- :func:`sklearn.metrics.pairwise_distances_argmin_min`
- :class:`sklearn.cluster.AffinityPropagation`
- :class:`sklearn.cluster.Birch`
- :class:`sklearn.cluster.MeanShift`
- :class:`sklearn.cluster.OPTICS`
- :class:`sklearn.cluster.SpectralClustering`
- :func:`sklearn.feature_selection.mutual_info_regression`
- :class:`sklearn.neighbors.KNeighborsClassifier`
- :class:`sklearn.neighbors.KNeighborsRegressor`
- :class:`sklearn.neighbors.LocalOutlierFactor`
- :class:`sklearn.neighbors.NearestNeighbors`
- :class:`sklearn.manifold.Isomap`
- :class:`sklearn.manifold.LocallyLinearEmbedding`
- :class:`sklearn.manifold.TSNE`
- :func:`sklearn.manifold.trustworthiness`
- :class:`sklearn.semi_supervised.LabelPropagation`
- :class:`sklearn.semi_supervised.LabelSpreading`

For instance :meth:`sklearn.neighbors.NearestNeighbors.kneighbors`
can be up to 20× faster than in previous versions.

:pr:`21462` by :user:`Julien Jerphanion <jjerphan>`.

- |Enhancement| All scikit-learn models now generate a more informative
error message when some input contains unexpected `NaN` or infinite values.
In particular the message contains the input name ("X", "y" or
Expand Down
30 changes: 28 additions & 2 deletions sklearn/_config.py
Expand Up @@ -9,6 +9,9 @@
    "working_memory": int(os.environ.get("SKLEARN_WORKING_MEMORY", 1024)),
    "print_changed_only": True,
    "display": "text",
    "pairwise_dist_chunk_size": int(
        os.environ.get("SKLEARN_PAIRWISE_DIST_CHUNK_SIZE", 256)
    ),
}
_threadlocal = threading.local()

Expand Down Expand Up @@ -40,7 +43,11 @@ def get_config():


def set_config(
assume_finite=None, working_memory=None, print_changed_only=None, display=None
assume_finite=None,
working_memory=None,
print_changed_only=None,
display=None,
pairwise_dist_chunk_size=None,
):
"""Set global scikit-learn configuration

Expand Down Expand Up @@ -80,6 +87,12 @@ def set_config(

        .. versionadded:: 0.23

    pairwise_dist_chunk_size : int, default=None
        The number of vectors per chunk for PairwiseDistancesReduction.
        The global default is 256, which suits the caches and architectures
        of most modern laptops. If None, the existing value is left unchanged.

        .. versionadded:: 1.1

See Also
--------
config_context : Context manager for global scikit-learn configuration.
Expand All @@ -95,11 +108,18 @@ def set_config(
        local_config["print_changed_only"] = print_changed_only
    if display is not None:
        local_config["display"] = display
    if pairwise_dist_chunk_size is not None:
        local_config["pairwise_dist_chunk_size"] = pairwise_dist_chunk_size


@contextmanager
def config_context(
*, assume_finite=None, working_memory=None, print_changed_only=None, display=None
*,
assume_finite=None,
working_memory=None,
print_changed_only=None,
display=None,
pairwise_dist_chunk_size=None,
):
"""Context manager for global scikit-learn configuration.

Expand Down Expand Up @@ -138,6 +158,12 @@ def config_context(

        .. versionadded:: 0.23

    pairwise_dist_chunk_size : int, default=None
        The number of vectors per chunk for PairwiseDistancesReduction.
        The global default is 256, which suits the caches and architectures
        of most modern laptops. If None, the existing value is left unchanged.

        .. versionadded:: 1.1

Yields
------
None.
Expand Down
21 changes: 21 additions & 0 deletions sklearn/metrics/_dist_metrics.pxd
Expand Up @@ -64,3 +64,24 @@ cdef class DistanceMetric:
cdef DTYPE_t _rdist_to_dist(self, DTYPE_t rdist) nogil except -1

cdef DTYPE_t _dist_to_rdist(self, DTYPE_t dist) nogil except -1


######################################################################
# DatasetsPair base class
cdef class DatasetsPair:
    cdef DistanceMetric distance_metric

    cdef ITYPE_t n_samples_X(self) nogil

    cdef ITYPE_t n_samples_Y(self) nogil

    cdef DTYPE_t dist(self, ITYPE_t i, ITYPE_t j) nogil

    cdef DTYPE_t surrogate_dist(self, ITYPE_t i, ITYPE_t j) nogil


cdef class DenseDenseDatasetsPair(DatasetsPair):
    cdef:
        const DTYPE_t[:, ::1] X
        const DTYPE_t[:, ::1] Y
        ITYPE_t d
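The declarations above expose both `dist` and `surrogate_dist`. As a rough illustration of why a rank-preserving surrogate is enough for neighbor searches, the following plain-Python sketch compares the ordering given by the Euclidean distance with the ordering given by its square. The function names and the sample points are hypothetical and are not part of the PR.

```python
import math


def sq_euclidean(x, y):
    """Squared Euclidean distance: cheaper because it skips the square root."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    """Exact Euclidean distance, recovered from the surrogate."""
    return math.sqrt(sq_euclidean(x, y))


def rank_by(dist, query, points):
    """Indices of `points` sorted by increasing distance to `query`."""
    return sorted(range(len(points)), key=lambda j: dist(query, points[j]))


query = (0.0, 0.0)
points = [(3.0, 4.0), (1.0, 1.0), (0.0, 2.0), (5.0, 0.5)]

order_exact = rank_by(euclidean, query, points)
order_surrogate = rank_by(sq_euclidean, query, points)
```

Both orderings are identical because the square root is strictly increasing on non-negative values, so the square root only has to be applied to the few distances that are finally returned.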
189 changes: 180 additions & 9 deletions sklearn/metrics/_dist_metrics.pyx
Expand Up @@ -4,6 +4,8 @@

import numpy as np
cimport numpy as np
from cython cimport final

np.import_array() # required in order to use C-API


Expand All @@ -23,10 +25,10 @@ cdef inline np.ndarray _buffer_to_ndarray(const DTYPE_t* x, np.npy_intp n):
return PyArray_SimpleNewFromData(1, &n, DTYPECODE, <void*>x)


# some handy constants
from libc.math cimport fabs, sqrt, exp, pow, cos, sin, asin
cdef DTYPE_t INF = np.inf

from scipy.sparse import csr_matrix, issparse
from ..utils._typedefs cimport DTYPE_t, ITYPE_t, DITYPE_t, DTYPECODE
from ..utils._typedefs import DTYPE, ITYPE
from ..utils._readonly_array_wrapper import ReadonlyArrayWrapper
Expand Down Expand Up @@ -68,6 +70,17 @@ METRIC_MAPPING = {'euclidean': EuclideanDistance,
'haversine': HaversineDistance,
'pyfunc': PyFuncDistance}

BOOL_METRICS = [
    "hamming",
    "matching",
    "jaccard",
    "dice",
    "kulsinski",
    "rogerstanimoto",
    "russellrao",
    "sokalmichener",
    "sokalsneath",
]

def get_valid_metric_ids(L):
"""Given an iterable of metric class names or class identifiers,
Expand Down Expand Up @@ -199,8 +212,8 @@ cdef class DistanceMetric:
"""
def __cinit__(self):
self.p = 2
self.vec = np.zeros(1, dtype=DTYPE, order='c')
self.mat = np.zeros((1, 1), dtype=DTYPE, order='c')
self.vec = np.zeros(1, dtype=DTYPE, order='C')
self.mat = np.zeros((1, 1), dtype=DTYPE, order='C')
self.size = 1

def __reduce__(self):
Expand Down Expand Up @@ -299,8 +312,9 @@ cdef class DistanceMetric:
This can optionally be overridden in a subclass.

The rank-preserving surrogate distance is any measure that yields the same
rank as the distance, but is more efficient to compute. For example, for the
Euclidean metric, the surrogate distance is the squared-euclidean distance.
rank as the distance, but is more efficient to compute. For example, the
rank-preserving surrogate distance of the Euclidean metric is the
squared-euclidean distance.
"""
return self.dist(x1, x2, size)

Expand Down Expand Up @@ -336,8 +350,9 @@ cdef class DistanceMetric:
"""Convert the rank-preserving surrogate distance to the distance.

The surrogate distance is any measure that yields the same rank as the
distance, but is more efficient to compute. For example, for the
Euclidean metric, the surrogate distance is the squared-euclidean distance.
distance, but is more efficient to compute. For example, the
rank-preserving surrogate distance of the Euclidean metric is the
squared-euclidean distance.

Parameters
----------
Expand All @@ -355,8 +370,9 @@ cdef class DistanceMetric:
"""Convert the true distance to the rank-preserving surrogate distance.

The surrogate distance is any measure that yields the same rank as the
distance, but is more efficient to compute. For example, for the
Euclidean metric, the surrogate distance is the squared-euclidean distance.
distance, but is more efficient to compute. For example, the
rank-preserving surrogate distance of the Euclidean metric is the
squared-euclidean distance.

Parameters
----------
Expand Down Expand Up @@ -1191,3 +1207,158 @@ cdef class PyFuncDistance(DistanceMetric):

cdef inline double fmax(double a, double b) nogil:
return max(a, b)


######################################################################
# Datasets Pair Classes
cdef class DatasetsPair:
    """Abstract class which wraps a pair of datasets (X, Y).

    This class allows computing distances between a single pair of rows
    of X and Y at a time given the pair of their indices (i, j). This class is
    specialized for each metric thanks to the :meth:`get_for` factory classmethod.

    The handling of parallelization over chunks to compute the distances
    and aggregation for several rows at a time is done in dedicated
    subclasses of PairwiseDistancesReduction that in turn rely on
    subclasses of DatasetsPair for each pair of rows in the data. The goal
    is to make it possible to decouple the generic parallelization and
    aggregation logic from metric-specific computation as much as
    possible.

    X and Y can be stored as np.ndarrays or CSR matrices in subclasses.

    This class avoids the overhead of dispatching distance computations
    to :class:`sklearn.metrics.DistanceMetric` based on the physical
    representation of the vectors (sparse vs. dense). It makes use of
    cython.final to remove the overhead of dispatching method calls.

    Parameters
    ----------
    distance_metric : DistanceMetric
        The distance metric responsible for computing distances
        between two vectors of (X, Y).
    """

    @classmethod
    def get_for(
        cls,
        X,
        Y,
        str metric="euclidean",
        dict metric_kwargs=None,
    ) -> DatasetsPair:
        """Return the DatasetsPair implementation for the given arguments.

        Parameters
        ----------
        X : {ndarray, sparse matrix} of shape (n_samples_X, n_features)
            Input data.
            If provided as a ndarray, it must be C-contiguous.
            If provided as a sparse matrix, it must be in CSR format.

        Y : {ndarray, sparse matrix} of shape (n_samples_Y, n_features)
            Input data.
            If provided as a ndarray, it must be C-contiguous.
            If provided as a sparse matrix, it must be in CSR format.

        metric : str, default='euclidean'
            The distance metric to use for argkmin. The default metric is
            a fast implementation of the standard Euclidean metric.
            For a list of available metrics, see the documentation of
            :class:`~sklearn.metrics.DistanceMetric`.

        metric_kwargs : dict, default=None
            Keyword arguments to pass to specified metric function.

        Returns
        -------
        datasets_pair : DatasetsPair
            The suitable DatasetsPair implementation.
        """
        cdef:
            DistanceMetric distance_metric = DistanceMetric.get_metric(
                metric,
                **(metric_kwargs or {})
            )

        if X.dtype != np.float64 or Y.dtype != np.float64:
            raise ValueError("Only 64bit float datasets are supported for X and Y.")

        # Metric-specific checks that do not replace nor duplicate `check_array`.
        distance_metric._validate_data(X)
        distance_metric._validate_data(Y)

        if issparse(X) or issparse(Y):
            raise ValueError("Only dense datasets are supported for X and Y.")

        return DenseDenseDatasetsPair(X, Y, distance_metric)

    @classmethod
    def unpack_csr_matrix(cls, X: csr_matrix):
        """Unpack a CSR matrix, casting its indices and indptr arrays to ITYPE."""
        X_data = np.asarray(X.data, dtype=DTYPE)
        X_indices = np.asarray(X.indices, dtype=ITYPE)
        X_indptr = np.asarray(X.indptr, dtype=ITYPE)
        return X_data, X_indices, X_indptr

    def __init__(self, DistanceMetric distance_metric):
        self.distance_metric = distance_metric

    cdef ITYPE_t n_samples_X(self) nogil:
        """Number of samples in X."""
        # Placeholder value: concrete subclasses must override this method.
        return -999

    cdef ITYPE_t n_samples_Y(self) nogil:
        """Number of samples in Y."""
        # Placeholder value: concrete subclasses must override this method.
        return -999

    cdef DTYPE_t surrogate_dist(self, ITYPE_t i, ITYPE_t j) nogil:
        # By default, the surrogate distance is the distance itself.
        return self.dist(i, j)

    cdef DTYPE_t dist(self, ITYPE_t i, ITYPE_t j) nogil:
        # Placeholder value: concrete subclasses must override this method.
        return -1

@final
cdef class DenseDenseDatasetsPair(DatasetsPair):
    """Compute distances between vectors of two arrays.

    Parameters
    ----------
    X : ndarray of shape (n_samples_X, n_features)
        Rows represent vectors. Must be C-contiguous.

    Y : ndarray of shape (n_samples_Y, n_features)
        Rows represent vectors. Must be C-contiguous.

    distance_metric : DistanceMetric
        The distance metric responsible for computing distances
        between two vectors of (X, Y).
    """

    def __init__(self, X, Y, DistanceMetric distance_metric):
        super().__init__(distance_metric)
        # Arrays have already been checked
        self.X = X
        self.Y = Y
        self.d = X.shape[1]

    @final
    cdef ITYPE_t n_samples_X(self) nogil:
        return self.X.shape[0]

    @final
    cdef ITYPE_t n_samples_Y(self) nogil:
        return self.Y.shape[0]

    @final
    cdef DTYPE_t surrogate_dist(self, ITYPE_t i, ITYPE_t j) nogil:
        return self.distance_metric.rdist(&self.X[i, 0],
                                          &self.Y[j, 0],
                                          self.d)

    @final
    cdef DTYPE_t dist(self, ITYPE_t i, ITYPE_t j) nogil:
        return self.distance_metric.dist(&self.X[i, 0],
                                         &self.Y[j, 0],
                                         self.d)
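To make the division of labor described in the `DatasetsPair` docstring more concrete, here is a small pure-Python sketch of an argkmin reduction built on top of a pair wrapper. It is an assumption-laden illustration and not the PR's Cython implementation: `DenseDatasetsPairSketch` and `argkmin` are hypothetical names, the metric is hard-coded to Euclidean, and the chunks are visited sequentially whereas the PR parallelizes over them.

```python
import heapq
import math


class DenseDatasetsPairSketch:
    """Hypothetical pure-Python stand-in for DenseDenseDatasetsPair."""

    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

    def n_samples_X(self):
        return len(self.X)

    def n_samples_Y(self):
        return len(self.Y)

    def surrogate_dist(self, i, j):
        # Squared Euclidean distance between X[i] and Y[j].
        return sum((a - b) ** 2 for a, b in zip(self.X[i], self.Y[j]))

    def dist(self, i, j):
        return math.sqrt(self.surrogate_dist(i, j))


def argkmin(pair, k, chunk_size=256):
    """For each row of X, return indices and distances of its k nearest rows of Y."""
    indices, distances = [], []
    n_Y = pair.n_samples_Y()
    for i in range(pair.n_samples_X()):
        # Bounded max-heap emulated with negated keys: heap[0] is the worst
        # candidate currently kept.
        heap = []
        for start in range(0, n_Y, chunk_size):
            for j in range(start, min(start + chunk_size, n_Y)):
                item = (-pair.surrogate_dist(i, j), -j)
                if len(heap) < k:
                    heapq.heappush(heap, item)
                elif item > heap[0]:
                    # Strictly better than the worst kept candidate.
                    heapq.heapreplace(heap, item)
        best = sorted((-neg_d, -neg_j) for neg_d, neg_j in heap)
        indices.append([j for _, j in best])
        # Only the k retained surrogate distances are converted back.
        distances.append([math.sqrt(d) for d, _ in best])
    return indices, distances


X = [(0.0, 0.0), (10.0, 10.0)]
Y = [(1.0, 0.0), (0.0, 3.0), (9.0, 10.0), (10.0, 12.0), (0.0, 0.5)]
knn_indices, knn_distances = argkmin(DenseDatasetsPairSketch(X, Y), k=2, chunk_size=2)
```

The reduction only ever calls `surrogate_dist(i, j)` on the wrapper, so swapping the metric or the storage format would not require touching `argkmin`; this mirrors the decoupling the docstring describes, under the simplifications stated above.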