ENH Private Cython Submodule for Reduction over Pairwise Distances #20254

Closed
wants to merge 261 commits into from
261 commits
cb85791
Adapt cython submodule for heaps
jjerphan Jun 23, 2021
5abda94
Reintroduce deleted test_neighbors_heap
jjerphan Jun 23, 2021
bc8925e
Lint
jjerphan Jun 23, 2021
e2bb562
Minify utils._heap definition file
jjerphan Jun 23, 2021
9e9065d
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Jun 24, 2021
ac76852
Post-merge black code formatting
jjerphan Jun 24, 2021
cac7313
Spread datasets for the tests of the fast_sqeuclidean strategy
jjerphan Jun 24, 2021
8a06c3f
Rectify test
jjerphan Jun 30, 2021
41bd644
[WIP] Adapting to use class hierarchy
jjerphan Jun 30, 2021
568ed2a
[WIP] Adapting to use class hierarchy
jjerphan Jun 30, 2021
2bf34aa
[WIP] Adapting to use class hierarchy
jjerphan Jun 30, 2021
c1415d6
[WIP] Adapting to use class hierarchy
jjerphan Jun 30, 2021
80aaf0b
[WIP] Adapting to use class hierarchy
jjerphan Jun 30, 2021
49e247d
[WIP] Adapting to use class hierarchy
jjerphan Jun 30, 2021
25c9a2c
fixup! [WIP] Adapting to use class hierarchy
jjerphan Jun 30, 2021
eb8b931
[WIP] Adapting to use class hierarchy
jjerphan Jun 30, 2021
e0d1c99
Move neighbors.DistanceMetric to metrics
jjerphan Jun 25, 2021
ac5ddc1
Rename private submodule to _parallel_reductions
jjerphan Jun 30, 2021
4e7b3cb
Introduce DistanceMetric and ArgKmin factory method
jjerphan Jun 30, 2021
d386fe1
Support DistanceMetric's and branch on ArgKmin when possible
jjerphan Jun 30, 2021
a61e81f
Document GEMM call and change wording to "approximated distance"
jjerphan Jul 1, 2021
e0d2881
Support memmapped and integral arrays
jjerphan Jul 1, 2021
b23f972
Pull _parallel_on_{X,Y} up on ParallelReduction
jjerphan Jul 1, 2021
67e02d4
Pull GEMM buffers down to FastSquaredEuclideanArgKmin
jjerphan Jul 1, 2021
a8331da
Skip tests for translation invariance
jjerphan Jul 1, 2021
0617dbd
Define methods on the base class
jjerphan Jul 1, 2021
38b97c3
Improve tests for NearestNeighbors the algorithm
jjerphan Jul 1, 2021
ef9c7f6
Propagate metric kwargs in KNeighborsMixin.kneighbors
jjerphan Jul 1, 2021
882e6e7
Add DistanceMetric data validation at initialisation
jjerphan Jul 1, 2021
15c110a
Remove warning checks for 'wminkowski' now that Scipy is not used
jjerphan Jul 1, 2021
7e3c4b7
Parametrise test_k_and_radius_neighbors_duplicates on algorithms
jjerphan Jul 1, 2021
2ec36c1
Remove uncalled snippet
jjerphan Jul 1, 2021
ad496f0
Do not branch on sparse arrays, yet
jjerphan Jul 1, 2021
3cdc476
Merge pull request #2 from jjerphan/pairwise_aggregation_cython-oop
jjerphan Jul 1, 2021
68af6a7
Document
jjerphan Jul 2, 2021
f740334
Remove unnecessary parallel_on_X thread-local datastructures
jjerphan Jul 2, 2021
0b9732b
Remove attributes for dtypes' size
jjerphan Jul 2, 2021
5c57860
Use all threads when sorting
jjerphan Jul 2, 2021
384b9a8
Rename ParallelReduction to PairwiseDistancesReduction
jjerphan Jul 2, 2021
ed03b88
Cast pointer to const value for gemm interface
jjerphan Jul 5, 2021
6c5e0b9
[WIP] Add RadiusNeighborhood
jjerphan Jul 1, 2021
e37b147
[WIP] Add RadiusNeighborhood
jjerphan Jul 1, 2021
29d145a
[WIP] Add RadiusNeighborhood
jjerphan Jul 1, 2021
dcca503
[WIP] Add RadiusNeighborhood
jjerphan Jul 1, 2021
eaedf53
Excluse some distances from valid ones
jjerphan Jul 5, 2021
70b13cd
Add temporary consistency test
jjerphan Jul 5, 2021
0f87e6f
Move results allocation from initialisation to compute
jjerphan Jul 5, 2021
633d3f4
Address review comments
jjerphan Jul 7, 2021
f5d2915
Exclude some DistanceMetrics for PairwiseDistancesReduction
jjerphan Jul 7, 2021
39c4788
Introduce utils functions for vector to ndarray coercion
jjerphan Jul 8, 2021
b58321f
Introduce parallel_on_Y for RadiusNeighborhood
jjerphan Jul 8, 2021
c24d184
Sort returned valid distances
jjerphan Jul 9, 2021
1ee3ae1
Update comments
jjerphan Jul 9, 2021
6ecf3f3
Change template for parallel_on_X
jjerphan Jul 9, 2021
4e0c465
Fix number of threads for parallel_on_Y
jjerphan Jul 9, 2021
30dcbba
Adapt test for 'brute' algorithm
jjerphan Jul 9, 2021
1a25476
Only allocate temporary buffer once
jjerphan Jul 9, 2021
93db1ee
Remove called to Argkmin.__dealloc__ in subclass'
jjerphan Jul 9, 2021
12d7dfd
[WIP] Introduce parallel_on_Y for RadiusNeighborhood
jjerphan Jul 9, 2021
937c4f3
Introduce parallel_on_Y for RadiusNeighborhood
jjerphan Jul 9, 2021
9780f28
Remove unnecessary intermediary sorts
jjerphan Jul 9, 2021
40e369d
Remove duplicate code to vector-to-ndarray coercion
jjerphan Jul 9, 2021
a3de08a
Use consistent names
jjerphan Jul 9, 2021
81b1b1b
Plug RadiusNeighborhood in RadiusNeighborsMixin.radius_neighbors
jjerphan Jul 9, 2021
b8fe6e1
Correctly free vectors using del
jjerphan Jul 12, 2021
54fb2c5
Use a sentinel for managing vectors' memory
jjerphan Jul 12, 2021
9df4047
Remove temporary consistency test
jjerphan Jul 12, 2021
7c713a1
Add comments
jjerphan Jul 12, 2021
7ea6daa
Merge pull request #3 from jjerphan/pairwise_aggregation_cython-radius
jjerphan Jul 13, 2021
8e54572
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Jul 13, 2021
1d3336d
Revert to 'euclidean' when 'fast_sqeuclidean' can't be used
jjerphan Jul 13, 2021
d96b163
Use 'fast_sqeuclidean' for Birch internals
jjerphan Jul 13, 2021
f660dbf
Fix PairwiseDistancesReduction.is_usable
jjerphan Jul 13, 2021
bb24d95
Make array C-ordered for test
jjerphan Jul 13, 2021
6eea1aa
Annotate with cython.final when relevant
jjerphan Jul 13, 2021
15c4150
Black contains all the color that I like
jjerphan Jul 15, 2021
a9706d6
Use relative imports
jjerphan Jul 15, 2021
0935975
Use method for flatten
jjerphan Jul 15, 2021
e2b5398
Correct cross-referencing for metrics.DistanceMetric
jjerphan Jul 15, 2021
ce1ccdc
Precise that p is the parameter used by 'minkowski'
jjerphan Jul 15, 2021
a9fe71f
Prefer assert_allclose over assert_array_equal
jjerphan Jul 20, 2021
1593dab
Prefer csr_matrix.toarray over csr_matrix.A
jjerphan Jul 20, 2021
01c1294
Rework test for correct behavior regarding the radius
jjerphan Jul 20, 2021
4b6a041
Inline heap pushes
jjerphan Jul 20, 2021
c26b583
Mirror ValueError for incorrectly set sort_results and return_distances
jjerphan Jul 20, 2021
5bcea9f
Parametrise test_radius_neighbors_graph_sparse
jjerphan Jul 20, 2021
474804c
Lighten and correct test
jjerphan Jul 21, 2021
e84ff1b
Allow other dtypes than np.float64
jjerphan Jul 21, 2021
ac2ce70
Parametrise test_kneighbors_graph_sparse
jjerphan Jul 21, 2021
9f612a1
fixup! Remove uncalled snippet
jjerphan Jul 21, 2021
74240bd
Mark a test case as xfail for test_fast_sqeuclidean_correctness
jjerphan Jul 21, 2021
32b08af
fixup! Inline heap pushes
jjerphan Jul 21, 2021
8cfabc8
fixup! Adapt cython submodule for heaps
jjerphan Jul 21, 2021
dc1079f
Do not push if the element is identical to the largest
jjerphan Jul 21, 2021
f5308b0
Remove X and Y from _reduce_on_chunks signature
jjerphan Jul 12, 2021
19d461f
Introduce DistanceMetric.sparse_{rdist,dist}
jjerphan Jul 22, 2021
444c4bc
Introduce DistanceComputer
jjerphan Jul 21, 2021
cf24fd0
Adapt tests for behavior on duplicates
jjerphan Jul 21, 2021
e7aaa71
Free datastructure if present and do not raise otherwise
jjerphan Jul 22, 2021
718a4c2
Adapt doctest to the new behavior
jjerphan Jul 22, 2021
b70db13
Monkey-patch neighbors to alias metrics.DistanceMetric
jjerphan Jul 22, 2021
39e8189
Revert "Make array C-ordered for test"
jjerphan Jul 22, 2021
71f0130
Makes test parametrisation execution order fixed
jjerphan Jul 22, 2021
cd7e1a2
Introduce _openmp_thread_num in helpers
jjerphan Jul 22, 2021
29e5c7b
Reorder macros, cython and python imports
jjerphan Jul 22, 2021
dbbb8e9
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Jul 22, 2021
550bb76
Improve comments for DistanceComputer
jjerphan Jul 27, 2021
0d5480c
Use 'proxy' instead for 'approx' as a wording
jjerphan Jul 27, 2021
305f007
Follow PEP 257
jjerphan Jul 27, 2021
ba89532
Fix RadiusNeighborhood.__dealloc__
jjerphan Jul 27, 2021
3b1e98c
Rename DistanceComputer for DatasetPairs
jjerphan Jul 27, 2021
7ae5091
Minimalistically document template methods
jjerphan Jul 27, 2021
2f9dc02
Reallocate datastructures for results at each new call
jjerphan Jul 27, 2021
4f3bd4c
Avoid thread over-subscription for BLAS
jjerphan Jul 27, 2021
9d6b83b
Introduce FastSquaredEuclideanRadiusNeighborhood
jjerphan Jul 28, 2021
a319358
Adapt internal checks
jjerphan Jul 28, 2021
3ebb200
Add test suite for PairwiseDistancesReduction
jjerphan Jul 28, 2021
69b0ad9
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Jul 28, 2021
f63692a
Merge vectors at the really end using dynamic scheduling
jjerphan Jul 28, 2021
5f1d3a0
Pull the distance metric up and make it readonly
jjerphan Jul 28, 2021
bd81245
Use proxy distance for RadiusNeighborhood reduction
jjerphan Jul 28, 2021
2e48051
fixup! Correct cross-referencing for metrics.DistanceMetric
jjerphan Jul 29, 2021
53ba89d
Improve inputs checks
jjerphan Jul 29, 2021
1632e14
Rename submodule for pairwise distances reductions
jjerphan Jul 29, 2021
906b1e4
Move DatasetsPair closer to DistanceMetrics
jjerphan Jul 29, 2021
eb988ea
Add const qualifier on squared norms memoryviews
jjerphan Jul 29, 2021
9fc5e95
Improve style
jjerphan Jul 29, 2021
4f73848
Improve docstring notse for FastSquaredEuclidean alternatives
jjerphan Jul 29, 2021
ea3d791
Fix __cinit__
jjerphan Jul 29, 2021
445a22d
Skip for tests for 32 bits
jjerphan Jul 29, 2021
2647327
Adapt tests for better parametrisation variability
jjerphan Jul 29, 2021
420dbc8
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Aug 4, 2021
4b2c165
Adaptation for DistanceMetrics
jjerphan Aug 4, 2021
3c71fd6
Improve style
jjerphan Aug 4, 2021
eaa892f
Parametrise tests by seed last
jjerphan Aug 4, 2021
7d5f603
Lighten tests parametrization
jjerphan Aug 4, 2021
91e3b27
Use squared norms of X vectors in FastSquaredEuclideanArgKmin
jjerphan Aug 4, 2021
c4add77
fixup! Adapt tests for better parametrisation variability
jjerphan Aug 4, 2021
caa7faa
Simplify tests
jjerphan Aug 4, 2021
c48c886
Improve test parametrisation on DistanceMetric
jjerphan Aug 4, 2021
4f06c3a
Improve heap routines' interfaces
jjerphan Aug 4, 2021
962f535
fixup! Improve heap routines' interfaces
jjerphan Aug 4, 2021
1b71282
Fix docstring
jjerphan Aug 4, 2021
dc8ddf4
Add missing dtype for indices
jjerphan Aug 4, 2021
5505f4f
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Aug 4, 2021
bd1b0d9
Better deprecate neighbors.DistanceMetric
jjerphan Aug 4, 2021
c0dbc97
Add link to CPython docs regarding reference stealing
jjerphan Aug 5, 2021
2b0d3a6
Force the coretype to be armv8 on linux-arm64
jjerphan Aug 5, 2021
fb3866c
Revert "Force the coretype to be armv8 on linux-arm64"
jjerphan Aug 6, 2021
e55fd94
Use conda-forge to test arm64
jjerphan Aug 6, 2021
50d2669
Use Mambaforge instead
jjerphan Aug 6, 2021
84c4315
Install all dependencies in a row via mamba
jjerphan Aug 6, 2021
2411ffc
Mark tests as xfail when in unstable OpenBLAS configuration
jjerphan Aug 6, 2021
b7bbd06
Lighten tests' parametrizations
jjerphan Aug 6, 2021
042e228
Improve checks for unstable OpenBLAS configuration
jjerphan Aug 6, 2021
394d9dc
fixup! Use Mambaforge instead
jjerphan Aug 6, 2021
b2d80dc
Check against now made privated in_unstable_openblas_configuration
jjerphan Aug 6, 2021
ae097cf
fixup! Use conda-forge to test arm64
jjerphan Aug 6, 2021
1821aad
Remove SparseEfficiencyWarning
jjerphan Aug 11, 2021
ca942a5
Apply suggestions from reviews comments and discussions
jjerphan Aug 11, 2021
89f909c
Correct string alignement
jjerphan Aug 11, 2021
6887a37
Improve comment for 32 bits fallback
jjerphan Aug 11, 2021
0832dc4
Remove some checks and interactions with python
jjerphan Aug 12, 2021
4fba30a
CI Specify latest lib versions for linux-arm64
jjerphan Aug 31, 2021
2a8c026
DOC Fix glossary
jjerphan Aug 31, 2021
f4c5b64
Remove neighbors.DistanceMetric.__init__
jjerphan Aug 31, 2021
dfd9661
Format Parameters section
jjerphan Aug 31, 2021
afc8bf8
Improve some docstrings and comments
jjerphan Aug 31, 2021
de2dbf6
Use libc.float.DBL_MAX instead of constant defined via macro
jjerphan Aug 31, 2021
36f7b6e
Change default metric for 'fast_sqeuclidean'
jjerphan Aug 31, 2021
9e7c8a0
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Sep 1, 2021
16bf24a
Adapt error message in test
jjerphan Sep 1, 2021
5353794
Add guard against negative zeros when computing exact distances
jjerphan Sep 1, 2021
d7b984d
Introduce 'fast_euclidean' and adapt KNeighborsMixins accordingly
jjerphan Sep 1, 2021
6427883
[WIP] Use 'fast_sqeuclidean' instead when possible in KNeighborsMixins
jjerphan Sep 1, 2021
c865dc6
fixup! Introduce 'fast_euclidean' and adapt KNeighborsMixins accordingly
jjerphan Sep 2, 2021
be6741b
fixup! [WIP] Use 'fast_sqeuclidean' instead when possible in KNeighbo…
jjerphan Sep 2, 2021
e8664df
Fix NearestNeighbors docstring for Numpydoc
jjerphan Sep 2, 2021
339ab30
Use metric="fast_sqeuclidean" for pairwise_distances_argmin internal …
jjerphan Sep 2, 2021
f9e337c
Add n_threads on PairwiseDistancesReduction
jjerphan Sep 3, 2021
d14af8e
Pass n_threads on PairwiseDistancesReduction calls
jjerphan Sep 3, 2021
e8f0468
Factorise tests and add another for n_threads agnosticism
jjerphan Sep 3, 2021
7e5775c
Add docstring for RadiusNeighborhood.compute and improve others
jjerphan Sep 3, 2021
2d267b4
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Sep 3, 2021
e9803de
Fix conjugation
jjerphan Sep 6, 2021
fb05746
Use correct wording
jjerphan Sep 6, 2021
5f14488
Format code à la black
jjerphan Sep 6, 2021
b71811c
Format docstring for 'auto' strategy
jjerphan Sep 6, 2021
9c819c8
Reword 'reduced distance' for 'rank-preserving surrogate distance'
jjerphan Sep 6, 2021
1ff2433
Look-up for the strategy in scikit-learn's configuration if not speci…
jjerphan Sep 6, 2021
0170af3
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Sep 6, 2021
8530267
Clarify wording in comment regarding n_threads
jjerphan Sep 8, 2021
0365c34
Apply some small reviews suggestions
jjerphan Sep 9, 2021
e754b67
Remove redundant statement
jjerphan Sep 9, 2021
fe2f8de
Do not validate X and Y for same number of dimensions
jjerphan Sep 14, 2021
4c6253e
Do not validate X and Y for same number of dimensions
jjerphan Sep 14, 2021
7c86a39
Remove checks for CSR matrices in DatasetsPair
jjerphan Sep 14, 2021
b3efe85
Rename sparse_{dist,rdist} to csr_{dist,rdist}
jjerphan Sep 14, 2021
a81c2f8
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Sep 14, 2021
a6f0e4a
fixup! Do not validate X and Y for same number of dimensions
jjerphan Sep 14, 2021
7231535
Don't use f-strings for docstrings
jjerphan Sep 14, 2021
faed7cc
Remove some checks on arrays
jjerphan Sep 23, 2021
7ca3d5e
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Sep 23, 2021
66b60b8
Compute squared euclidean norm for rows in parallel
jjerphan Sep 24, 2021
90cb9fd
Validate arrays for C-contiguity where needed
jjerphan Sep 24, 2021
1ae16d7
Add tests for Neighbors-mixins subclasses
jjerphan Sep 27, 2021
e9dfc95
Xfail test on another numerical edge-case
jjerphan Sep 27, 2021
2dcac3f
Use PyArray_SetBaseObject via NumPy Cython API
jjerphan Sep 27, 2021
c5524f3
Apply review suggestions
jjerphan Sep 28, 2021
eda2b26
Revert whats_new entry.
jjerphan Sep 14, 2021
952d41a
Test the fast euclidean overriding
jjerphan Sep 29, 2021
86c0d6f
Mention distances computations and their reduction in dedicated method
jjerphan Sep 29, 2021
bd5a1db
fixup! Apply review suggestions
jjerphan Sep 29, 2021
1e2181d
Tight self.neigh_{indices,distances}'s lifetime to their composite
jjerphan Sep 29, 2021
3052db3
Adapt docstrings
jjerphan Sep 29, 2021
a758fc1
Factor compute in base class
jjerphan Sep 29, 2021
3dbe038
Do not use frenchism
jjerphan Sep 29, 2021
e8deb0f
Test fast metric alternatives fallbacks
jjerphan Sep 30, 2021
5911f1c
fixup! Test fast metric alternatives fallbacks
jjerphan Sep 30, 2021
9b9fb7c
Change warning message
jjerphan Sep 30, 2021
5fc91e1
Document and reword interfaces
jjerphan Oct 5, 2021
f75f08e
Better adapt the strategy for uniform weighting
jjerphan Oct 5, 2021
b484320
fixup! Better adapt the strategy for uniform weighting
jjerphan Oct 5, 2021
c134674
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Oct 5, 2021
79786d0
Make pytest happy with proper checks on array-likes
jjerphan Oct 5, 2021
cbf40ea
Optimize squared norms' computations
jjerphan Oct 7, 2021
d25021e
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Oct 7, 2021
55af187
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Oct 12, 2021
bf37c3d
Import backport to avoid runtime introspection
jjerphan Oct 12, 2021
5660c0e
Clarify comments for sorting arrays
jjerphan Oct 12, 2021
072de9e
Correct indentation
jjerphan Oct 12, 2021
e191be2
Prefer "surrogate distance" as a naming
jjerphan Oct 12, 2021
775a10d
Use better notations for maths
jjerphan Oct 12, 2021
dab0d4c
Clarify the fast specialized alternative internals in the docstring
jjerphan Oct 12, 2021
b423d6a
Some more doc-string
jjerphan Oct 15, 2021
f6f76ce
Clean post-merge Circle CI script
jjerphan Oct 15, 2021
4078b0d
Doc Add whats_new entry for #20254
jjerphan Oct 15, 2021
ceff923
Better explain PairwiseDistancesArgKmin datastructures usage
jjerphan Oct 18, 2021
51caae2
fixup! Doc Add whats_new entry for #20254
jjerphan Oct 18, 2021
8613fd6
Better explain PairwiseDistancesRadiusNeighborhood behavior
jjerphan Oct 18, 2021
8b5e0b6
Merge branch 'main' into pairwise_aggregation_cython
jjerphan Oct 18, 2021
4bf7eee
Move changelog entry under the Miscellaneous section
jjerphan Oct 18, 2021
7fa4a40
Address review comments
jjerphan Oct 18, 2021
4a89d7f
Address review comments
jjerphan Oct 18, 2021
8f63e01
Fix config for 'pairwise_dist_chunk_size'
jjerphan Oct 20, 2021
2ad33ec
Delay and better scope arrays C ordering
jjerphan Oct 20, 2021
eba6f03
Simplify counting for remainder chunks
jjerphan Oct 23, 2021
34468ad
Better motivate heaps' parallel allocation
jjerphan Oct 25, 2021
fa424a4
Remove PairwiseDistancesRadiusNeighborhood
jjerphan Oct 25, 2021
5678666
Remove DatasetsPair used for sparse datasets
jjerphan Oct 25, 2021
33 changes: 33 additions & 0 deletions doc/whats_new/v1.1.rst
@@ -113,6 +113,39 @@ Changelog
   :pr:`20880` by :user:`Guillaume Lemaitre <glemaitre>`
   and :user:`András Simon <simonandras>`.

+Miscellaneous
+.............
+
+- |Efficiency| Low-level routines for reductions on pairwise distances
+  for dense float64 datasets have been refactored. The following functions
+  and estimators now benefit from improved performance, in particular on
+  multi-core machines:
+  - :func:`sklearn.metrics.pairwise_distances_argmin`
+  - :func:`sklearn.metrics.pairwise_distances_argmin_min`
+  - :class:`sklearn.cluster.AffinityPropagation`
+  - :class:`sklearn.cluster.Birch`
+  - :class:`sklearn.cluster.DBSCAN`
+  - :class:`sklearn.cluster.MeanShift`
+  - :class:`sklearn.cluster.OPTICS`
+  - :class:`sklearn.cluster.SpectralClustering`
+  - :func:`sklearn.feature_selection.mutual_info_regression`
+  - :class:`sklearn.neighbors.KNeighborsClassifier`
+  - :class:`sklearn.neighbors.KNeighborsRegressor`
+  - :class:`sklearn.neighbors.LocalOutlierFactor`
+  - :class:`sklearn.neighbors.NearestNeighbors`
+  - :class:`sklearn.manifold.Isomap`
+  - :class:`sklearn.manifold.LocallyLinearEmbedding`
+  - :class:`sklearn.manifold.TSNE`
+  - :func:`sklearn.manifold.trustworthiness`
+  - :class:`sklearn.semi_supervised.LabelPropagation`
+  - :class:`sklearn.semi_supervised.LabelSpreading`
+
+  For instance :meth:`sklearn.neighbors.NearestNeighbors.kneighbors`
+  can be up to 20× faster than in previous versions.
+
+  :pr:`20254` by :user:`Julien Jerphanion <jjerphan>`.
+
+
 Code and Documentation Contributors
 -----------------------------------

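The changelog entry above concerns reductions over pairwise distances, that is, computing a result such as nearest-neighbor indices without storing the full distance matrix. A minimal pure-Python sketch of that idea follows. It is illustrative only and is not the PR's Cython implementation; the names `sq_dist` and `argkmin` and the `chunk_size` argument are assumptions made for this example.

```python
import heapq


def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def argkmin(X, Y, k, chunk_size=256):
    """For each row of X, return the indices of its k closest rows of Y.

    Y is scanned chunk by chunk (hypothetical chunking scheme). Each row of X
    keeps a bounded heap of its k best candidates, so the full
    len(X) x len(Y) distance matrix is never stored.
    """
    # heapq is a min-heap: storing (-distance, -index) keeps the worst
    # candidate (largest distance) at the root, ready to be evicted.
    heaps = [[] for _ in X]
    for start in range(0, len(Y), chunk_size):
        chunk = Y[start:start + chunk_size]
        for i, x in enumerate(X):
            heap = heaps[i]
            for offset, y in enumerate(chunk):
                item = (-sq_dist(x, y), -(start + offset))
                if len(heap) < k:
                    heapq.heappush(heap, item)
                elif item > heap[0]:
                    # Strictly better than the current worst candidate.
                    heapq.heapreplace(heap, item)
    # Order each result by increasing distance, ties broken by index.
    return [[-neg_j for _, neg_j in sorted(heap, reverse=True)] for heap in heaps]
```

The squared distance is used here because it ranks candidates in the same order as the Euclidean distance while avoiding a square root per pair.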
23 changes: 22 additions & 1 deletion sklearn/_config.py
@@ -9,6 +9,9 @@
     "working_memory": int(os.environ.get("SKLEARN_WORKING_MEMORY", 1024)),
     "print_changed_only": True,
     "display": "text",
+    "pairwise_dist_chunk_size": int(
+        os.environ.get("SKLEARN_PAIRWISE_DIST_CHUNK_SIZE", 256)
+    ),
 }
 _threadlocal = threading.local()

@@ -40,7 +43,11 @@ def get_config():


 def set_config(
-    assume_finite=None, working_memory=None, print_changed_only=None, display=None
+    assume_finite=None,
+    working_memory=None,
+    print_changed_only=None,
+    display=None,
+    pairwise_dist_chunk_size=None,
 ):
     """Set global scikit-learn configuration

@@ -80,6 +87,12 @@ def set_config(

         .. versionadded:: 0.23

+    pairwise_dist_chunk_size : int, default=None
+        The number of vectors per chunk for PairwiseDistancesReduction.
+        Default is 256 (suitable for most modern laptops' caches and architectures).
+
+        .. versionadded:: 1.1
+
     See Also
     --------
     config_context : Context manager for global scikit-learn configuration.
@@ -95,6 +108,8 @@ def set_config(
         local_config["print_changed_only"] = print_changed_only
     if display is not None:
         local_config["display"] = display
+    if pairwise_dist_chunk_size is not None:
+        local_config["pairwise_dist_chunk_size"] = pairwise_dist_chunk_size


 @contextmanager
@@ -132,6 +147,12 @@ def config_context(**new_config):

         .. versionadded:: 0.23

+    pairwise_dist_chunk_size : int, default=None
+        The number of vectors per chunk for PairwiseDistancesReduction.
+        Default is 256 (suitable for most modern laptops' caches and architectures).
+
+        .. versionadded:: 1.1
+
     Notes
     -----
     All settings, not just those presently modified, will be returned to
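The configuration hunks above follow a common pattern: a default read from an environment variable, a per-thread copy of the settings, a setter that ignores `None`, and a context manager that restores the previous values. The sketch below reproduces that pattern in isolation with the standard library only. It is a simplified stand-in, not the contents of `sklearn/_config.py`; the helper `_get_threadlocal_config` and the single-key configuration are assumptions made for the example.

```python
import os
import threading
from contextlib import contextmanager

# Process-wide defaults; the environment variable name mirrors the diff above.
_global_config = {
    "pairwise_dist_chunk_size": int(
        os.environ.get("SKLEARN_PAIRWISE_DIST_CHUNK_SIZE", 256)
    ),
}
_threadlocal = threading.local()


def _get_threadlocal_config():
    """Return this thread's config, copying the global defaults on first use."""
    if not hasattr(_threadlocal, "config"):
        _threadlocal.config = _global_config.copy()
    return _threadlocal.config


def get_config():
    """Return a copy of the current thread's settings."""
    return _get_threadlocal_config().copy()


def set_config(pairwise_dist_chunk_size=None):
    """Update only the settings passed explicitly; None means 'unchanged'."""
    local_config = _get_threadlocal_config()
    if pairwise_dist_chunk_size is not None:
        local_config["pairwise_dist_chunk_size"] = pairwise_dist_chunk_size


@contextmanager
def config_context(**new_config):
    """Temporarily override settings, restoring the previous values on exit."""
    old_config = get_config()
    set_config(**new_config)
    try:
        yield
    finally:
        set_config(**old_config)
```

Using `None` as the "unchanged" sentinel is what lets `config_context` restore every setting by passing the saved dictionary back to `set_config`.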
4 changes: 3 additions & 1 deletion sklearn/cluster/_affinity_propagation.py
@@ -523,7 +523,9 @@ def predict(self, X):

         if self.cluster_centers_.shape[0] > 0:
             with config_context(assume_finite=True):
-                return pairwise_distances_argmin(X, self.cluster_centers_)
+                return pairwise_distances_argmin(
+                    X, self.cluster_centers_, metric="fast_euclidean"
+                )
         else:
             warnings.warn(
                 "This model does not have any cluster centers "
11 changes: 8 additions & 3 deletions sklearn/cluster/_birch.py
@@ -676,11 +676,15 @@ def predict(self, X):
         """
         check_is_fitted(self)
         X = self._validate_data(X, accept_sparse="csr", reset=False)
-        kwargs = {"Y_norm_squared": self._subcluster_norms}
+
+        fast_euclidean_kwargs = {"Y_norm_squared": self._subcluster_norms}

         with config_context(assume_finite=True):
             argmin = pairwise_distances_argmin(
-                X, self.subcluster_centers_, metric_kwargs=kwargs
+                X,
+                self.subcluster_centers_,
+                metric="fast_euclidean",
+                metric_kwargs=fast_euclidean_kwargs,
             )
         return self.subcluster_labels_[argmin]

@@ -726,7 +730,8 @@ def _global_clustering(self, X=None):
                 "n_clusters should be an instance of ClusterMixin or an int"
             )

-        # To use in predict to avoid recalculation.
+        # We compute subcluster norms once here, so that we won't need to compute them
+        # again at each call of `Birch.predict`.
         self._subcluster_norms = row_norms(self.subcluster_centers_, squared=True)

         if clusterer is None or not_enough_centroids:
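The Birch change above passes precomputed squared norms of the subcluster centers as `Y_norm_squared`. The internals of the `"fast_euclidean"` metric are not shown in this excerpt; the commit messages mention GEMM and squared norms, which suggests the usual expansion ||x - y||² = ||x||² - 2 x·y + ||y||². The pure-Python sketch below illustrates why precomputing the norms of `Y` helps under that assumption. The function names are hypothetical and this is not the PR's implementation.

```python
def row_norms_squared(A):
    """Squared Euclidean norm of each row of A."""
    return [sum(a * a for a in row) for row in A]


def argmin_fast_sqeuclidean(X, Y, Y_norm_squared=None):
    """Index of the closest row of Y for each row of X.

    Relies on ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2. The ||x||^2 term is
    identical for every candidate y, so it can be dropped without changing
    which candidate is the closest.
    """
    if Y_norm_squared is None:
        Y_norm_squared = row_norms_squared(Y)
    result = []
    for x in X:
        best_j, best_val = -1, float("inf")
        for j, y in enumerate(Y):
            dot = sum(a * b for a, b in zip(x, y))
            # Surrogate value: omits ||x||^2, preserves the ranking of candidates.
            val = Y_norm_squared[j] - 2.0 * dot
            if val < best_val:
                best_j, best_val = j, val
        result.append(best_j)
    return result
```

When the same centers are queried repeatedly, as in `Birch.predict`, computing `Y_norm_squared` once and reusing it avoids redoing that work on every call. Note that this expansion can lose precision through cancellation when points are close relative to their norms, which is consistent with the numerical edge cases mentioned in the commit log.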
4 changes: 3 additions & 1 deletion sklearn/cluster/_mean_shift.py
@@ -512,4 +512,6 @@ def predict(self, X):
         check_is_fitted(self)
         X = self._validate_data(X, reset=False)
         with config_context(assume_finite=True):
-            return pairwise_distances_argmin(X, self.cluster_centers_)
+            return pairwise_distances_argmin(
+                X, self.cluster_centers_, metric="fast_euclidean"
+            )
37 changes: 37 additions & 0 deletions sklearn/metrics/_dist_metrics.pxd
@@ -61,6 +61,22 @@ cdef class DistanceMetric:
     cdef DTYPE_t rdist(self, const DTYPE_t* x1, const DTYPE_t* x2,
                        ITYPE_t size) nogil except -1

+    cdef DTYPE_t csr_dist(
+        self,
+        const DTYPE_t[:] x1_data,
+        const ITYPE_t[:] x1_indices,
+        const DTYPE_t[:] x2_data,
+        const ITYPE_t[:] x2_indices,
+    ) nogil except -1
+
+    cdef DTYPE_t csr_rdist(
+        self,
+        const DTYPE_t[:] x1_data,
+        const ITYPE_t[:] x1_indices,
+        const DTYPE_t[:] x2_data,
+        const ITYPE_t[:] x2_indices,
+    ) nogil except -1
+
     cdef int pdist(self, const DTYPE_t[:, ::1] X, DTYPE_t[:, ::1] D) except -1

     cdef int cdist(self, const DTYPE_t[:, ::1] X, const DTYPE_t[:, ::1] Y,
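The `csr_dist` and `csr_rdist` declarations above take, for each of two sparse rows, the non-zero values and their column indices. Only the signatures appear in this excerpt, so the following pure-Python sketch is an assumption about how such a method could work for the Euclidean case: it walks both index arrays in step, as in a sorted merge. The function names are hypothetical, and the sketch assumes the indices of each row are sorted in increasing order.

```python
import math


def csr_rdist_sqeuclidean(x1_data, x1_indices, x2_data, x2_indices):
    """Squared Euclidean distance between two sparse rows.

    Each row is described by its non-zero values and their column indices,
    with indices sorted in increasing order.
    """
    i1 = i2 = 0
    total = 0.0
    while i1 < len(x1_indices) and i2 < len(x2_indices):
        c1, c2 = x1_indices[i1], x2_indices[i2]
        if c1 == c2:
            # Both rows have a non-zero in this column.
            diff = x1_data[i1] - x2_data[i2]
            total += diff * diff
            i1 += 1
            i2 += 1
        elif c1 < c2:
            # Only the first row has a non-zero here; the other value is 0.
            total += x1_data[i1] ** 2
            i1 += 1
        else:
            total += x2_data[i2] ** 2
            i2 += 1
    # Entries left over in either row face implicit zeros in the other.
    total += sum(v * v for v in x1_data[i1:])
    total += sum(v * v for v in x2_data[i2:])
    return total


def csr_dist_euclidean(x1_data, x1_indices, x2_data, x2_indices):
    """Euclidean distance, obtained from the squared (reduced) distance."""
    return math.sqrt(
        csr_rdist_sqeuclidean(x1_data, x1_indices, x2_data, x2_indices)
    )
```

The cost is proportional to the number of non-zeros in the two rows rather than to the number of features.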
@@ -69,3 +85,24 @@ cdef class DistanceMetric:
     cdef DTYPE_t _rdist_to_dist(self, DTYPE_t rdist) nogil except -1

     cdef DTYPE_t _dist_to_rdist(self, DTYPE_t dist) nogil except -1
+
+
+######################################################################
+# DatasetsPair base class
+cdef class DatasetsPair:
+    cdef DistanceMetric distance_metric
+
+    cdef ITYPE_t n_samples_X(self) nogil
+
+    cdef ITYPE_t n_samples_Y(self) nogil
+
+    cdef DTYPE_t dist(self, ITYPE_t i, ITYPE_t j) nogil
+
+    cdef DTYPE_t surrogate_dist(self, ITYPE_t i, ITYPE_t j) nogil
+
+
+cdef class DenseDenseDatasetsPair(DatasetsPair):
+    cdef:
+        const DTYPE_t[:, ::1] X
+        const DTYPE_t[:, ::1] Y
+        ITYPE_t d
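The `DatasetsPair` declarations above describe an interface that hides how `X` and `Y` are stored and exposes distances by row index. A plain-Python analogue of that interface might look as follows. This is a sketch under stated assumptions: the Euclidean metric is hard-coded instead of delegating to a `distance_metric` attribute, and the default `surrogate_dist` falling back to `dist` is a guess, since the method bodies are not part of this excerpt.

```python
import math
from abc import ABC, abstractmethod


class DatasetsPair(ABC):
    """Abstract view over two datasets, giving distances by row index."""

    @abstractmethod
    def n_samples_X(self):
        """Number of rows in X."""

    @abstractmethod
    def n_samples_Y(self):
        """Number of rows in Y."""

    @abstractmethod
    def dist(self, i, j):
        """Distance between X[i] and Y[j]."""

    def surrogate_dist(self, i, j):
        # Assumed default: the true distance is trivially rank-preserving.
        return self.dist(i, j)


class DenseDenseDatasetsPair(DatasetsPair):
    """Both datasets stored as dense lists of rows (Euclidean metric assumed)."""

    def __init__(self, X, Y):
        self.X = X
        self.Y = Y
        self.d = len(X[0])  # number of features, as in the `d` attribute above

    def n_samples_X(self):
        return len(self.X)

    def n_samples_Y(self):
        return len(self.Y)

    def surrogate_dist(self, i, j):
        # Squared distance: cheaper, and it orders pairs like the true distance.
        return sum((a - b) ** 2 for a, b in zip(self.X[i], self.Y[j]))

    def dist(self, i, j):
        return math.sqrt(self.surrogate_dist(i, j))
```

With this split, a reduction can compare candidates using `surrogate_dist` and convert only the retained values to true distances at the end.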