ENH Add the fused CSR dense case for Euclidean Specializations #25044

jjerphan · 2022-11-25T16:04:18Z

Reference Issues/PRs

Towards #22587.
Follow up of #24556.

What does this implement/fix? Explain your changes.

The CSR-dense and the dense-CSR cases were chosen not to be supported for the Euclidean specialization of all PairwiseDistancesReductions (for more context see #23585 (comment)).

This PR implements SparseDenseMiddleTermComputer, which computes the middle term of the distance matrix decomposition for the Euclidean specializations, covering those two missing cases.

Hence this completes all the combinations for the Euclidean specialisations (:tada:).
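
To make the computation concrete, here is a minimal pure-Python sketch of a middle term between a CSR matrix and a dense matrix. It is not the Cython/Tempita code of this PR: the function name `sparse_dense_middle_term` and the list-based CSR layout (`data`, `indices`, `indptr`) are assumptions chosen for illustration.

```python
def sparse_dense_middle_term(data, indices, indptr, Y):
    """Return D with D[i][j] = -2 * <X_i, Y_j> for X in CSR form and dense Y.

    X is described by the usual CSR triplet (data, indices, indptr);
    Y is a list of dense rows. Hypothetical helper, for illustration only.
    """
    n_X = len(indptr) - 1
    n_Y = len(Y)
    D = [[0.0] * n_Y for _ in range(n_X)]
    for i in range(n_X):
        # Only the stored (non-zero) entries of row i contribute.
        for ptr in range(indptr[i], indptr[i + 1]):
            k = indices[ptr]  # feature index of the stored value
            x_ik = data[ptr]
            for j in range(n_Y):
                D[i][j] += -2.0 * x_ik * Y[j][k]
    return D


# X = [[1, 0, 2],
#      [0, 3, 0]] in CSR form.
data, indices, indptr = [1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3]
Y = [[1.0, 1.0, 1.0], [0.0, 2.0, 4.0]]

middle = sparse_dense_middle_term(data, indices, indptr, Y)
```

Adding the squared norms of the rows of X and Y to `middle` gives the squared Euclidean distances. A dense-CSR input could presumably be handled by swapping the roles of the two arguments and transposing the result.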

Any other comments?

Different designs have been explored in other Pull Requests to factor out shared logic or to rethink DatasetsPairs w.r.t. MiddleTermComputer for the Euclidean specialisations.

Overall, this PR seems to offer the best tradeoff between performance and code duplication.

Benchmarks

This makes PairwiseDistancesReductions on the CSR-dense and dense-CSR cases for Euclidean distances competitive w.r.t. the previous implementation relying on joblib.

One can get up to a ×2 speed-up on a laptop.

Details

In [1]: from sklearn.neighbors import NearestNeighbors
   ...: from sklearn.datasets import make_classification
   ...: from scipy.sparse import csr_matrix
   ...: 
   ...: import sklearn

In [2]: X_train, _ = make_classification(n_samples=100_000, n_features=100)
   ...: X_test, _ = make_classification(n_samples=256, n_features=100)
   ...: 
   ...: X_test = csr_matrix(X_test)

In [3]: nn = NearestNeighbors().fit(X_train)

In [4]: %%timeit
   ...: nn.kneighbors(X_test)
   ...: 
   ...: 
899 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %%timeit
   ...: with sklearn.config_context(enable_cython_pairwise_dist=False):
   ...:     nn.kneighbors(X_test)
   ...: 
2.24 s ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: X_train, _ = make_classification(n_samples=256, n_features=100)
   ...: X_test, _ = make_classification(n_samples=100_000, n_features=100)
   ...: 
   ...: X_test = csr_matrix(X_test)

In [7]: nn = NearestNeighbors().fit(X_train)

In [8]: %%timeit
   ...: nn.kneighbors(X_test)
   ...: 
   ...: 
760 ms ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %%timeit
   ...: with sklearn.config_context(enable_cython_pairwise_dist=False):
   ...:     nn.kneighbors(X_test)
   ...: 
1.22 s ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

TODO:

- Refactor the logic to:
  - Remove the cast on the entire dense array and data array and only upcast chunks using dedicated buffers (as done for DenseDenseMiddleTermComputer64)
  - Redesign DatasetsPair w.r.t. the new MiddleTermComputer to remove the duplicated logic and to clarify responsibilities (esp. squared Euclidean norm computations)
    - Comment: the redesign was tried in ENH Euclidean specialization of DatasetsPair instead of ArgKmin and RadiusNeighbors #25170; it works, but suffers from a lot of method dispatch overhead
- Add more tests?
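
The first refactoring item can be illustrated with a small sketch: rather than converting a whole float32 array to float64 up front, only the chunk being processed is copied into a reusable float64 buffer. This uses the standard-library `array` module; `upcast_chunks` and `chunk_size` are hypothetical names, not part of scikit-learn.

```python
from array import array


def upcast_chunks(values32, chunk_size):
    """Yield (buffer, n) pairs where buffer[:n] holds one chunk upcast to float64.

    The same buffer is reused for every chunk, so at most `chunk_size`
    float64 values are allocated regardless of the input length.
    """
    buffer64 = array("d", [0.0] * chunk_size)  # dedicated float64 buffer
    for start in range(0, len(values32), chunk_size):
        stop = min(start + chunk_size, len(values32))
        n = stop - start
        for k in range(n):
            buffer64[k] = values32[start + k]  # float32 -> float64 upcast
        # Only the first n entries are valid for this chunk.
        yield buffer64, n


values32 = array("f", [0.5, 1.5, 2.5, 3.5, 4.5])

# Accumulate a squared norm chunk by chunk, in float64.
squared_norm = 0.0
for buffer64, n in upcast_chunks(values32, chunk_size=2):
    squared_norm += sum(buffer64[k] * buffer64[k] for k in range(n))
```

The peak extra memory is bounded by the buffer size rather than by the size of the dataset, which is the point of the refactoring item.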

jjerphan · 2023-02-27T12:53:12Z

Hi @OmarManzoor, @Vincent-Maladiere, @Micky774 and @adam2392: might some of you be interested in reviewing this PR? 🙂

OmarManzoor · 2023-02-27T12:57:26Z

> Hi @OmarManzoor, @Vincent-Maladiere and @adam2392: might some of you be interested in reviewing this PR? 🙂

Sure! I might need some time to understand the context of the PR though.

jjerphan · 2023-02-27T13:02:11Z

I am wondering about the kind of tests we could add. Should we add tests on combinations of sparse and dense datasets for all the interfaces? Should we add more of them?

jjerphan · 2023-02-27T13:03:27Z

> I might need some time to understand the context of the PR though.

Yes, feel free to take your time and do not feel pressured (the class hierarchy is rather dense and its implementations are Tempita-heavy).

jjerphan · 2023-02-28T08:25:11Z

Should I write more about the design?

OmarManzoor

LGTM. Maybe we can enhance the tests by including the combinations within test_sqeuclidean_row_norms and test_pairwise_distances_argkmin.

ogrisel

Great work!

and an extra +1 for the opportunity to remove the lines with the XOR operator which were not easy to reason about :)

ogrisel · 2023-03-01T15:41:38Z

> Add more tests?

I think there are enough tests in this PR and the already existing tests in main.

lorentzenchr

LGTM
PS: I had to find something :smirk:

lorentzenchr · 2023-03-10T10:50:52Z

sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py

-        # uses the Squared Euclidean matrix decomposition, i.e.:
-        #
-        #       ||X_c_i - Y_c_j||² = ||X_c_i||² - 2 X_c_i.Y_c_j^T + ||Y_c_j||²


Is this "trick" still documented somewhere in the pairwise distances codebase?

Yes, it is given here:

scikit-learn/sklearn/metrics/_pairwise_distances_reduction/_middle_term_computer.pyx.tp

Lines 77 to 93 in dfe9e2e

    cdef class MiddleTermComputer{{name_suffix}}:
        """Helper class to compute a Euclidean distance matrix in chunks.

        This is an abstract base class that is further specialized depending
        on the type of data (dense or sparse).

        `EuclideanDistance` subclasses relies on the squared Euclidean
        distances between chunks of vectors X_c and Y_c using the
        following decomposition for the (i,j) pair :


            ||X_c_i - Y_c_j||² = ||X_c_i||² - 2 X_c_i.Y_c_j^T + ||Y_c_j||²


        This helper class is in charge of wrapping the common logic to compute
        the middle term, i.e. `- 2 X_c_i.Y_c_j^T`.
        """
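
As a rough illustration of how this decomposition is used chunk by chunk, the following pure-Python sketch assembles squared Euclidean distances from precomputed squared norms and a per-chunk middle term. It is a simplified dense-dense stand-in, not the actual `MiddleTermComputer` code; `middle_term`, `chunked_sqeuclidean` and `chunk_size` are hypothetical names.

```python
def middle_term(X_c, Y_c):
    """Return the matrix of -2 * <X_c_i, Y_c_j> for two chunks of dense rows."""
    return [[-2.0 * sum(a * b for a, b in zip(x, y)) for y in Y_c] for x in X_c]


def chunked_sqeuclidean(X, Y, chunk_size):
    """Squared Euclidean distances between rows of X and Y, computed in chunks."""
    # The squared norms are computed once and reused for every chunk pair.
    x_norms = [sum(v * v for v in x) for x in X]
    y_norms = [sum(v * v for v in y) for y in Y]
    dist = [[0.0] * len(Y) for _ in X]
    for xs in range(0, len(X), chunk_size):
        X_c = X[xs:xs + chunk_size]
        for ys in range(0, len(Y), chunk_size):
            Y_c = Y[ys:ys + chunk_size]
            M = middle_term(X_c, Y_c)
            for i in range(len(X_c)):
                for j in range(len(Y_c)):
                    # ||x||² - 2 x.y + ||y||²
                    dist[xs + i][ys + j] = x_norms[xs + i] + M[i][j] + y_norms[ys + j]
    return dist


X = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
Y = [[0.0, 0.0], [1.0, 0.0]]
dist = chunked_sqeuclidean(X, Y, chunk_size=2)
```

Only the middle term depends on both chunks at once, which is why it is the part delegated to a dedicated helper.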

I am thinking of writing a couple of notes for the design of sklearn.metrics._pairwise_distances_reduction. What do you think of it?

That would be great. Not too many details, but giving a good overview.

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

lorentzenchr · 2023-03-10T11:19:45Z

Which co-authors should I keep when merging?

jjerphan · 2023-03-10T11:49:31Z

> Which co-authors should I keep when merging?

All, I guess? I leave this choice to the maintainer merging.

jjerphan · 2023-03-10T12:34:03Z

Thank you for your reviews, @OmarManzoor, @ogrisel, and @lorentzenchr.

github-actions bot added module:metrics cython labels Nov 25, 2022

jjerphan added No Changelog Needed and removed module:metrics cython labels Nov 25, 2022

jjerphan marked this pull request as draft November 25, 2022 16:13

jjerphan mentioned this pull request Nov 25, 2022

PERF PairwiseDistancesReductions initial work #22587

Closed

Vincent-Maladiere mentioned this pull request Dec 12, 2022

ENH Euclidean specialization of DatasetsPair instead of ArgKmin and RadiusNeighbors #25170

Closed

ENH Add the fused CSR dense case for Euclidean Specializations

0243c1d

jjerphan force-pushed the enh/pdr-fused-sparse-dense-euclidean branch from a943f79 to 0243c1d Compare February 1, 2023 10:26

jjerphan added 2 commits February 27, 2023 11:39

Merge branch 'main' into enh/pdr-fused-sparse-dense-euclidean

3344567

Remove the upcast from float32 to float64

303bc90

jjerphan marked this pull request as ready for review February 27, 2023 10:55

jjerphan and others added 2 commits February 27, 2023 12:09

DOC Add a changelog entry for 1.3

8576390

Remove outdated TODO comment

a874def

This was already studied in: scikit-learn#25449 Co-authored-by: Vincent M <maladiere.vincent@yahoo.fr>

jjerphan added Waiting for Reviewer cython labels Feb 27, 2023

Add noexcept qualification

12700d6

This is required with Cython>=3.0.

OmarManzoor approved these changes Feb 28, 2023

jjerphan and others added 2 commits February 28, 2023 14:26

TST Completes tests

6563505

Co-authored-by: Omar Salman <omar.salman@arbisoft.com>

DOC Reword whats_new entry

ff6d47f

jjerphan commented Mar 1, 2023

DOC Reword whats_new entry

17b6845

ogrisel approved these changes Mar 1, 2023

DOC Better word the whats_new entry

22d2cd2

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

jjerphan added Waiting for Second Reviewer and removed Waiting for Reviewer labels Mar 1, 2023

DOC Better document implementation

45d425a

Signed-off-by: Julien Jerphanion <git@jjerphan.xyz>

jjerphan force-pushed the enh/pdr-fused-sparse-dense-euclidean branch from 6a3f58c to 45d425a Compare March 9, 2023 09:38

lorentzenchr approved these changes Mar 10, 2023

DOC Better word comment and changelog entry

b95ca39

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

jjerphan added Performance and removed Waiting for Second Reviewer, No Changelog Needed labels Mar 10, 2023

lorentzenchr enabled auto-merge (squash) March 10, 2023 12:24

lorentzenchr approved these changes Mar 10, 2023

lorentzenchr merged commit 67ea720 into scikit-learn:main Mar 10, 2023

jjerphan deleted the enh/pdr-fused-sparse-dense-euclidean branch March 10, 2023 12:32

jjerphan mentioned this pull request Mar 29, 2023

Introduce SIMD intrinsics for _dist_metrics.pyx #26010

Open
