-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH Add the fused CSR dense case for Euclidean Specializations #25044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Add the fused CSR dense case for Euclidean Specializations #25044
Conversation
a943f79
to
0243c1d
Compare
This was already studied in: scikit-learn#25449 Co-authored-by: Vincent M <maladiere.vincent@yahoo.fr>
Hi @OmarManzoor, @Vincent-Maladiere, @Micky774 and @adam2392: might some of you might be interested in reviewing this PR? 🙂 |
Sure! I might need some time to understand the context of the PR though. |
I am wondering about the kind of tests we could add. Should we add tests on combinations of sparse and dense datasets for all the interfaces? Should we add more of them? |
Yes, feel free to take your time and do not feel pressured (the class hierarchy is rather dense and its implementations Tempita-heavy). |
This is required with Cython>=3.0.
Should I write more about the design? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Maybe we can enhance the tests by including the combinations within test_sqeuclidean_row_norms and test_pairwise_distances_argkmin.
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
and an extra +1 for the opportunity to remove the lines with the XOR operator which were not easy to reason about :)
I think there are enough tests in this PR and the already existing tests in |
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Signed-off-by: Julien Jerphanion <git@jjerphan.xyz>
6a3f58c
to
45d425a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
PS: I had to find something:smirk:
sklearn/metrics/_pairwise_distances_reduction/_middle_term_computer.pyx.tp
Outdated
Show resolved
Hide resolved
# uses the Squared Euclidean matrix decomposition, i.e.: | ||
# | ||
# ||X_c_i - Y_c_j||² = ||X_c_i||² - 2 X_c_i.Y_c_j^T + ||Y_c_j||² |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this "trick" still documented somewhere in the pairwise distances codebase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is given here:
scikit-learn/sklearn/metrics/_pairwise_distances_reduction/_middle_term_computer.pyx.tp
Lines 77 to 93 in dfe9e2e
cdef class MiddleTermComputer{{name_suffix}}: | |
"""Helper class to compute a Euclidean distance matrix in chunks. | |
This is an abstract base class that is further specialized depending | |
on the type of data (dense or sparse). | |
`EuclideanDistance` subclasses relies on the squared Euclidean | |
distances between chunks of vectors X_c and Y_c using the | |
following decomposition for the (i,j) pair : | |
||X_c_i - Y_c_j||² = ||X_c_i||² - 2 X_c_i.Y_c_j^T + ||Y_c_j||² | |
This helper class is in charge of wrapping the common logic to compute | |
the middle term, i.e. `- 2 X_c_i.Y_c_j^T`. | |
""" |
I am thinking of writing a couple of notes for the design of sklearn.metrics._pairwise_distances_reduction
. What do you think of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be great. Not too many details, but giving a good overview.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Which co-authors should I keep when merging? |
All, I guess? I let this choice to the maintainer merging. |
Thank you for your reviews, @OmarManzoor, @ogrisel, and @lorentzenchr. |
Reference Issues/PRs
Towards #22587.
Follow up of #24556.
What does this implement/fix? Explain your changes.
The CSR-dense and the dense-CSR cases were chosen not to be supported for the Euclidean specialization of all
PairwiseDistancesReductions
(for more context see #23585 (comment)).This PR implements
SparseDenseMiddleTermComputer
allows computing the middle term of the distance matrix decomposition for the Euclidean specializations, covering those two missing cases.Hence this completes all the combinations for the Euclidean specialisations (:tada:).
Any other comments?
Different designs have been explored in other Pull Requests to factor some logic altogether or rethink
DatasetsPairs
w.r.t.MiddleTermComputer
for the Euclidean specialisations.In overall, this PR seems to have the best tradeoff regarding performance and duplication of code.
Benchmarks
This makes using
PairwiseDistancesReductions
on the CSR-dense and the dense-CSR for euclidean competitive w.r.t to the previous implementation relying on joblib.One can get up to ×2 on a laptop.
Details
TODO:
DenseDenseMiddleTermComputer64
DatasetsPair
w.r.t the newMiddleTermComputer
to remove the duplicated logic and to clarify responsabilities (esp. squared euclidean norm computations)DatasetsPair
instead ofArgKmin
andRadiusNeighbors
#25170, works, but suffers from a lot of method dispatches overhead