ENH Euclidean specialization of `DatasetsPair` instead of `ArgKmin` and `RadiusNeighbors` #25170
Conversation
Thanks @Vincent-Maladiere for this significant rework! 💯

The design looks good and implementable at first sight.

I think we'd better check for any performance regression now before proceeding. In this regard, I think scikit-learn/pairwise-distances-reductions-asv-suite can help you.

I recommend reducing the parametrisation of the benchmark to cover:

- `metric="euclidean"`
- `strategy="auto"`
- `dtype` in `[float32, float64]`
- `(X_train, X_test)` in `(("dense", "dense"), ("csr", "csr"))`
- `return_distance=True`

and running them only for `time_ArgKmin`, which can be done using:

```bash
asv continuous --verbose --show-stderr --split \
    --bench "PairwiseDistancesReductionsBenchmark.time_ArgKmin" \
    main refs/pull/25170/head
```

On a machine with 128 physical cores, this can easily take 10 hours to run (because the default timeout has been changed from 500 to 1000 seconds).

You might want to start with quick runs to check for any problems and get a rough assessment of performance. For this, I recommend:

- running the benchmark on pairs of datasets with a small to medium number of samples (i.e. 1000 up to 100_000)
- adding the `--quick` option to the `asv continuous` command above so that each combination is run only once (see the combined command below)
- decreasing the timeout down to a few hundred seconds
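For instance, a quick-run variant of the command above would look like:

```bash
asv continuous --verbose --show-stderr --split --quick \
    --bench "PairwiseDistancesReductionsBenchmark.time_ArgKmin" \
    main refs/pull/25170/head
```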
In the meantime, here are a few comments on this first iteration.
```diff
@@ -417,8 +350,12 @@ cdef class SparseSparseMiddleTermComputer{{name_suffix}}(MiddleTermComputer{{nam
     def __init__(
         self,
         X,
         Y,
         const DTYPE_t[:] X_data,
```
Suggested change:

```diff
-        const DTYPE_t[:] X_data,
+        const {{INPUT_DTYPE_t}}[:] X_data,
```
```cython
        const DTYPE_t[:] X_data,
        const SPARSE_INDEX_TYPE_t[:] X_indices,
        const SPARSE_INDEX_TYPE_t[:] X_indptr,
        const DTYPE_t[:] Y_data,
```
Suggested change:

```diff
-        const DTYPE_t[:] Y_data,
+        const {{INPUT_DTYPE_t}}[:] Y_data,
```
Thanks, this is now fixed.

This one is interesting, though, because on the main branch we currently cast `X_data` and `Y_data` to float64, and we call the routine `sparse_sparse_middle_term_computation_64` for both `SparseSparseMiddleTermComputer{32, 64}`.

I saw that you changed this behavior in your work on `SparseDenseDatasetsPair` to avoid casting. Could you provide more details about this decision?
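For context, the casting on `main` amounts to something like this schematic NumPy sketch (not the actual Cython code):

```python
import numpy as np
import scipy.sparse as sp

# On main, float32 CSR data is upcast to float64 so that a single float64
# middle-term routine can serve both input dtypes; the upcast costs one
# extra allocation and copy of the .data arrays.
X = sp.random(1_000, 100, density=0.01, format="csr", dtype=np.float32)
X_data64 = np.asarray(X.data, dtype=np.float64)  # extra copy for float32 input
print(X.data.dtype, "->", X_data64.dtype)  # float32 -> float64
```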
```cython
self.middle_term_computer._compute_dist_middle_terms(
    X_start,
    X_end,
    Y_start,
    Y_end,
    thread_num,
)
```
I think this should rather be moved into `MiddleTermComputer._parallel_on_Y_pre_compute_and_reduce_distances_on_chunks`.
```cython
self.middle_term_computer._compute_dist_middle_terms(
    X_start,
    X_end,
    Y_start,
    Y_end,
    thread_num,
)
```
I think this should rather be moved into `MiddleTermComputer._parallel_on_X_pre_compute_and_reduce_distances_on_chunks`.
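A rough Python-level sketch of what both of these suggestions amount to (a simplification, not the PR's actual Cython code):

```python
class MiddleTermComputer:
    # Sketch: the hook computes and caches the middle terms itself, so the
    # call sites quoted above no longer need to invoke
    # _compute_dist_middle_terms directly.
    def _parallel_on_X_pre_compute_and_reduce_distances_on_chunks(
        self, X_start, X_end, Y_start, Y_end, thread_num
    ):
        self.dist_middle_terms = self._compute_dist_middle_terms(
            X_start, X_end, Y_start, Y_end, thread_num
        )

    # The _parallel_on_Y_* variant would be reworked the same way.
```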
```cython
):
    super().__init__(X, Y, distance_metric)

    # Used to compute the surrogate distance.
```
I think this can be read as:

> It used to compute the surrogate distance.

rather than (what I think you meant):

> It is used to compute the surrogate distance.

Moreover, can you explain how this is being used?
```cython
# several orders of magnitude compared to the generic {ArgKmin, RadiusNeighbors}
# implementation.
```
Suggested change:

```diff
-# several orders of magnitude compared to the generic {ArgKmin, RadiusNeighbors}
-# implementation.
+# several orders of magnitude compared to the generic
+# {ArgKmin, RadiusNeighbors} implementations.
```
```cython
if use_euclidean_specialization:
    use_squared_distances = metric == "sqeuclidean"
```
I think the right-hand side expression can be inlined in the two calls that use `use_squared_distances` hereinafter, and these two lines removed.
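In other words, something like this (the call site here is a hypothetical illustration):

```python
# Before: a temporary is bound first, then used by the two calls.
use_squared_distances = metric == "sqeuclidean"
datasets_pair = EuclideanDenseDenseDatasetsPair(X, Y, use_squared_distances)

# After: the expression is inlined at each call site and the assignment removed.
datasets_pair = EuclideanDenseDenseDatasetsPair(X, Y, metric == "sqeuclidean")
```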
```cython
dict metric_kwargs=None,
dict euclidean_kwargs=None,
```
I would combine both extra kwargs dictionaries here into a single `metric_kwargs`.

What do you think?
Yes, I think that would make sense and would also be better than having two kwargs.

We need to extract the variables related to `MiddleTermComputer` from `metric_kwargs` before passing `metric_kwargs` to `DistanceMetric` during `DatasetsPair.get_for()`.
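A minimal sketch of that extraction step (the key names here are illustrative assumptions, not necessarily the actual ones):

```python
def split_metric_kwargs(metric_kwargs):
    """Separate MiddleTermComputer-specific entries from metric_kwargs."""
    metric_kwargs = dict(metric_kwargs or {})
    # Pop the entries consumed by MiddleTermComputer so that only
    # DistanceMetric-compatible keys are forwarded to DistanceMetric.
    middle_term_kwargs = {
        key: metric_kwargs.pop(key)
        for key in ("X_norm_squared", "Y_norm_squared")
        if key in metric_kwargs
    }
    return middle_term_kwargs, metric_kwargs
```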
```cython
)


cdef class EuclideanRadiusNeighbors{{name_suffix}}(RadiusNeighbors{{name_suffix}}):
```
Once again: So noice. 💯
A few other comments on top of the older ones.
```cython
X_start,
Y_start,
X_start + i,
Y_start + j,
i,
j,
n_samples_Y,
thread_num,
```
Is it possible to use keyword arguments here?

This comment also applies to other places calling `DatasetsPair.surrogate_dist`.
```diff
@@ -41,6 +41,7 @@ cdef class DatasetsPair{{name_suffix}}:
     ITYPE_t Y_start,
     ITYPE_t i,
     ITYPE_t j,
     ITYPE_t n_Y,
     ITYPE_t thread_num=*
```
What is the reason for using `*` as a default here? What is its semantic?
I added `*` so that `surrogate_dist` could take `0` as the default for `thread_num`. Since `thread_num` is a new argument of `surrogate_dist`, I've set a default parameter to avoid impacting other calls. WDYT?
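For reference, this is standard Cython: in a `.pxd` declaration, `=*` only signals that the argument has a default value, while the value itself lives in the matching `.pyx`. A minimal sketch (file names and the reduced signature are hypothetical):

```cython
# datasets_pair.pxd -- "=*" declares that a default exists:
cdef class DatasetsPair:
    cdef DTYPE_t surrogate_dist(self, ITYPE_t i, ITYPE_t j, ITYPE_t thread_num=*)

# datasets_pair.pyx -- the actual default value is given in the implementation:
cdef class DatasetsPair:
    cdef DTYPE_t surrogate_dist(self, ITYPE_t i, ITYPE_t j, ITYPE_t thread_num=0):
        ...
```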
@Vincent-Maladiere: @Micky774 has been working on a related PR (linked above). As discussed with @Micky774 on a call, both of you might be interested in syncing up, since this PR and the one linked above are highly similar in scope.
I ran the ASV benchmark between `main` and this branch. I will run profiling via py-spy / speedscope on this branch to compare it with `main`.

Full ASV results
@Vincent-Maladiere: thanks for reporting those results! As you proposed, I would profile the first reported case on the two commits (i.e. `main` and this branch).
Native code can be profiled and results exported for SpeedScope using:
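For instance (with a hypothetical benchmark script named `bench.py`):

```bash
py-spy record --native --format speedscope -o profile.json -- python bench.py
```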
Profiles' files can be uploaded here so that they can be inspected by readers as well.
For future readers: I ran the following benchmark on a single thread to make the py-spy profiles easier to read:
```python
import numpy as np
import threadpoolctl
from sklearn.metrics._pairwise_distances_reduction import ArgKmin

rng = np.random.RandomState(42)
dtype = np.float64
n_train, n_test, n_features = 10_000, 100_000, 100
X_train = rng.randn(n_train, n_features).astype(dtype)
X_test = rng.randn(n_test, n_features).astype(dtype)

# Limit all native thread pools to a single thread so that the profile
# is not polluted by OpenMP/BLAS worker threads.
controller = threadpoolctl.ThreadpoolController()
with controller.limit(limits=1, user_api=None):
    ArgKmin.compute(
        X=X_test,
        Y=X_train,
        k=10,
        metric="euclidean",
        return_distance=True,
        strategy="auto",
    )
```

GitHub can't handle py-spy SVG profiles, so I placed them on Dropbox; you can download them there and upload each profile on speedscope.app.

Using the left-heavy view of speedscope, one can see where the time is spent on each commit. So, in an attempt to optimize the hot path, I made some changes; however, after a new ASV benchmark on 64 cores, we still have a similar performance decrease.

Another possible explanation for this decrease is that we make a lot of function calls through `DatasetsPair` and `MiddleTermComputer`. WDYT @jjerphan?
Thanks for reporting this, @Vincent-Maladiere.

I think this might be a reason for the performance regression; still, when looking at the two profiles, I think we need to spend some time inspecting this with other tools.
Thank you for exploring this, @Vincent-Maladiere. This allowed us to understand the tradeoff between a flexible design and the performance impact of method dispatch, making us converge on #25044. I am closing this PR for now; we could probably reopen it if dispatching methods becomes less costly in the future.
Reference Issues/PRs

Follow-up of #25044.

What does this implement/fix? Explain your changes.

This PR moves the Euclidean specialization logic from `EuclideanArgKmin` and `EuclideanRadiusNeighbors` into `Euclidean{DenseDense, SparseSparse}DatasetsPair`.

This is how `EuclideanArgKmin` and `EuclideanRadiusNeighbors` are currently tied to `MiddleTermComputer`:

This is how this PR suggests removing `EuclideanArgKmin` and `EuclideanRadiusNeighbors` and introducing `Euclidean{DenseDense, SparseSparse}DatasetsPair` instead:

Any other comments?
Done

- `DatasetsPair` is instantiated during the `__init__` of `BaseDistancesReduction`, because some parameters computed in the latter are needed by the former.
- `{DenseDense, SparseSparse}MiddleTermComputer` are instantiated directly during the `__init__` of `Euclidean{DenseDense, SparseSparse}DatasetsPair`, removing the need for a `get_for` classmethod to dispatch cases in `MiddleTermComputer`.
- `parallel_on_{X, Y}_pre_compute_and_reduce_distances_on_chunks()` in `{ArgKmin, RadiusNeighbors}` computes and stores `dist_middle_term` within `MiddleTermComputer`.
- `ArgKmin.surrogate_dist()` or `RadiusNeighbors.surrogate_dist()` performs a call to `DatasetsPair` and then to `MiddleTermComputer` to get the `dist_middle_term` quantity (see the sketch after this list).
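To make the resulting call flow concrete, here is a rough Python-level sketch of the last two points above (simplified, with hypothetical attribute and accessor names):

```python
class EuclideanDenseDenseDatasetsPair:
    # Surrogate distance for the Euclidean case:
    #   d²(x_i, y_j) = ||x_i||² - 2·x_i·y_j + ||y_j||²
    # The -2·x_i·y_j middle term has been computed and cached by the
    # MiddleTermComputer in the pre-compute hooks described above.
    def surrogate_dist(self, i, j, thread_num):
        return (
            self.X_norm_squared[i]
            + self.middle_term_computer.dist_middle_term(i, j, thread_num)
            + self.Y_norm_squared[j]
        )
```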
TODO

cc @jjerphan (and @Arnaud15, who is interested in this PR).