-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MAINT Adapt PairwiseDistancesReduction
heuristic for strategy="auto"
#24043
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAINT Adapt PairwiseDistancesReduction
heuristic for strategy="auto"
#24043
Conversation
The radius definition is from Olivier: scikit-learn#22320 (review) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
e0aad59
to
fec7843
Compare
With the current heuristic, we get using the
Full asv trace
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice. Do you confirm that this fixes the scalability degradation observed in the plot of the description of #23865 when the number of CPU threads is large compared to the number of samples?
I am +1 for merging as is as this is already a net performance improvement but I as suspect that the variant I discuss below might even be a bit better in some cases. Can be done in a follow-up PR.
|
||
param_names = ["n_train", "n_test", "n_features", "metric", "strategy"] | ||
params = [ | ||
[1000, 10_000, int(1e7)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As stated in #24120 (review) and #24120 (comment), I do not think we can include this under asv_benchmarks
.
I'm okay with only updating the heuristic in this PR and merge the benchmark separately.
I can't confirm because I have not explored it as I thought it to be orthogonal to fixing the heuristic. Still, it has to be done and I guess a subsequent dedicated PR is appropriate. |
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
I've accepted @ogrisel's suggestion which code-wise revert this branch to fec7843c, the revision for which benchmark results' (see #24043 (comment)) were reported. I think this can be further finer-grained refined (e.g. tuning constants), but at least this fixes behaviour observed in #24076 so that @Micky774 can perform benchmark. |
@thomasjpfan ok for merge with this updated version? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…o"` (scikit-learn#24043) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Reference Issues/PRs
Part of #22587.
What does this implement/fix? Explain your changes.
Currently, the heuristic to choose between
parallel_on_X
orparallel_on_Y
is sub-optimal and does not take the number of samples ofY
into account.This changes the heuristic to add a comparison between the number of samples of the two datasets.