-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FEA Introduce PairwiseDistances
#23958
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEA Introduce PairwiseDistances
#23958
Conversation
ಠ_ರೃ
Failures on this PR might are fixed by #23990. |
Switch condition for safety
Comes with minor adaptations
This: - decreases the number of features by an order to magnetude because in the case of float32, the vectors gets entirely copied for the upcast to float64. This might use too much memory and crash the program - this now accepts the previously xfail parametrisation case by setting on absolute error (seen we are comparing small values)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more feedback below:
f"usable for this case (EuclideanPairwiseDistances{{name_suffix}}) and will be ignored.", | ||
UserWarning, | ||
stacklevel=3, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather avoid introducing a new warning that we want to actually get rid off, so better do this as part of the current PR.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Using scikit-learn/sklearn/metrics/_dist_metrics.pyx.tp Lines 1110 to 1120 in e41753e
I am not entirely but looking at cycles spent on the assembly code line, we can see that loop unrolling + SIMD Percent│
│ Disassembly of section .text:
│
│ 00000000000104b0 <__pyx_f_7sklearn_7metrics_13_dist_metrics_17ManhattanDistance_dist>:
│ __pyx_f_7sklearn_7metrics_13_dist_metrics_17ManhattanDistance_dist():
0.30 │ test %rcx,%rcx
0.04 │ ↓ jle 40
0.00 │ movq __pyx_k_C+0x6b4b,%xmm2
0.01 │ xor %eax,%eax
│ pxor %xmm1,%xmm1
0.00 │ nop
0.00 │18: movsd (%rsi,%rax,8),%xmm0
0.28 │ subsd (%rdx,%rax,8),%xmm0
0.25 │ add $0x1,%rax
0.04 │ andpd %xmm2,%xmm0
0.18 │ addsd %xmm0,%xmm1
+98.76 │ cmp %rax,%rcx
0.10 │ ↑ jne 18
0.00 │ movapd %xmm1,%xmm0
0.04 │ ← retq
│ nop
│40: pxor %xmm1,%xmm1
│ movapd %xmm1,%xmm0
│ ← retq It's insightful to see how it's done, and to assess that 98.76% the time here is spent checking if |
As discussed in real life, it might be interesting to see of the chunking is not detrimental for this operation: indeed chunking will cause non-contiguous writing of the distance values in the distance matrix output array. It might be worthwhile to conduct dedicated benchmarks to use a chunk size of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just propose to add some references to #24745 in TODO comments.
My first intuition is that simply setting I think What do you think? |
I am removing it from the 1.2 milestone as it needs more thought and thread-offs' assessments. |
Turning this into a draft for two reasons; to me:
|
Closing since this has been superseded by #25561. |
Reference Issues/PRs
Relates to #22587.
What does this implement/fix? Explain your changes.
This adds a new back-end to
pairwise_distances
computations usingPairwiseDistances
without any reduction.Any other comments?
TODO: