
FEA Introduce PairwiseDistances #23958


Closed


@jjerphan (Member) commented Jul 19, 2022

Reference Issues/PRs

Relates to #22587.

What does this implement/fix? Explain your changes.

This adds a new back-end for pairwise_distances computations, using PairwiseDistances without any reduction.
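For context, a minimal usage sketch of the public entry point this back-end serves is shown below; the shapes and the metric are arbitrary illustrations, not values taken from this PR:

```python
# Minimal sketch (illustrative shapes, not from this PR): the public call that
# the new back-end would serve, returning the full distance matrix with no
# reduction applied.
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 100))
Y = rng.standard_normal((500, 100))

D = pairwise_distances(X, Y, metric="manhattan")
print(D.shape)  # (1000, 500)
```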

Any other comments?

TODO:

@jjerphan (Member, Author) commented:

Failures on this PR might be fixed by #23990.

jjerphan and others added 6 commits September 28, 2022 17:58
This:
 - decreases the number of features by an order of magnitude
   because, in the case of float32, the vectors get entirely
   copied for the upcast to float64, which might use too much
   memory and crash the program
 - now accepts the previously xfail parametrisation case by
   setting an absolute error tolerance (since we are comparing
   small values; illustrated below)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
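As a small aside, here is an illustration (not taken from the PR's test suite) of why an absolute tolerance is needed when the values being compared are small:

```python
# Illustration only (not the PR's test code): with values close to zero, a
# relative tolerance alone is overly strict, so an absolute tolerance is added.
import numpy as np

a = np.array([1e-12, 2e-12])
b = np.array([1.1e-12, 2.05e-12])

# With atol=0 this would fail, since rtol * |b| is far smaller than |a - b|.
np.testing.assert_allclose(a, b, rtol=1e-7, atol=1e-10)
```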
@ogrisel (Member) left a comment:


Some more feedback below:

f"usable for this case (EuclideanPairwiseDistances{{name_suffix}}) and will be ignored.",
UserWarning,
stacklevel=3,
)

I would rather avoid introducing a new warning that we actually want to get rid of, so it is better to do this as part of the current PR.

@jjerphan (Member, Author) commented Oct 12, 2022

Using perf(1), we can see that when running pairwise_distances(X, Y, metric="manhattan"), 98% of the time is spent computing distances, i.e. in ManhattanDistance{32,64}.dist:

cdef inline DTYPE_t dist(
    self,
    const {{INPUT_DTYPE_t}}* x1,
    const {{INPUT_DTYPE_t}}* x2,
    ITYPE_t size,
) nogil except -1:
    cdef DTYPE_t d = 0
    cdef cnp.intp_t j
    for j in range(size):
        d += fabs(x1[j] - x2[j])
    return d
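For reference, a plain NumPy equivalent of what this kernel computes (an illustration, not code from the PR):

```python
# NumPy equivalent of the Cython kernel above: the Manhattan (L1) distance is
# the sum of absolute coordinate-wise differences between the two vectors.
import numpy as np

def manhattan_dist(x1: np.ndarray, x2: np.ndarray) -> float:
    return float(np.abs(x1 - x2).sum())

# manhattan_dist(np.array([0.0, 1.0]), np.array([2.0, 3.0])) == 4.0
```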

I am not entirely sure, but looking at the cycles spent on each assembly line, we can see that loop unrolling + SIMD could be used here (currently, SSE registers are only used on scalars):

Percent│
       │    Disassembly of section .text:
       │
       │    00000000000104b0 <__pyx_f_7sklearn_7metrics_13_dist_metrics_17ManhattanDistance_dist>:
       │    __pyx_f_7sklearn_7metrics_13_dist_metrics_17ManhattanDistance_dist():
  0.30 │      test   %rcx,%rcx
  0.04 │    ↓ jle    40
  0.00 │      movq   __pyx_k_C+0x6b4b,%xmm2
  0.01 │      xor    %eax,%eax
       │      pxor   %xmm1,%xmm1
  0.00 │      nop
  0.00 │18:   movsd  (%rsi,%rax,8),%xmm0
  0.28 │      subsd  (%rdx,%rax,8),%xmm0
  0.25 │      add    $0x1,%rax
  0.04 │      andpd  %xmm2,%xmm0
  0.18 │      addsd  %xmm0,%xmm1
+98.76 │      cmp    %rax,%rcx
  0.10 │    ↑ jne    18
  0.00 │      movapd %xmm1,%xmm0
  0.04 │    ← retq
       │      nop
       │40:   pxor   %xmm1,%xmm1
       │      movapd %xmm1,%xmm0
       │    ← retq   

It is insightful to see how it is done, and to observe that 98.76% of the time here is spent checking whether j has reached the range bound, size. I do not know what causes most of the execution time to be spent on this comparison.

@jeremiedbb added this to the 1.2 milestone Oct 13, 2022
@ogrisel (Member) commented Oct 17, 2022

As discussed in real life, it might be interesting to see whether the chunking is detrimental for this operation: indeed, chunking causes non-contiguous writes of the distance values in the output distance matrix array.

It might be worthwhile to conduct dedicated benchmarks with a chunk size of 1 to see whether contiguous writing is beneficial or not.
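A rough benchmark sketch along those lines, assuming the back-end honours the existing pairwise_dist_chunk_size configuration option (array sizes are arbitrary):

```python
# Rough sketch, assuming the back-end reads the existing
# `pairwise_dist_chunk_size` config knob; sizes are arbitrary.
from time import perf_counter

import numpy as np
import sklearn
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.standard_normal((5_000, 100))
Y = rng.standard_normal((5_000, 100))

for chunk_size in (1, 256):
    with sklearn.config_context(pairwise_dist_chunk_size=chunk_size):
        tic = perf_counter()
        pairwise_distances(X, Y, metric="manhattan")
        print(f"chunk_size={chunk_size}: {perf_counter() - tic:.3f}s")
```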

@jjerphan (Member, Author) left a comment:


I just propose to add some references to #24745 in TODO comments.

@jjerphan (Member, Author) commented:

> It might be worthwhile to conduct dedicated benchmarks with a chunk size of 1 to see whether contiguous writing is beneficial or not.

My first intuition is that simply setting chunk_size=1 might help, but we would still rely on a scheduling that is rather too complicated for this case: some details like the heuristic would need to be adapted, we might spend a lot of time jumping between many places due to singleton chunks, etc.

I think _parallel_on_{X,Y} in this case can be made relatively simpler (namely, it would just generalise the current _sparse_manhattan logic); a rough sketch of the idea follows.
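To illustrate the kind of scheduling I have in mind, here is a rough Python-level sketch (not this PR's Cython implementation): parallelise over rows of X so that each task writes one full, contiguous row of the output.

```python
# Rough illustration only (not this PR's implementation): row-wise parallelism
# where each task computes a full, contiguous row of the distance matrix.
import numpy as np
from joblib import Parallel, delayed

def _manhattan_row(x_row: np.ndarray, Y: np.ndarray) -> np.ndarray:
    # Distances from a single row of X to every row of Y.
    return np.abs(Y - x_row).sum(axis=1)

def manhattan_rowwise(X: np.ndarray, Y: np.ndarray, n_jobs: int = 4) -> np.ndarray:
    rows = Parallel(n_jobs=n_jobs)(delayed(_manhattan_row)(x, Y) for x in X)
    return np.vstack(rows)
```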

What do you think?

@jjerphan (Member, Author) commented:

I am removing it from the 1.2 milestone as it needs more thought and an assessment of the trade-offs.

@jjerphan removed this from the 1.2 milestone Nov 22, 2022
@jjerphan (Member, Author) commented Jan 5, 2023

Turning this into a draft for two reasons; to me:

@jjerphan (Member, Author) commented:

Closing since this has been superseded by #25561.

@jjerphan closed this Feb 28, 2023
@jjerphan deleted the feat/pairwise_distances-pdr-backend branch March 9, 2023 17:40