
FEA Introduce PairwiseDistances #23958


Closed


@jjerphan (Member) commented Jul 19, 2022

Reference Issues/PRs

Relates to #22587.

What does this implement/fix? Explain your changes.

This adds a new back-end for pairwise_distances computations, using PairwiseDistances without any reduction.
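For context, a minimal usage sketch of the public entry point this back-end serves is shown below; the shapes and the metric are arbitrary illustrations, not values taken from this PR:

```python
# Minimal sketch (illustrative shapes, not from this PR): the public call that
# the new back-end would serve, returning the full distance matrix with no
# reduction applied.
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 100))
Y = rng.standard_normal((500, 100))

D = pairwise_distances(X, Y, metric="manhattan")
print(D.shape)  # (1000, 500)
```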

Any other comments?

TODO:

@jjerphan (Member, Author) commented:

Failures on this PR might be fixed by #23990.

jjerphan and others added 6 commits September 28, 2022 17:58
This:
 - decreases the number of features by an order of magnitude
   because, in the case of float32, the vectors get entirely
   copied for the upcast to float64, which might use too much
   memory and crash the program
 - now accepts the previously xfail parametrisation case by
   setting an absolute error tolerance (since we are comparing
   small values; illustrated below)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
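As a small aside, here is an illustration (not taken from the PR's test suite) of why an absolute tolerance is needed when the values being compared are small:

```python
# Illustration only (not the PR's test code): with values close to zero, a
# relative tolerance alone is overly strict, so an absolute tolerance is added.
import numpy as np

a = np.array([1e-12, 2e-12])
b = np.array([1.1e-12, 2.05e-12])

# With atol=0 this would fail, since rtol * |b| is far smaller than |a - b|.
np.testing.assert_allclose(a, b, rtol=1e-7, atol=1e-10)
```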
@ogrisel (Member) left a comment:


Some more feedback below:

f"usable for this case (EuclideanPairwiseDistances{{name_suffix}}) and will be ignored.",
UserWarning,
stacklevel=3,
)

I would rather avoid introducing a new warning that we actually want to get rid of, so it is better to do this as part of the current PR.

@jjerphan (Member, Author) commented Oct 12, 2022

Using perf(1), we can see that when running pairwise_distances(X, Y, metric="manhattan"), 98% of the time is spent computing distances, i.e. in ManhattanDistance{32,64}.dist:

cdef inline DTYPE_t dist(
    self,
    const {{INPUT_DTYPE_t}}* x1,
    const {{INPUT_DTYPE_t}}* x2,
    ITYPE_t size,
) nogil except -1:
    cdef DTYPE_t d = 0
    cdef cnp.intp_t j
    for j in range(size):
        d += fabs(x1[j] - x2[j])
    return d
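For reference, a plain NumPy equivalent of what this kernel computes (an illustration, not code from the PR):

```python
# NumPy equivalent of the Cython kernel above: the Manhattan (L1) distance is
# the sum of absolute coordinate-wise differences between the two vectors.
import numpy as np

def manhattan_dist(x1: np.ndarray, x2: np.ndarray) -> float:
    return float(np.abs(x1 - x2).sum())

# manhattan_dist(np.array([0.0, 1.0]), np.array([2.0, 3.0])) == 4.0
```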

I am not entirely sure, but looking at the cycles spent on each assembly line, we can see that loop unrolling + SIMD could be used here (currently, SSE registers are only used on scalars):

Percent│
       │    Disassembly of section .text:
       │
       │    00000000000104b0 <__pyx_f_7sklearn_7metrics_13_dist_metrics_17ManhattanDistance_dist>:
       │    __pyx_f_7sklearn_7metrics_13_dist_metrics_17ManhattanDistance_dist():
  0.30 │      test   %rcx,%rcx
  0.04 │    ↓ jle    40
  0.00 │      movq   __pyx_k_C+0x6b4b,%xmm2
  0.01 │      xor    %eax,%eax
       │      pxor   %xmm1,%xmm1
  0.00 │      nop
  0.00 │18:   movsd  (%rsi,%rax,8),%xmm0
  0.28 │      subsd  (%rdx,%rax,8),%xmm0
  0.25 │      add    $0x1,%rax
  0.04 │      andpd  %xmm2,%xmm0
  0.18 │      addsd  %xmm0,%xmm1
+98.76 │      cmp    %rax,%rcx
  0.10 │    ↑ jne    18
  0.00 │      movapd %xmm1,%xmm0
  0.04 │    ← retq
       │      nop
       │40:   pxor   %xmm1,%xmm1
       │      movapd %xmm1,%xmm0
       │    ← retq   

It is insightful to see how it is done, and to observe that 98.76% of the time here is spent checking whether j has reached the range bound, size. I do not know what causes most of the execution time to be spent on this comparison.

@jeremiedbb added this to the 1.2 milestone Oct 13, 2022
@ogrisel (Member) commented Oct 17, 2022

As discussed in real life, it might be interesting to see whether the chunking is detrimental for this operation: indeed, chunking causes non-contiguous writes of the distance values in the output distance matrix array.

It might be worthwhile to conduct dedicated benchmarks with a chunk size of 1 to see whether contiguous writing is beneficial or not.
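A rough benchmark sketch along those lines, assuming the back-end honours the existing pairwise_dist_chunk_size configuration option (array sizes are arbitrary):

```python
# Rough sketch, assuming the back-end reads the existing
# `pairwise_dist_chunk_size` config knob; sizes are arbitrary.
from time import perf_counter

import numpy as np
import sklearn
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.standard_normal((5_000, 100))
Y = rng.standard_normal((5_000, 100))

for chunk_size in (1, 256):
    with sklearn.config_context(pairwise_dist_chunk_size=chunk_size):
        tic = perf_counter()
        pairwise_distances(X, Y, metric="manhattan")
        print(f"chunk_size={chunk_size}: {perf_counter() - tic:.3f}s")
```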

@jjerphan (Member, Author) left a comment:


I just propose to add some references to #24745 in TODO comments.

@jjerphan (Member, Author) commented:

> It might be worthwhile to conduct dedicated benchmarks with a chunk size of 1 to see whether contiguous writing is beneficial or not.

My first intuition is that simply setting chunk_size=1 might help, but we would still rely on a scheduling that is rather too complicated for this case: some details like the heuristic would need to be adapted, we might spend a lot of time jumping between many places due to singleton chunks, etc.

I think _parallel_on_{X,Y} in this case can be made relatively simpler (namely, it would just generalise the current _sparse_manhattan logic); a rough sketch of the idea follows.
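To illustrate the kind of scheduling I have in mind, here is a rough Python-level sketch (not this PR's Cython implementation): parallelise over rows of X so that each task writes one full, contiguous row of the output.

```python
# Rough illustration only (not this PR's implementation): row-wise parallelism
# where each task computes a full, contiguous row of the distance matrix.
import numpy as np
from joblib import Parallel, delayed

def _manhattan_row(x_row: np.ndarray, Y: np.ndarray) -> np.ndarray:
    # Distances from a single row of X to every row of Y.
    return np.abs(Y - x_row).sum(axis=1)

def manhattan_rowwise(X: np.ndarray, Y: np.ndarray, n_jobs: int = 4) -> np.ndarray:
    rows = Parallel(n_jobs=n_jobs)(delayed(_manhattan_row)(x, Y) for x in X)
    return np.vstack(rows)
```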

What do you think?

@jjerphan (Member, Author) commented:

I am removing it from the 1.2 milestone as it needs more thought and an assessment of the trade-offs.

@jjerphan removed this from the 1.2 milestone Nov 22, 2022
@jjerphan (Member, Author) commented Jan 5, 2023

Turning this into a draft for two reasons; to me:

@jjerphan (Member, Author) commented:

Closing since this has been superseded by #25561.

@jjerphan closed this Feb 28, 2023
@jjerphan deleted the feat/pairwise_distances-pdr-backend branch March 9, 2023 17:40