
ENH Euclidean specialization of DatasetsPair instead of ArgKmin and RadiusNeighbors #25170


Conversation


@Vincent-Maladiere Vincent-Maladiere commented Dec 12, 2022

Reference Issues/PRs

Follow up of #25044

> Redesign DatasetsPair w.r.t. the new MiddleTermComputer to remove duplicated logic and to clarify responsibilities (esp. squared Euclidean norm computations)

What does this implement/fix? Explain your changes.

This PR moves the Euclidean specialization logic from EuclideanArgKmin and EuclideanRadiusNeighbors into Euclidean{DenseDense, SparseSparse}DatasetsPair.

This is how EuclideanArgKmin and EuclideanRadiusNeighbors are currently tied to MiddleTermComputer:

[diagram: current design, with EuclideanArgKmin and EuclideanRadiusNeighbors tied directly to MiddleTermComputer]

This is how this PR suggests removing EuclideanArgKmin and EuclideanRadiusNeighbors and introducing Euclidean{DenseDense, SparseSparse}DatasetsPair instead:

[diagram: proposed design, with Euclidean{DenseDense, SparseSparse}DatasetsPair owning the MiddleTermComputer]

Any other comments?

Done

  • DatasetsPair is instantiated during the __init__ of BaseDistancesReduction because some parameters computed in the latter are needed by the former.
  • {DenseDense, SparseSparse}MiddleTermComputer are instantiated directly during the __init__ of Euclidean{DenseDense, SparseSparse}DatasetsPair, removing the need for a get_for classmethod to dispatch cases in MiddleTermComputer.
  • parallel_on_{X, Y}_pre_compute_and_reduce_distances_on_chunks() in {ArgKmin, RadiusNeighbors} computes and stores dist_middle_term within MiddleTermComputer.
  • Calling ArgKmin.surrogate_dist() or RadiusNeighbors.surrogate_dist() performs a call to DatasetsPair and then to MiddleTermComputer to get the dist_middle_term quantity.
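The ownership chain described in these points can be sketched in plain Python (a hedged illustration only: class names follow the PR description, but signatures and bodies are simplified stand-ins for the actual Cython implementation):

```python
import numpy as np

class MiddleTermComputer:
    """Computes the -2 * X @ Y.T "middle term" of the squared Euclidean
    expansion ||x - y||^2 = ||x||^2 - 2<x, y> + ||y||^2, chunk by chunk."""

    def compute_dist_middle_terms(self, X_chunk, Y_chunk):
        return -2.0 * X_chunk @ Y_chunk.T


class EuclideanDenseDenseDatasetsPair:
    """Owns its MiddleTermComputer directly, so no get_for dispatch is needed."""

    def __init__(self, X, Y):
        self.X, self.Y = X, Y
        self.middle_term_computer = MiddleTermComputer()
        # Squared norms of Y, precomputed once at construction.
        self.Y_sq_norms = np.einsum("ij,ij->i", Y, Y)

    def surrogate_dist(self, i, j, dist_middle_terms):
        # For a fixed query row i, ||x_i||^2 is a constant offset, so
        # ||y_j||^2 - 2<x_i, y_j> preserves the ordering of distances.
        return self.Y_sq_norms[j] + dist_middle_terms[i, j]
```

In the real code, dist_middle_terms would be filled chunk-wise by parallel_on_{X, Y}_pre_compute_and_reduce_distances_on_chunks() before surrogate_dist is consulted.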

TODO

cc @jjerphan (and @Arnaud15 who is interested in this PR).

  • See this note for more details on the Euclidean specialization (in French).


@jjerphan jjerphan left a comment


Thanks @Vincent-Maladiere for this significant rework! 💯

The design looks good and implementable at first sight.

I think we had better check for any performance regression now before proceeding. In this regard, I think scikit-learn/pairwise-distances-reductions-asv-suite can help you.

I recommend reducing the parametrisation of the benchmark to cover:

  • metric="euclidean"
  • strategy="auto"
  • dtype in [float32, float64]
  • (X_train, X_test) in (("dense", "dense"), ("csr", "csr"))
  • return_distance=True

and running them only for time_ArgKmin, which can be done using:

asv continuous --verbose --show-stderr --split \
    --bench "PairwiseDistancesReductionsBenchmark.time_ArgKmin" \
    main refs/pull/25170/head

On a machine with 128 physical cores, this might easily take 10 hours to run (because the default timeout has been changed from 500 to 1000).

You might want to start with quick runs to check for any problems or to get a rough assessment of performance. For this, I recommend:

  • running the benchmark on pairs of datasets with small to medium numbers of samples (i.e. 1000 up to 100_000)
  • adding the --quick option to the asv continuous command above so that each combination runs only once
  • decreasing the timeout down to a few hundred seconds

In the meantime, here are a few comments on this first iteration.

@@ -417,8 +350,12 @@ cdef class SparseSparseMiddleTermComputer{{name_suffix}}(MiddleTermComputer{{nam

def __init__(
self,
X,
Y,
const DTYPE_t[:] X_data,
Member

Suggested change
const DTYPE_t[:] X_data,
const {{INPUT_DTYPE_t}}[:] X_data,

const DTYPE_t[:] X_data,
const SPARSE_INDEX_TYPE_t[:] X_indices,
const SPARSE_INDEX_TYPE_t[:] X_indptr,
const DTYPE_t[:] Y_data,
Member

Suggested change
const DTYPE_t[:] Y_data,
const {{INPUT_DTYPE_t}}[:] Y_data,

Contributor Author

Thanks, this is now fixed.

This one is interesting, though, because in the main branch we currently cast X_data and Y_data to float64, and we call the routine sparse_sparse_middle_term_computation_64 for both SparseMiddleTermComputer{32, 64}.

I saw that you changed this behavior in your work on SparseDenseDatasetsPair to avoid casting. Could you provide more details about this decision?

Comment on lines 728 to 734
self.middle_term_computer._compute_dist_middle_terms(
X_start,
X_end,
Y_start,
Y_end,
thread_num,
)
Member

I think this should rather be moved into MiddleTermComputer._parallel_on_Y_pre_compute_and_reduce_distances_on_chunks.

Comment on lines 680 to 686
self.middle_term_computer._compute_dist_middle_terms(
X_start,
X_end,
Y_start,
Y_end,
thread_num,
)
Member

I think this should rather be moved into MiddleTermComputer._parallel_on_X_pre_compute_and_reduce_distances_on_chunks.

):
super().__init__(X, Y, distance_metric)

# Used to compute the surrogate distance.
Member

I think this can be read as:

It used to compute the surrogate distance.

rather than (what I think you meant):

It is used to compute the surrogate distance.

Moreover, can you explain how this is being used?

Comment on lines +137 to +138
# several orders of magnitude compared to the generic {ArgKmin, RadiusNeighbors}
# implementation.
Member

Suggested change
# several orders of magnitude compared to the generic {ArgKmin, RadiusNeighbors}
# implementation.
# several orders of magnitude compared to the generic
# {ArgKmin, RadiusNeighbors} implementations.

Comment on lines +126 to +127
if use_euclidean_specialization:
use_squared_distances = metric == "sqeuclidean"
Member

I think the right-hand-side expression can be inlined at the two call sites that use use_squared_distances hereinafter, and these two lines can be removed.

Comment on lines 71 to +72
dict metric_kwargs=None,
dict euclidean_kwargs=None,
Member

I would combine both extra kwargs dictionaries here into a single metric_kwargs.

What do you think?

Contributor Author

Yes, I think that would make sense and would be better than having two kwargs.
We need to extract the variables related to MiddleTermComputer from metric_kwargs before passing metric_kwargs to DistanceMetric during DatasetsPair.get_for().
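A minimal sketch of that extraction step (the helper name is hypothetical, and the key names X_norm_squared / Y_norm_squared are assumptions about which entries would belong to MiddleTermComputer):

```python
def split_metric_kwargs(metric_kwargs):
    """Pop MiddleTermComputer-related entries out of metric_kwargs so that
    only DistanceMetric-compatible entries are forwarded to get_for()."""
    remaining = dict(metric_kwargs or {})
    middle_term_kwargs = {
        key: remaining.pop(key)
        for key in ("X_norm_squared", "Y_norm_squared")
        if key in remaining
    }
    return middle_term_kwargs, remaining
```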

)


cdef class EuclideanRadiusNeighbors{{name_suffix}}(RadiusNeighbors{{name_suffix}}):
Member

Once again: So noice. 💯


@jjerphan jjerphan left a comment


A few other comments on top of the older ones.

Comment on lines 146 to 151
X_start,
Y_start,
X_start + i,
Y_start + j,
i,
j,
n_samples_Y,
thread_num,
Member

Is it possible to use keyword arguments, here?

This comment also applies to other places calling DatasetsPair.surrogate_dist.

@@ -41,6 +41,7 @@ cdef class DatasetsPair{{name_suffix}}:
ITYPE_t Y_start,
ITYPE_t i,
ITYPE_t j,
ITYPE_t n_Y,
ITYPE_t thread_num=*
Member

What is the reason to use * as a default here? What are the semantics?

Contributor Author

I added * so that surrogate_dist could take 0 as the default for thread_num. Since thread_num is a new argument of surrogate_dist, I've set a default parameter to avoid impacting other calls. WDYT?
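For readers unfamiliar with this Cython idiom (a generic sketch, not the actual declarations from this PR — file names are illustrative): in a .pxd declaration file, a default value cannot be spelled out, so `=*` only declares "this argument has a default", and the concrete value appears in the matching .pyx implementation.

```cython
# datasets_pair.pxd — the declaration: `=*` only signals "has a default"
cdef class DatasetsPair:
    cdef DTYPE_t surrogate_dist(
        self, ITYPE_t i, ITYPE_t j, ITYPE_t thread_num=*
    ) nogil

# datasets_pair.pyx — the implementation carries the actual default value
cdef class DatasetsPair:
    cdef DTYPE_t surrogate_dist(
        self, ITYPE_t i, ITYPE_t j, ITYPE_t thread_num=0
    ) nogil:
        ...
```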


jjerphan commented Dec 19, 2022

@Vincent-Maladiere: @Micky774 has been working on
Micky774#7 to meet similar needs for #24076.

As discussed with @Micky774 on a call, you both might be interested in syncing up, since this PR and the one linked above are highly similar in scope.

@Vincent-Maladiere
Contributor Author

I ran the ASV benchmark between main and this PR branch @63bba0cd. Using 128 cores for the dense-dense case of ArgKmin only, runtimes are significantly worse for 5 cases out of 18.

I will run profiling via py-spy / speedscope on this branch to compare it with main so that we can locate the bottleneck. @jjerphan WDYT?

before           after         ratio
     [ce89a4ff]       [63bba0cd]
     <main>           <pull/25170/head>
+         257±2ms         385±10ms     1.50  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_ArgKmin(10000, 100000, 100, 'euclidean', 'auto', <class 'numpy.float64'>, 'dense', 'dense')
+        283±10ms         396±10ms     1.40  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_ArgKmin(10000, 100000, 100, 'euclidean', 'auto', <class 'numpy.float32'>, 'dense', 'dense')
+         3.97±0m          5.24±0m     1.32  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_ArgKmin(10000000, 100000, 100, 'euclidean', 'auto', <class 'numpy.float64'>, 'dense', 'dense')
+         4.03±0m          4.96±0m     1.23  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_ArgKmin(10000000, 100000, 100, 'euclidean', 'auto', <class 'numpy.float32'>, 'dense', 'dense')
+         27.3±0s          32.9±0s     1.20  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_ArgKmin(10000000, 10000, 100, 'euclidean', 'auto', <class 'numpy.float64'>, 'dense', 'dense')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
Full ASV results
[ 0.00%] · For scikit-learn commit 63bba0cd <pull/25170/head> (round 1/1):
[50.00%] ··· ========== ======== ============ =========== ========== =============== ========= ======== ============
              n_train    n_test   n_features     metric    strategy       dtype       X_train   X_test
             ---------- -------- ------------ ----------- ---------- --------------- --------- -------- ------------
                1000      1000       100       euclidean     auto     numpy.float32    dense    dense     224±4ms
                1000      1000       100       euclidean     auto     numpy.float64    dense    dense     206±4ms
                1000     10000       100       euclidean     auto     numpy.float32    dense    dense     104±2ms
                1000     10000       100       euclidean     auto     numpy.float64    dense    dense     88.2±3ms
                1000     100000      100       euclidean     auto     numpy.float32    dense    dense     152±9ms
                1000     100000      100       euclidean     auto     numpy.float64    dense    dense     103±20ms
               10000      1000       100       euclidean     auto     numpy.float32    dense    dense     229±2ms
               10000      1000       100       euclidean     auto     numpy.float64    dense    dense     223±20ms
               10000     10000       100       euclidean     auto     numpy.float32    dense    dense    1.47±0.06s
               10000     10000       100       euclidean     auto     numpy.float64    dense    dense    1.31±0.06s
               10000     100000      100       euclidean     auto     numpy.float32    dense    dense     396±10ms
               10000     100000      100       euclidean     auto     numpy.float64    dense    dense     385±10ms
              10000000    1000       100       euclidean     auto     numpy.float32    dense    dense    3.19±0.1s
              10000000    1000       100       euclidean     auto     numpy.float64    dense    dense    3.49±0.01s
              10000000   10000       100       euclidean     auto     numpy.float32    dense    dense     30.0±0s
              10000000   10000       100       euclidean     auto     numpy.float64    dense    dense     32.9±0s
              10000000   100000      100       euclidean     auto     numpy.float32    dense    dense     4.96±0m
              10000000   100000      100       euclidean     auto     numpy.float64    dense    dense     5.24±0m
             ========== ======== ============ =========== ========== =============== ========= ======== ============

[50.00%] · For scikit-learn commit ce89a4ff <main> (round 1/1):
[50.00%] ·· Building for conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[100.00%] ··· ========== ======== ============ =========== ========== =============== ========= ======== ============
               n_train    n_test   n_features     metric    strategy       dtype       X_train   X_test
              ---------- -------- ------------ ----------- ---------- --------------- --------- -------- ------------
                 1000      1000       100       euclidean     auto     numpy.float32    dense    dense     222±3ms
                 1000      1000       100       euclidean     auto     numpy.float64    dense    dense     204±6ms
                 1000     10000       100       euclidean     auto     numpy.float32    dense    dense     98.3±1ms
                 1000     10000       100       euclidean     auto     numpy.float64    dense    dense     88.3±1ms
                 1000     100000      100       euclidean     auto     numpy.float32    dense    dense     130±20ms
                 1000     100000      100       euclidean     auto     numpy.float64    dense    dense    91.1±30ms
                10000      1000       100       euclidean     auto     numpy.float32    dense    dense     234±2ms
                10000      1000       100       euclidean     auto     numpy.float64    dense    dense     227±5ms
                10000     10000       100       euclidean     auto     numpy.float32    dense    dense    1.41±0.04s
                10000     10000       100       euclidean     auto     numpy.float64    dense    dense    1.22±0.05s
                10000     100000      100       euclidean     auto     numpy.float32    dense    dense     283±10ms
                10000     100000      100       euclidean     auto     numpy.float64    dense    dense     257±2ms
               10000000    1000       100       euclidean     auto     numpy.float32    dense    dense    2.93±0.01s
               10000000    1000       100       euclidean     auto     numpy.float64    dense    dense     2.93±0s
               10000000   10000       100       euclidean     auto     numpy.float32    dense    dense     27.6±0s
               10000000   10000       100       euclidean     auto     numpy.float64    dense    dense     27.3±0s
               10000000   100000      100       euclidean     auto     numpy.float32    dense    dense     4.03±0m
               10000000   100000      100       euclidean     auto     numpy.float64    dense    dense     3.97±0m
              ========== ======== ============ =========== ========== =============== ========= ======== ============


jjerphan commented Dec 20, 2022

@Vincent-Maladiere: thanks for reporting those results!

As you have proposed, I would profile the first reported case on the two commits (i.e. 63bba0cd for this branch, pull/25170/head, and main) using py-spy on your local machine, that is:

+         257±2ms         385±10ms     1.50  pairwise_distances_reductions.PairwiseDistancesReductionsBenchmark.time_ArgKmin(10000, 100000, 100, 'euclidean', 'auto', <class 'numpy.float64'>, 'dense', 'dense')

Native code can be profiled and results exported for SpeedScope using:

py-spy record --native -o py-spy.profile -f speedscope -- python ./script.py

Profile files can be uploaded here so that they can be inspected by readers as well.


Vincent-Maladiere commented Dec 22, 2022

For future readers: the py-spy --native option, which collects traces in Cython, C, or C++ code, is unavailable on macOS.

I ran the following benchmark on a single thread to ease the understanding with py-spy:

export OMP_NUM_THREADS=1

import numpy as np
import threadpoolctl

from sklearn.metrics._pairwise_distances_reduction import ArgKmin

rng = np.random.RandomState(42)

dtype = np.float64
n_train, n_test, n_features = 10_000, 100_000, 100
X_train = rng.randn(n_train, n_features).astype(dtype)
X_test = rng.randn(n_test, n_features).astype(dtype)

controller = threadpoolctl.ThreadpoolController()
with controller.limit(limits=1, user_api=None):

    ArgKmin.compute(
        X=X_test,
        Y=X_train,
        k=10,
        metric="euclidean",
        return_distance=True,
        strategy="auto",
    )

GitHub can't handle py-spy SVG profiles, so I placed them on Dropbox. You can download them at:
py-spy-main
py-spy-revamp

and upload each profile on speedscope.app.

Using the left-heavy view of speedscope, one can see that EuclideanDenseDenseDatasetsPair.surrogate_dist() takes up to twice as long to execute as our current EuclideanArgKmin.compute_and_reduce_distances_on_chunks().

So, in an attempt to optimize EuclideanDenseDenseDatasetsPair.surrogate_dist(), I moved dist_middle_terms_chunks from MiddleTermComputer to EuclideanDenseDenseDatasetsPair so that surrogate_dist() can fetch dist_middle_terms_chunks from self directly, without having to access it through self.middle_term_computer.

However, after a new ASV benchmark on 64 cores, we still have a similar performance decrease.

Another possible explanation for this decrease is that we make a lot of calls to surrogate_dist, and the per-call overhead might be costly. I'm now trying to reduce the logic of this function to a simple lookup into a pre-computed vector.
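A NumPy sketch of that "pre-computed lookup" direction (hypothetical, just to illustrate the idea — the real implementation does this in Cython with a BLAS GEMM): precompute everything chunk-wise, so each later surrogate_dist(i, j) reduces to a plain array read.

```python
import numpy as np

def precompute_surrogates(X_chunk, Y_chunk):
    # ||x - y||^2 = ||x||^2 - 2<x, y> + ||y||^2; since ||x_i||^2 is a
    # constant offset per query row, ||y_j||^2 - 2<x_i, y_j> preserves the
    # ordering of distances. One GEMM plus one vector of squared norms per
    # chunk pair; surrogate_dist(i, j) then becomes surrogates[i, j].
    Y_sq = np.einsum("ij,ij->i", Y_chunk, Y_chunk)
    return -2.0 * X_chunk @ Y_chunk.T + Y_sq[None, :]
```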

WDYT @jjerphan?


jjerphan commented Jan 2, 2023

Thanks for reporting this, @Vincent-Maladiere.

Another possible explanation for this decrease is that we make a lot of function calls to surrogate_dist and that might be inefficient (due to some overhead). I'm now trying to reduce the logic of this function as a simple lookup from a pre-computed vector.

I think this might be a reason for the performance regression. Still, when looking at the two py-spy profiles you report with SpeedScope, it seems that the GEMM kernel (dgemm_kernel_PRESCOTT) also takes more time to run. Potentially, the latest changes you made altered the behavior of the implementation with respect to data and cache locality: this can go as far as flushing the L1 instruction cache, which previously might have held the kernel's instructions but might no longer.

I think we need to spend some time inspecting it with other tools like perf(1) or Intel VTune.

@jjerphan

Thank you for exploring this, @Vincent-Maladiere.

This allowed us to understand the tradeoff between a flexible design and the performance impact of method dispatch, making us converge on #25044.

I am closing this PR for now; we could probably reopen it if dispatching methods becomes less costly in the future.
