PERF PairwiseDistancesReductions subsequent work #25888

Open
3 of 21 tasks
jjerphan opened this issue Mar 17, 2023 · 54 comments
Labels
cython · Meta-issue (General issue associated to an identified list of tasks) · Performance

Comments

@jjerphan
Member

jjerphan commented Mar 17, 2023

PairwiseDistancesReductions have been introduced as a hierarchy of Cython classes implementing the back-ends of several scikit-learn algorithms.

Initial work was listed in #22587.

💡 See this presentation for a better understanding of the design of PairwiseDistancesReductions.

💡 Before working on improving performance, one must profile the current execution of the algorithms to check whether there are substantial benefits to be gained.

Subsequent possible work includes:

github-actions bot added the Needs Triage (Issue requires triage) label on Mar 17, 2023
jjerphan added the cython, Long Term, Meta-issue (General issue associated to an identified list of tasks), and Performance labels and removed the Needs Triage label on Mar 17, 2023
@jjerphan
Member Author

@OmarManzoor: you also might be interested in items of this issue. 🙂

@OmarManzoor
Contributor

@OmarManzoor: you also might be interested in items of this issue. 🙂

Nice. Thanks!

@adam2392
Member

  • Support "precomputed" distances

Looking into this...

@jjerphan
Member Author

I started working on this item with jjerphan@78b3aa4.

The logic of PairwiseDistancesReductions has to be adapted for metric="precomputed". Feel free to take an alternative approach. 🙂

@OmarManzoor
Contributor

@jjerphan Would the remaining items under "Implement back-end for other Estimators or Transformers" require a significant amount of work?

@jjerphan
Member Author

jjerphan commented Mar 20, 2023

Basically, some of the interfaces whose performance was improved in 1.1 with ArgKmin and RadiusNeighbors (via kneighbors and radius_neighbors) might be optimizable with dedicated back-ends.

Hence, I think this requires a significant amount of work, yes.

Edit: for now, I have added a link to a presentation explaining the design of those implementations to the description of this issue. I think I will write more about the design (cc @adam2392).

@OmarManzoor
Contributor

Basically, some of the interfaces whose performance was improved in 1.1 with ArgKmin and RadiusNeighbors (via kneighbors and radius_neighbors) might be optimizable with dedicated back-ends.

Hence, I think this requires a significant amount of work, yes.

Edit: for now, I have added a link to a presentation explaining the design of those implementations to the description of this issue. I think I will write more about the design (cc @adam2392).

Thank you for sharing the presentation.

@jjerphan
Member Author

You are welcome.

@OmarManzoor
Contributor

You are welcome.

This looks interesting but I might need time to understand it properly.

@jjerphan
Member Author

jjerphan commented Mar 20, 2023

This presentation is a bit out of date.

You might also want to have a look at the in-code documentation here:

# Overview
# --------
#
# This module provides routines to compute pairwise distances between a set
# of row vectors of X and another set of row vectors of Y and apply a
# reduction on top. The canonical example is the brute-force computation
# of the top k nearest neighbors by leveraging the arg-k-min reduction.
#
# The reduction takes a matrix of pairwise distances between rows of X and Y
# as input and outputs an aggregate data-structure for each row of X. The
# aggregate values are typically smaller than the number of rows in Y, hence
# the term reduction.
#
# For computational reasons, the reductions are performed on the fly on chunks
# of rows of X and Y so as to keep intermediate data-structures in CPU cache
# and avoid unnecessary round trips of large distance arrays with RAM,
# which would otherwise severely degrade speed by making the overall
# processing memory-bound.
#
# Finally, the routines follow a generic parallelization template to process
# chunks of data with OpenMP loops (via Cython prange), either on rows of X
# or rows of Y depending on their respective sizes.
#
#
# Dispatching to specialized implementations
# ------------------------------------------
#
# Dispatchers are meant to be used in the Python code. Under the hood, a
# dispatcher must only define the logic to choose, at runtime, the correct
# dtype-specialized :class:`BaseDistancesReductionDispatcher` implementation based
# on the dtype of X and of Y.
#
#
# High-level diagram
# ------------------
#
# Legend:
#
# A ---⊳ B: A inherits from B
# A ---x B: A dispatches to B
#
#
#                        (base dispatcher)
#               BaseDistancesReductionDispatcher
#                              ∆
#                              |
#                              |
#         +--------------------+--------------------+
#         |                                         |
#    (dispatcher)                              (dispatcher)
#      ArgKmin                               RadiusNeighbors
#         |                                         |
#         |                                         |
#         |      (float{32,64} implem.)             |
#         |   BaseDistancesReduction{32,64}         |
#         |                ∆                        |
#         |                |                        |
#         |                |                        |
#         |      +---------+----------+             |
#         |      |                    |             |
#         |      |                    |             |
#         x      |                    |             x
#     ArgKmin{32,64}            RadiusNeighbors{32,64}
#         |      ∆                    ∆             |
#         |      |                    |             |
#  ===================== Specializations ====================
#         |      |                    |             |
#         |      |                    |             |
#         x      |                    |             x
#  EuclideanArgKmin{32,64}     EuclideanRadiusNeighbors{32,64}
#
# For instance :class:`ArgKmin` dispatches to:
# - :class:`ArgKmin64` if X and Y are two `float64` array-likes
# - :class:`ArgKmin32` if X and Y are two `float32` array-likes
#
# In addition, if the metric parameter is set to "euclidean" or "sqeuclidean",
# then `ArgKmin{32,64}` further dispatches to `EuclideanArgKmin{32,64}`. For
# example, :class:`ArgKmin64` would dispatch to :class:`EuclideanArgKmin64`, a
# specialized subclass that optimally handles the Euclidean distance case
# using Generalized Matrix Multiplication over `float64` data (see the
# docstring of :class:`GEMMTermComputer64` for details).
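For intuition, here is a minimal NumPy sketch of the arg-k-min reduction described in the overview above (illustrative only; the actual implementation performs this per chunk, in Cython, without materializing the full distance matrix):

import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.standard_normal((5, 3))    # query vectors
Y = rng.standard_normal((20, 3))   # database vectors

# Naive arg-k-min: for each row of X, the indices of its k closest rows of Y.
k = 3
D = pairwise_distances(X, Y)                    # shape (n_X, n_Y)
argkmin_indices = np.argsort(D, axis=1)[:, :k]  # shape (n_X, k), the reduced output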

@OmarManzoor
Contributor

This presentation is a bit out of date.

You might also want to have a look at the in-code documentation here: [...]

I think I'll have a look at this after wrapping up the binary tree PR since I have already started it. What do you think?

@jjerphan
Member Author

Yes, let's probably finish #25914 first.

@OmarManzoor
Contributor

@jjerphan I have started looking at LocalOutlierFactor. The method kneighbors is being used within fit and score_samples. Do we need to try out a dedicated backend for both these cases?

@jjerphan
Member Author

Yes.

Similarly to ArgKminClassMode, which was introduced for KNeighbors*.predict_proba by @Micky774, we could imagine back-ends for LocalOutlierFactor.{fit,score_samples}.

@OmarManzoor
Contributor

OmarManzoor commented Mar 31, 2023

@jjerphan At this point we actually need the distances to all of the points in order to extract the distances of the neighbors.

# k-distance of each neighbor, taken from the distances stored during fit
dist_k = self._distances_fit_X_[neighbors_indices, self.n_neighbors_ - 1]
# reachability distance: max(d(query, neighbor), k-distance(neighbor))
reach_dist_array = np.maximum(distances_X, dist_k)

Where would this be available in pairwise distances?

@jjerphan
Member Author

jjerphan commented Mar 31, 2023

I do not think we need to compute and store the distance matrices.

The new back-end would change the implementation so that self._distances_fit_X_ returned by kneighbors would not need to be stored: kneighbors, _local_reachability_density and the computation of lrd_ratios_array should be replaceable by a new back-end with logic similar to that of ArgKmin.

This might be a sizeable rework, and I have not been able to look at it yet, but I think it is feasible.

Let me know if this is clear or not.
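For illustration only, here is a NumPy sketch of the per-query reduction such a back-end could fuse; the function name and signature are hypothetical, not scikit-learn API:

import numpy as np

def local_reachability_density_of_query(dist_row, neigh_idx, kdist_fit):
    # dist_row  : distances from one query point to its k nearest training points
    # neigh_idx : indices of those k training points
    # kdist_fit : precomputed k-distance of every training point
    # reachability distance: max(d(query, neighbor), k-distance(neighbor))
    reach_dist = np.maximum(dist_row, kdist_fit[neigh_idx])
    return 1.0 / (np.mean(reach_dist) + 1e-10)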

@OmarManzoor
Contributor

OmarManzoor commented May 30, 2023

Hi @jjerphan , @Micky774

If we want to optimize RadiusNeighborsClassifier's predict_proba and predict methods, should we run some profiling first? This does seem similar to KNeighborsClassifier's predict in terms of how the methods are called.

@jjerphan
Member Author

I would run some profiling ahead of implementations, yes.

@OmarManzoor
Contributor

I would run some profiling ahead of implementations, yes.

Could you kindly guide me on how to run such profiling? You mentioned that we can use py-spy, but should I create a Python script to run the classifier's desired methods under py-spy?

@Micky774
Contributor

Micky774 commented May 30, 2023

Could you kindly guide me on how to run such profiling? You mentioned that we can use py-spy, but should I create a Python script to run the classifier's desired methods under py-spy?

Yes, basically. Once you create a Python script whose main time bottleneck is the code section you want to profile (ideally, the majority of the time is spent in predict_proba), you can run py-spy record via the CLI to analyze it.

In particular you'll want to run something like py-spy record --native -o py-spy.profile -f speedscope -- python <script to profile>

This ensures that native C/Cython functions are included in the profile, and it will output a speedscope-compatible profile report. Note that this is not the only output format you can use, but it is especially convenient for general profiling.
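As a concrete, hypothetical example, a script like the following keeps predict_proba as the dominant cost (the dataset sizes and radius are arbitrary and only meant to make the profile readable):

# profile_radius_predict_proba.py (hypothetical file name)
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.standard_normal((20_000, 10))
y = rng.randint(0, 3, size=X.shape[0])

# outlier_label avoids errors if a query has no neighbors within the radius.
clf = RadiusNeighborsClassifier(radius=4.0, outlier_label="most_frequent").fit(X, y)

# Repeat the call so that predict_proba dominates the flame graph.
for _ in range(10):
    clf.predict_proba(X[:1_000])

You can then run it under the py-spy command above and open the resulting file in speedscope.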

@kyrajeep

I thought supporting "precomputed" distances would be a good task to start with, but after reading the discussion I am not sure it is something I can do. Does that part have to be reorganized? Perhaps I should start with a different estimator?

@jjerphan
Member Author

jjerphan commented Jun 17, 2024

Hi @kyrajeep ,

Introducing the support for precomputed distance matrices is definitely something you can do. 👍

Note that @adam2392 started this work with #26032. I would recommend discussing the state of this PR before starting anything new.

@kyrajeep

Hello @jjerphan @OmarManzoor @ogrisel,
I decided to look at other estimators since the precomputed option does not seem to require too much more work.

I have some questions regarding sklearn/neighbors. It seems like _kd_tree.pyx.tp and _ball_tree.pyx.tp are generated from _binary_tree.pxi.tp. Is that the case?

For _graph.py in neighbors, I do not see a .pyx file, so I created one and wrote a little bit of code to get some clarity. There is a lot of prior discussion to catch up on for this topic, which confuses me sometimes. :) I'd like to keep working on _graph.pyx.

What was your and others' workflow for _kd_tree.pyx.tp and _binary_tree.pxi.tp? Thanks so much!

@jjerphan
Member Author

jjerphan commented Jun 25, 2024

I decided to look at other estimators since the precomputed option does not seem to require too much more work.

If we want to do it correctly, it does require a fairly large amount of work for a first contribution (we need to adapt part of the class hierarchy).

I have some questions regarding sklearn/neighbors. It seems like _kd_tree.pyx.tp and _ball_tree.pyx.tp are generated from _binary_tree.pxi.tp. Is that the case?

The latter is included in the first two after Tempita expansion.

@jjerphan
Member Author

jjerphan commented Jun 25, 2024

What was your and others workflow for _kd_tree.pyx.tp and _binary_tree.pxi.tp? Thanks so much!

I would recommend starting by using Cython in a Jupyter notebook.

@kyrajeep

Great, thanks for letting me know. I will take another look and reach out to the previous contributor. That looks like a good place to start!

@kyrajeep

kyrajeep commented Jun 27, 2024

I have not gotten a reply about the status of PR #26032 yet. How should I proceed? Should I wait longer?

@jjerphan
Member Author

Responses in open-source projects sometimes take weeks. You can start a draft PR and mention it there. :)

@adam2392
Member

Hi, feel free to move the work along, @kyrajeep.

I never got around to figuring out how the tempita cython code works.

Found it harder than reading C++ 😝. Hope to revisit this issue though to contribute down the line.

@jjerphan
Member Author

I never got around to figuring out how the tempita cython code works.
Found it harder than reading C++ 😝

Some maintainers share this perspective, some do not.

I think there would be benefits to using something other than Tempita, but that would take some time and mental bandwidth (which I personally don't have).

@OmarManzoor
Contributor

Plus I think Tempita is used quite a lot now in the project so it seems like a reasonable standard.

@kyrajeep

kyrajeep commented Jun 27, 2024

Thank you for the update, @adam2392. @jjerphan, do you have a suggestion for where to start with Tempita and Cython? I did the short Cython tutorial linked above on my local MacBook M1. It was a simple .pyx script with a setup.py.

I have never used tempita before. Thank you 😊

@jjerphan
Member Author

Tempita is a micro pre-processor for Cython source files. We use it to work around the absence of template classes in Cython, so that we can handle both float32 and float64 while maintaining a single implementation.

I would recommend writing the float64 implementation first. Porting an implementation to float32 with Tempita is then straightforward (e.g. see this simple implementation).

Regarding getting started, I hope the instructions regarding Cython in the documentation of scikit-learn and in scikit-learn's tooling are clear and helpful enough.
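For reference, here is a minimal sketch of the Tempita pattern used in the .pyx.tp files (illustrative, not an excerpt from scikit-learn; the real files use scikit-learn's own typedefs):

{{py:
# dtype-specific values substituted by Tempita before Cython sees the file
implementation_specific_values = [
    ('64', 'double'),
    ('32', 'float'),
]
}}

{{for name_suffix, c_type in implementation_specific_values}}

cdef class Squarer{{name_suffix}}:
    # Toy example: one Cython extension type is generated per floating-point dtype.
    cdef {{c_type}} value

    def __init__(self, {{c_type}} value):
        self.value = value

    def square(self):
        return self.value * self.value

{{endfor}}

After Tempita expansion this yields two classes, Squarer64 and Squarer32, from a single template.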

@kyrajeep

Thanks! Yes, this is a great starting point.

@kyrajeep

kyrajeep commented Jul 2, 2024

I started to work on the issue but realized I did not know what 'precomputed' meant exactly. Does it mean that the user already knows the distance between the two matrices? If so, does that chunk of data bypass computing the distance altogether? Thank you.

@Micky774
Contributor

Micky774 commented Jul 3, 2024

I started to work on the issue but realized I did not know what 'precomputed' meant exactly. Does it mean that the user already knows the distance between the two matrices?

@kyrajeep Yes. In that sense, you have an array D where entry D[i, j] is the distance between the points X[i] and X[j].

If so, does that chunk of data bypass computing the distance altogether? Thank you.

Yes. The goal is to have something that "looks" like the other PairwiseDistanceReductions, but behaves in a very special way. That way, code that expects such an object can be kept the same, but benefits from not having to do any new computation. So, if we normally expect a PairwiseDistanceReduction to "look" like a metric, then we want our special precomputed version to have the same API and usage, but without computation since we already have the answers and just need to "look up" the answers and return them.
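For context, here is how a precomputed distance matrix is already consumed through the existing public API (a minimal sketch; the goal of this item is to provide the same behaviour through the PairwiseDistancesReductions back-end):

import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).standard_normal((100, 5))

# D[i, j] is the (already known) distance between X[i] and X[j].
D = pairwise_distances(X, metric="euclidean")

# With metric="precomputed", the estimator looks distances up in D instead of
# recomputing them from the raw features.
nn = NearestNeighbors(n_neighbors=3, metric="precomputed").fit(D)
distances, indices = nn.kneighbors(D)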

I'd strongly recommend starting with the work that @adam2392 has done in #26032 and taking over the PR (while preserving them as a co-author). This means checking out those changes, including them in your own new branch, and finishing whatever is left to be done, instead of starting from scratch.

@kyrajeep

kyrajeep commented Jul 5, 2024

Thank you, @Micky774 for the clarification. I ran 'gh pr checkout 26032' to have a local copy of the PR by @adam2392. As scikit-learn has other commits, how should I fetch/pull from the remote repo? I have tried this in the past with confusing results at times so I'd appreciate your input. Thank you.

@Micky774
Contributor

Micky774 commented Jul 5, 2024

scikit-learn has other commits, how should I fetch/pull from the remote repo? I have tried this in the past with confusing results at times so I'd appreciate your input.

In general you should be able to cleanly merge changes from the upstream main branch (except maybe a few simple conflicts here and there). So e.g.

git checkout <PR branch>
git merge main   # assumes your local main is in sync with the upstream scikit-learn main

if there are big conflicts, they can be analyzed individually

@OmarManzoor
Contributor

OmarManzoor commented Jul 22, 2024

Hi @jjerphan

I was experimenting with mst_linkage_core, and it seems that if we want to adapt a pairwise distances implementation for this function, we need to compute all the distances before handling the actual conditional code in the inner loop. This is because I don't think it is possible to share variables like current_node, new_distance and new_node between different threads.

On the other hand, the issue with computing the complete distance matrix is that it would require considerably more memory for larger data sizes, even if it allows us to speed up the code.

What do you think about this?
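For illustration, here is a simplified NumPy sketch of the Prim-style loop in question (schematic only, not the actual mst_linkage_core Cython code): each iteration depends on the node selected by the previous one, which is the shared state mentioned above.

import numpy as np

def mst_prim_sketch(X):
    n = X.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    d = np.full(n, np.inf)   # best known distance from each node to the growing tree
    current_node = 0
    edges = []
    for _ in range(n - 1):
        in_tree[current_node] = True
        # Distances from the newly added node to all others: this row could be
        # computed in parallel / by chunks ...
        row = np.linalg.norm(X - X[current_node], axis=1)
        d = np.where(in_tree, np.inf, np.minimum(d, row))
        # ... but selecting the next node is an argmin over the shared state,
        # which makes the outer loop inherently sequential.
        new_node = int(np.argmin(d))
        new_distance = float(d[new_node])
        edges.append((current_node, new_node, new_distance))
        current_node = new_node
    return edges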

@jjerphan
Member Author

Hi Omar,

I need to have a look at it in detail to be able to tell.

@OmarManzoor
Contributor

I need to have a look at it in detail to be able to tell.

Sure

@jjerphan
Member Author

I do not have time to have a look at it, unfortunately.

@OmarManzoor
Contributor

I do not have time to have a look at it, unfortunately.

No problem at all.

@kyrajeep

kyrajeep commented Nov 18, 2024

The tests have been added for #29483 and I believe the code is ready for a review! Sorry it took a while!

@kyrajeep

kyrajeep commented Dec 9, 2024

@jjerphan, the review comments for #29483 have been addressed!
