-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
PERF PairwiseDistancesReductions
subsequent work
#25888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@OmarManzoor: you also might be interested in items of this issue. 🙂 |
Nice. Thanks! |
Looking into this... |
I started working on this item with jjerphan@78b3aa4. The logic of |
@jjerphan Would the remaining items in the Implement back-end for other Estimators or Transformers require significant amount of work? |
Basically, some of the interfaces whose performance have been improved in 1.1 with Hence, I think this requires a significant amount of work, yes. Edit: For now, I have added a link to a presentation explaining the design of those implementation in the description of this issue. I think I will write about the design of it (cc @adam2392). |
Thank you for sharing the presentation. |
You are welcome. |
This looks interesting but I might need time to understand it properly. |
This presentation is a bit out of date. You might also want to have a look at the elements of documentation present here:
|
I think I'll have a look at this after wrapping up the binary tree PR since I have already started it. What do you think? |
Yes, let's probably finish #25914 first. |
@jjerphan I have started looking at LocalOutlierFactor. The method kneighbors is being used within fit and score_samples. Do we need to try out a dedicated backend for both these cases? |
Yes. Similarly to |
@jjerphan At this point we actually need the distances of all the points for extracting the distances of neighbors. scikit-learn/sklearn/neighbors/_lof.py Lines 506 to 507 in c3bfe86
Where would this be available in pairwise distances? |
I do not think we need to compute and store the distance matrices. The new back-end would change the implementation so that This might be a lot as a rework and I have not been able to have a look at it yet, but I think this is feasible. Let me know if this is clear or not. |
I would run some profiling ahead of implementations, yes. |
Could you kindly guide me how to run such a profiling? You mentioned that we can use py-spy but should I create a python script to run the classifier's desired methods under py-spy? |
Yes, basically. Once you create a python script whose main time bottleneck is the code section you want to profile (ideally the majority of time is taken up by In particular you'll want to run something like This ensures that native C/Cython functions are included in the profile, and will output a speedscope compatible profile report. Note that that's not the only output format you can use, but it is especially convenient for general profiling. |
I thought supporting "precomputed" distances would be a good task to start with, but after reading the discussion not sure if that is something I can do. Does that part have to be reorganized? Perhaps I should start with a different estimator? |
Hello @jjerphan @OmarManzoor @ogrisel, I have some questions regarding sklearn/neighbors. It seems like _kd_tree.pyx.tp and _ball_tree.pyx.tp are generated from _binary_tree.pxi.tp. Is that the case? For _graph.py in neighbors, I do not see a .pyx file and so I created one and wrote a little bit of code to get some clarity. There is much communication to catch up on this topic that confuses me sometimes. :) I'd like to keep working on _graph.pyx What was your and others workflow for _kd_tree.pyx.tp and _binary_tree.pxi.tp? Thanks so much! |
If we want to do it correctly, it does require a really good amount of work for a first contribution (we need to adapt part of the class hierarchy).
The later is included in the two firsts after tempita expension. |
If would recommend starting with using Cython in Jupyter notebook. |
Great, thanks for letting me know. I will take another look and reach out to the previous contributor. That looks like a good place to start! |
I have not got a reply about the status of the PR #26032 yet. How should I proceed? Should I wait longer? |
Responses in the open-source projects sometimes take weeks. You can start a draft PR, mentioning it there. :) |
Hi feel free to move the work along @kyrajeep. I never got around to figuring out how the tempita cython code works. Found it harder than reading C++ 😝. Hope to revisit this issue though to contribute down the line. |
Some maintainers share this perspective, some do not. I think there would be benefits to using something else than Tempita, but that would take some time and mental bandwidth (which I don't have personally). |
Plus I think Tempita is used quite a lot now in the project so it seems like a reasonable standard. |
Tempita is a micro pre-processor for Cython. We use Tempita to workaround the absence of template class in Cython so that we can handle I would recommend writing implementation for Regarding getting started, I hope the instructions regarding Cython in the documentation of scikit-learn and scikit-learn's tooling are clear and helpful enough. |
Thanks! Yes, this is a great start point. |
I started to work on the issue but realized I did not know what 'precomputed' meant exactly. Does it mean that the user already knows the distance between the two matrices? If so, does that chunk of data bypass computing the distance altogether? Thank you. |
@kyrajeep Yes. In that sense, you have an array
Yes. The goal is to have something that "looks" like the other I'd strongly recommend starting with the work that @adam2392 has done in #26032 and taking over the PR (while preserving them as a co-author). This means checking out those changes and including them in your own, new branch, and finishing whatever there is left to be done, instead of starting from scratch. |
Thank you, @Micky774 for the clarification. I ran 'gh pr checkout 26032' to have a local copy of the PR by @adam2392. As scikit-learn has other commits, how should I fetch/pull from the remote repo? I have tried this in the past with confusing results at times so I'd appreciate your input. Thank you. |
In general you should be able to cleanly merge changes from the upstream main branch (except maybe a few simple conflicts here and there). So e.g.
if there are big conflicts, they can be analyzed individually |
Hi @jjerphan I was experimenting with mst_linkage_core On the other hand the issue with computing the complete distance matrix is that it would require considerably higher memory for larger data sizes, even if it allows us to speed up the code. What do you think regarding this? |
Hi Omar, I need to have a look at it in detail to be able to tell. |
Sure |
I do not have time to have a look at it, unfortunately. |
No problem at all. |
The tests have been added for #29483 and I believe the code is ready for a review! Sorry it took a while! |
PairwiseDistancesReductions
have been introduced as a hierarchy of Cython classes to implement back-ends of some scikit-learn algorithms.Initial work was listed in #22587.
💡 See this presentation for a better understanding of the design of
PairwiseDistancesReductions
.💡 Before working on improving performance, one must profile the current execution of algorithms to see if there are any substantial benefits.
Subsequent possible work includes:
PairwiseDistances
, a generic back-end forpairwise_distances
#26983PairwiseDistancesReductions
: support for BooleanDistanceMetrics
via stable simultaneous sort #25097"precomputed"
distancesEstimators
orTransformers
, in particular, introduce a specialised back-end for which would remove the costly sequential portion around the current call tokneighbors
orradius_neighbors
for (non-complete list):mst_from_data_matrix
) see DOC Update_hdbscan/_linkage.pyx
with new inline comments #25656 (review)mst_linkage_core
)LocalOutlierFactor
(no further optimization possible)KNeighbors*.predict*
PairwiseDistancesReduction
backend forKNeighbors.predict_proba
#24076 by @Micky774"distance"
weightingRadiusNeighbors*.predict*
PairwiseDistancesReduction
backend forRadiusNeighbors.predict_proba
#26828"distance"
weightingKMeans
PairwiseDistancesReductions
for F-contiguous arraysparallel_on_{X|Y}
instead of releasing it in those methodsmimalloc
to have proper memory allocation for multi-threaded implementations:malloc(1)
might come with unexpected changes and might break the whole ecosystem. Also maintaining it might be costly.assert_argkmin_results_quasi_equality
error message #27281 (comment)PairwiseDistancesReductions
?pairwise_kernels
-based interfacesThe text was updated successfully, but these errors were encountered: