ENH add pairwise distances backend for LocalOutlierFactor #26316
Conversation
Hey there @OmarManzoor, always wonderful seeing a PR from you :)

I know you are still early in the process of finalizing this PR and it is only marked as a draft, but I figured some (potentially unsolicited) advice may help smooth out the road in front of you.

I also wanted to go ahead and mention that this PR will require some thorough benchmarking just to confirm that there is indeed a performance gain, and to potentially evaluate what the best deployment strategies are (i.e. `parallel_on_{X, Y}`), since the asymmetry of the algorithm may require us to adopt a different heuristic than the default for `ArgKmin`.

Please do not hesitate to ping if you have any questions or concerns. Thanks!
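For context, here is a minimal, hedged sketch of what the two strategies mean for a brute-force argkmin (illustrative names only, not scikit-learn's actual Cython implementation): with `parallel_on_X` each worker owns a chunk of the query rows and results are simply stacked, whereas `parallel_on_Y` would chunk the candidate rows and require merging per-chunk results across workers.

```python
import numpy as np
from joblib import Parallel, delayed


def _argkmin_chunk(X_chunk, Y, k):
    # Brute force: full distance matrix for one chunk of query rows, then the
    # indices of the k smallest distances per row.
    d = np.linalg.norm(X_chunk[:, None, :] - Y[None, :, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]


def argkmin_parallel_on_x(X, Y, k, n_jobs=2, chunk_size=256):
    # "parallel_on_X": each worker handles a slice of X independently, so the
    # results can be stacked with no cross-worker merge step.
    starts = range(0, X.shape[0], chunk_size)
    out = Parallel(n_jobs=n_jobs)(
        delayed(_argkmin_chunk)(X[i:i + chunk_size], Y, k) for i in starts
    )
    return np.vstack(out)


X = np.random.RandomState(0).rand(1000, 10)
print(argkmin_parallel_on_x(X, X, k=5).shape)  # (1000, 5)
```

As far as I know, the default heuristic prefers parallelizing on X when there are enough query rows to keep all threads busy, which is the assumption an asymmetric reduction like this one may need to revisit.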
Looking good so far. Aside from the missing test, please ping me once you have some benchmark results ready :)
```python
raise ValueError(
    "Only float64 or float32 datasets pairs are supported at this time, "
    f"got: X.dtype={X.dtype} and Y.dtype={Y.dtype}."
)
```
Need to add a test for this
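For instance, a self-contained sketch of such a test; the dtype check is inlined here as a plain function because the real check lives in the Cython backend, so the function name is illustrative only:

```python
import numpy as np
import pytest


def _check_dtypes(X, Y):
    # Mirrors the validation above: both arrays must share a float32 or
    # float64 dtype.
    if X.dtype != Y.dtype or X.dtype not in (np.float32, np.float64):
        raise ValueError(
            "Only float64 or float32 datasets pairs are supported at this "
            f"time, got: X.dtype={X.dtype} and Y.dtype={Y.dtype}."
        )


def test_unsupported_dtype_pair():
    rng = np.random.RandomState(0)
    X = rng.rand(5, 2).astype(np.float32)
    Y = rng.rand(5, 2).astype(np.float64)
    with pytest.raises(ValueError, match="Only float64 or float32"):
        _check_dtypes(X, Y)
```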
@OmarManzoor: you can use asv to easily perform benchmarks between those revisions. You can take some inspiration from the existing ASV benchmark definitions and adapt them for this use case.

Right, thank you.

@jjerphan Could you kindly have a look and check whether this benchmark (argkmin_lrd_benchmark) seems correct?

I think this is going to take a lot of time to run.
@jjerphan @Micky774 These are the results that I got so far.

I don't think this looks good. The performance seems to have decreased.
Can you try profiling (e.g. with Linux `perf`)?
Thank you! |
@Micky774 I tried using cProfile and these are the results for the argkmin_lrd.pyx file.

Code:

```python
import cProfile

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

n_train = 1000
n_features = 100
rng = np.random.RandomState(0)
X_train = rng.rand(n_train, n_features).astype(np.float32)

cProfile.run("LocalOutlierFactor(n_neighbors=10).fit(X=X_train)")
```

```
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.000    0.000    0.000    0.000  _argkmin_lrd.pyx:134(_finalize_results)
     1    0.031    0.031    0.074    0.074  _argkmin_lrd.pyx:34(compute)
     1    0.000    0.000    0.015    0.015  _argkmin_lrd.pyx:64(__init__)
```

It seems like the compute method is taking the actual time.
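As a side note, a sketch of standard `cProfile`/`pstats` usage (nothing PR-specific) that saves the profile to disk and restricts the report to the backend's functions, sorted by cumulative time:

```python
import cProfile
import pstats

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.rand(1000, 100).astype(np.float32)

# Write the profile to a file instead of printing it, then filter the report
# to functions whose name matches "argkmin_lrd" and sort by cumulative time.
cProfile.run("LocalOutlierFactor(n_neighbors=10).fit(X=X_train)", "lof.prof")
pstats.Stats("lof.prof").sort_stats("cumulative").print_stats("argkmin_lrd")
```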
Python and native code can be profiled with py-spy.

I would recommend setting thread affinity to only use one core. On GNU/Linux, this can be done using taskset.

Profiles' files can be uploaded here so that they can be inspected by readers as well.
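For reference, a minimal sketch of doing the pinning from within Python instead of via taskset; `os.sched_setaffinity` is standard library but Linux-only, and the workload below is purely illustrative:

```python
import os

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Restrict the current process (pid 0) and all of its threads to CPU 0, so
# the profile is not confounded by parallelism across cores.
os.sched_setaffinity(0, {0})

rng = np.random.RandomState(0)
X_train = rng.rand(10_000, 100).astype(np.float32)
LocalOutlierFactor(n_neighbors=10).fit(X=X_train)  # run under py-spy/perf
```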
I think I realized what the issue is with the significant drop. The metric was being set to euclidean, which has a specialized implementation when we use kneighbors.
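A quick way to sanity-check this point (timings are illustrative and machine-dependent): "euclidean" dispatches to a specialized fast path in the kneighbors backend, while "manhattan" goes through the generic distance computation, so only the latter isolates the code this PR changes.

```python
from time import perf_counter

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = rng.rand(20_000, 100).astype(np.float32)

# Expect the euclidean fit to be noticeably faster than the manhattan one,
# reflecting the specialized implementation rather than this PR's backend.
for metric in ("euclidean", "manhattan"):
    tic = perf_counter()
    LocalOutlierFactor(n_neighbors=10, metric=metric).fit(X)
    print(f"{metric}: {perf_counter() - tic:.2f}s")
```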
Cool!
These are the latest benchmarks with float32.

Benchmark code:

```python
import numpy as np

from .common import Benchmark
from sklearn.neighbors import LocalOutlierFactor


class ArgKminLRDBenchmark(Benchmark):
    param_names = ["n_train", "n_features"]
    params = [
        [1000, 10000, 100000],
        [100],
    ]

    def setup(self, n_train, n_features):
        rng = np.random.RandomState(0)
        self.X_train = rng.rand(n_train, n_features).astype(np.float32)
        self.y_train = rng.randint(low=-1, high=1, size=(n_train,))

    def time_fit(self, n_train, n_features):
        est = LocalOutlierFactor(n_neighbors=10, metric="manhattan").fit(
            X=self.X_train
        )
        self.estimator_ = est
```

PR - time_fit

```
========= ============
--         n_features
--------- ------------
 n_train      100
========= ============
  1000     16.9±0.9ms
  10000     926±20ms
 100000    1.54±0.2m
========= ============
```

MAIN - time_fit

```
========= ============
--         n_features
--------- ------------
 n_train      100
========= ============
  1000     16.5±0.7ms
  10000     926±9ms
 100000    1.49±0.01m
========= ============
```
Benchmarks with float64.

Benchmark code:

```python
import numpy as np

from .common import Benchmark
from sklearn.neighbors import LocalOutlierFactor


class ArgKminLRDBenchmark(Benchmark):
    param_names = ["n_train", "n_features"]
    params = [
        [1000, 10000, 100000],
        [100],
    ]

    def setup(self, n_train, n_features):
        rng = np.random.RandomState(0)
        self.X_train = rng.rand(n_train, n_features).astype(np.float64)
        self.y_train = rng.randint(low=-1, high=1, size=(n_train,))

    def time_fit(self, n_train, n_features):
        LocalOutlierFactor(n_neighbors=10, metric="manhattan").fit(X=self.X_train)
```

PR - time_fit

```
========= ============
--         n_features
--------- ------------
 n_train      100
========= ============
  1000     18.6±0.1ms
  10000    1.11±0.02s
 100000    1.77±0m
========= ============
```

MAIN - time_fit

```
========= ============
--         n_features
--------- ------------
 n_train      100
========= ============
  1000     18.7±0.4ms
  10000    1.11±0.02s
 100000    1.80±0.01m
========= ============
```
It looks like no specific improvement is observed.
Even with the number of neighbors set to 1000, no improvement was noted.

Benchmark code:

```python
import numpy as np

from .common import Benchmark
from sklearn.neighbors import LocalOutlierFactor


class ArgKminLRDBenchmark(Benchmark):
    param_names = ["n_train", "n_features"]
    params = [
        [100000],
        [100],
    ]

    def setup(self, n_train, n_features):
        rng = np.random.RandomState(0)
        self.X_train = rng.rand(n_train, n_features).astype(np.float32)

    def time_fit(self, n_train, n_features):
        LocalOutlierFactor(n_neighbors=1000, metric="manhattan").fit(X=self.X_train)
```

```
[  0.00%] · For scikit-learn commit d15e8c09 <pairwise_distance_backend_for_lof> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-pandas-scipy-threadpoolctl
[ 50.00%] ··· argkmin_lrd.ArgKminLRDBenchmark.time_fit   ok
[ 50.00%] ··· ========= ============
              --         n_features
              --------- ------------
               n_train      100
              ========= ============
               100000    1.68±0.01m
              ========= ============

[ 50.00%] · For scikit-learn commit 55af30d9 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-pandas-scipy-threadpoolctl..
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-pandas-scipy-threadpoolctl
[100.00%] ··· argkmin_lrd.ArgKminLRDBenchmark.time_fit   ok
[100.00%] ··· ========= =========
              --         n_features
              --------- ---------
               n_train     100
              ========= =========
               100000    1.68±0m
              ========= =========

BENCHMARKS NOT SIGNIFICANTLY CHANGED.
```
After inspecting with py-spy: @jjerphan @OmarManzoor please do sanity-check me on this, but as far as I know this is just ill-suited for further optimization via a dedicated backend. This is good information though, since it allows us to focus efforts in other areas -- and it probably should also set the precedent of confirming bottlenecks with an initial profiling before the work is done, rather than after :) Once you two are satisfied with this explanation, we can close this and move on.
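One rough, hedged way to see this from the outside (timings illustrative; using the manhattan metric as in the benchmarks above): time the bare kneighbors query against the full LOF fit. The LRD/LOF post-processing that a fused backend could absorb is only the difference between the two, which leaves little headroom.

```python
from time import perf_counter

import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(20_000, 100).astype(np.float32)

# Bare k-nearest-neighbors query, the dominant cost of LOF's fit.
tic = perf_counter()
NearestNeighbors(n_neighbors=10, metric="manhattan").fit(X).kneighbors(X)
t_knn = perf_counter() - tic

# Full LOF fit, i.e. roughly the same query plus the LRD/LOF post-processing.
tic = perf_counter()
LocalOutlierFactor(n_neighbors=10, metric="manhattan").fit(X)
t_lof = perf_counter() - tic

print(f"kneighbors: {t_knn:.1f}s, full LOF fit: {t_lof:.1f}s")
```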
Thank you for inspecting with py-spy. I think what you describe makes sense and reinforces the observed benchmarks.

@jjerphan I think we can close this PR then?
Closing since no performance improvement can be reached. Sorry, @OmarManzoor. 🤦‍♂️ I wish I had provided more certain evidence about the room for performance improvement beforehand, for instance with a profile of the algorithms' current execution.

I think I will reword the issue description in this regard.
No, no, there is no need to apologize. It was a good learning experience. We can try out
❤️
I have updated #25888 to reflect the other ongoing/open work for finalizing the optimization of
Reference Issues/PRs
Towards #25888

What does this implement/fix? Explain your changes.
Adds a dedicated pairwise distances backend for LocalOutlierFactor.

Any other comments?
CC: @jjerphan Could you kindly have a look and see how this looks?