-
-
Notifications
You must be signed in to change notification settings - Fork 26k
ENH Add memory=joblib.Memory param to OPTICS #19024
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Please report benchmarks just to show it has the desired effect.
I've added a benchmark now but I'm having trouble running it.
But this works:
Any tips? |
I wasn't necessarily expecting benchmarks to be committed here, but at least to be presented. Thanks. I'd need to look into the asv issues with time I don't have right now! |
Quick benchmark adapted from the asv example, from sklearn.datasets import make_blobs
from sklearn.cluster import OPTICS
from sklearn.model_selection import GridSearchCV
X, y = make_blobs(n_samples=500)
optics = OPTICS(
metric="euclidean",
cluster_method="dbscan",
min_samples=10,
) Then, In [20]: %timeit GridSearchCV(optics, {"eps": [0.1, 0.2, 0.3, 0.4, 0.5]}, scoring=lambda x, y: 1.0).fit(X)
3.11 s ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [21]: %timeit GridSearchCV(optics, {"eps": [0.1, 0.2, 0.3, 0.4, 0.5], "memory": ['/tmp/cache2']}, scoring=lambda x, y: 1.0).fit(X)
22.1 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 1 loop each) So this works as expected. I removed the asv benchmark as I don't think it would provide useful information for the future. All we want to know is that caching works as expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, otherwise. Thanks @frankier !
Nice! |
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
Reference Issues/PRs
Addresses, but doesn't not entirely fix #12044
What does this implement/fix? Explain your changes.
This adds a memory parameter to OPTICS. The justification for this is same as for hierarchical clustering. If we want to cut out dendrogram at different points, then we can reuse an intermediate result. The same is true for DBSCAN.
Practically speaking this avoids recomputation of the most expensive step of OPTICS during grid search as long as it is performed in serial.
As discussed in #12044 it would be good to eventually implement
warm_start(...)
and then users could run that in parallel as an initial step, before running grid search. A further embellishment would be grid searchers that understood the expensive and non expensive step computed things in the right order.However, that issue has been open for a while now, and this at least solves the serial grid search use-case.
Any other comments?
Let me know if anything is missing.