Skip to content

ENH Add memory=joblib.Memory param to OPTICS #19024

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Aug 6, 2021

Conversation

frankier
Copy link
Contributor

Reference Issues/PRs

Addresses, but doesn't not entirely fix #12044

What does this implement/fix? Explain your changes.

This adds a memory parameter to OPTICS. The justification for this is same as for hierarchical clustering. If we want to cut out dendrogram at different points, then we can reuse an intermediate result. The same is true for DBSCAN.

Practically speaking this avoids recomputation of the most expensive step of OPTICS during grid search as long as it is performed in serial.

As discussed in #12044 it would be good to eventually implement warm_start(...) and then users could run that in parallel as an initial step, before running grid search. A further embellishment would be grid searchers that understood the expensive and non expensive step computed things in the right order.

However, that issue has been open for a while now, and this at least solves the serial grid search use-case.

Any other comments?

Let me know if anything is missing.

Base automatically changed from master to main January 22, 2021 10:53
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Please report benchmarks just to show it has the desired effect.

@frankier
Copy link
Contributor Author

I've added a benchmark now but I'm having trouble running it.

$ asv run -E virtualenv -b OPTICSGridSearch HEAD

· Creating environments
· Discovering benchmarks
· No benchmarks selected

But this works:

$ asv run -E virtualenv -b LogisticRegression HEAD

Any tips?

@jnothman
Copy link
Member

I wasn't necessarily expecting benchmarks to be committed here, but at least to be presented. Thanks. I'd need to look into the asv issues with time I don't have right now!

@rth
Copy link
Member

rth commented Aug 6, 2021

Quick benchmark adapted from the asv example,

from sklearn.datasets import make_blobs
from sklearn.cluster import OPTICS
from sklearn.model_selection import GridSearchCV

X, y = make_blobs(n_samples=500)
optics = OPTICS(
    metric="euclidean",
    cluster_method="dbscan",
    min_samples=10,
)

Then,

In [20]: %timeit GridSearchCV(optics, {"eps": [0.1, 0.2, 0.3, 0.4, 0.5]}, scoring=lambda x, y: 1.0).fit(X)
3.11 s ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [21]: %timeit GridSearchCV(optics, {"eps": [0.1, 0.2, 0.3, 0.4, 0.5], "memory": ['/tmp/cache2']}, scoring=lambda x, y: 1.0).fit(X)
22.1 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

So this works as expected. I removed the asv benchmark as I don't think it would provide useful information for the future. All we want to know is that caching works as expected.

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, otherwise. Thanks @frankier !

@rth rth changed the title Add memory=joblib.Memory param to OPTICS ENH Add memory=joblib.Memory param to OPTICS Aug 6, 2021
@rth rth merged commit ec941a8 into scikit-learn:main Aug 6, 2021
@adrinjalali
Copy link
Member

Nice!

samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API for getting DBSCAN-like clusterings out of OPTICS with fit_predict
4 participants