-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] Make OPTICS support sparse input #14736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Please add a test
…On Fri., 23 Aug. 2019, 2:01 pm Hunt Zhan, ***@***.***> wrote:
Reference Issues/PRs
See #12028 (comment)
<#12028 (comment)>,
#11982 <#11982>
What does this implement/fix? Explain your changes.
Make OPTICS.fit(self, X, y=None) support the sparse matrix X.
Any other comments?
------------------------------
You can view, comment on, or merge this pull request online at:
#14736
Commit Summary
- Try csr support.
File Changes
- *M* sklearn/cluster/optics_.py
<https://github.com/scikit-learn/scikit-learn/pull/14736/files#diff-0>
(2)
Patch Links:
- https://github.com/scikit-learn/scikit-learn/pull/14736.patch
- https://github.com/scikit-learn/scikit-learn/pull/14736.diff
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#14736?email_source=notifications&email_token=AAATH23GHNSYA2K22NK3TFLQF5OKVA5CNFSM4IO3NUUKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HG6DQ4Q>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAATH2YURNDHD2PGE4J5UQ3QF5OKVANCNFSM4IO3NUUA>
.
|
Sure. BTW, simply adding accept_sparse won't work. I'm looking into it.
|
@@ -517,7 +517,7 @@ def _set_reach_dist(core_distances_, reachability_, predecessor_, | |||
# the same logic as neighbors, p is ignored if explicitly set | |||
# in the dict params | |||
_params['p'] = p | |||
dists = pairwise_distances(P, np.take(X, unproc, axis=0), | |||
dists = pairwise_distances(P, X[unproc], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.take
doesn't support the sparse matrix
sklearn/cluster/optics_.py
Outdated
@@ -194,7 +194,7 @@ class OPTICS(BaseEstimator, ClusterMixin): | |||
the Conference "Lernen, Wissen, Daten, Analysen" (LWDA) (2018): 318-329. | |||
""" | |||
|
|||
def __init__(self, min_samples=5, max_eps=np.inf, metric='minkowski', p=2, | |||
def __init__(self, min_samples=5, max_eps=np.inf, metric='euclidean', p=2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default metric minkowski
doesn't accept the sparse matrix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this won't work if the user changes p
. So I'd rather reinterpret the metric as euclidean below when using the metric, rather than changing the default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you able to add commits to resolve it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I will resolve this issue and add test cases this week.
@@ -19,6 +19,7 @@ | |||
from ..neighbors import NearestNeighbors | |||
from ..base import BaseEstimator, ClusterMixin | |||
from ..metrics import pairwise_distances | |||
from ..metrics.pairwise import PAIRWISE_DISTANCE_FUNCTIONS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import directly without exposing as an interface.
@@ -233,7 +236,12 @@ def fit(self, X, y=None): | |||
self : instance of OPTICS | |||
The instance. | |||
""" | |||
X = check_array(X, dtype=np.float) | |||
# TODO: Support the sparse input for metric = 'precopmuted'. | |||
if self.metric != 'precomputed' \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will have a separated PR for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a complete review yet, but looks good.
@@ -222,7 +223,9 @@ def fit(self, X, y=None): | |||
Parameters | |||
---------- | |||
X : array, shape (n_samples, n_features), or (n_samples, n_samples) \ | |||
if metric=’precomputed’. | |||
if metric=’precomputed’, or sparse matrix (n_samples, n_features) if metric | |||
in ['cityblock', 'cosine', 'euclidean', 'haversine', 'l2', 'l1', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep the first line of the X
's description short and as usual syntax, and put the information in the comments of the parameter.
# TODO: Support the sparse input for metric = 'precopmuted'. | ||
if self.metric != 'precomputed' \ | ||
and self.metric in PAIRWISE_DISTANCE_FUNCTIONS: | ||
X = check_array(X, accept_sparse='csr') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and dtype=np.float
clust = OPTICS(min_samples=3, min_cluster_size=2, | ||
max_eps=20, cluster_method='xi', | ||
xi=0.4, metric='euclidean').fit(sparse.lil_matrix(X)) | ||
assert_array_equal(clust.labels_, expected_labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can parameterize the test for these ones and refactor them a bit
It would be nice to get OPTICS to work with sparse precomputed distance matrices. The OPTICS documentation says:
But if your distance matrix doesn't fit in memory as a dense matrix you can't even use OPTICS, whereas that nicely works for DBSCAN. |
I see this is stalled for quite a while; I'm continuing work on this on #20802 |
Reference Issues/PRs
See #12028 (comment), #11982
What does this implement/fix? Explain your changes.
Make
OPTICS.fit(self, X, y=None)
support the sparse matrixX
.Any other comments?