Skip to content

[MRG] Make OPTICS support sparse input #14736

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

huntzhan
Copy link
Contributor

Reference Issues/PRs

See #12028 (comment), #11982

What does this implement/fix? Explain your changes.

Make OPTICS.fit(self, X, y=None) support the sparse matrix X.

Any other comments?

@jnothman
Copy link
Member

jnothman commented Aug 23, 2019 via email

@huntzhan
Copy link
Contributor Author

Sure. BTW, simply adding accept_sparse won't work. I'm looking into it.

sklearn/tests/test_common.py Estimator OPTICS doesn't seem to fail gracefully on sparse data: it should raise a TypeError if sparse input is explicitly not supported.
F
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

estimator = OPTICS(algorithm='auto', cluster_method='xi', eps=None, leaf_size=30,
       max_eps=inf, metric='minkowski', metric_params=None,
       min_cluster_size=None, min_samples=5, n_jobs=None, p=2,
       predecessor_correction=True, xi=0.05)
check = <function check_estimator_sparse_data at 0x12d122158>

    @pytest.mark.parametrize(
            "estimator, check",
            _generate_checks_per_estimator(_yield_all_checks,
                                           _tested_estimators()),
            ids=_rename_partial
    )
    def test_estimators(estimator, check):
        # Common tests for estimator instances
        with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            set_checking_parameters(estimator)
            name = estimator.__class__.__name__
>           check(name, estimator)

sklearn/tests/test_common.py:100:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/utils/estimator_checks.py:539: in check_estimator_sparse_data
    estimator.fit(X, y)
sklearn/cluster/optics_.py:248: in fit
    max_eps=self.max_eps)
sklearn/cluster/optics_.py:487: in compute_optics_graph
    p=p, max_eps=max_eps)
sklearn/cluster/optics_.py:520: in _set_reach_dist
    dists = pairwise_distances(P, np.take(X, unproc, axis=0),
<__array_function__ internals>:6: in take
    ???
../../../../.pyenv/versions/3.6.9/envs/scikit-learn/lib/python3.6/site-packages/numpy/core/fromnumeric.py:194: in take
    return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode)
../../../../.pyenv/versions/3.6.9/envs/scikit-learn/lib/python3.6/site-packages/numpy/core/fromnumeric.py:58: in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

obj = <40x10 sparse matrix of type '<class 'numpy.float64'>'
	with 74 stored elements in Compressed Sparse Row format>, method = 'take'
args = (array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39]),)
kwds = {'axis': 0, 'mode': 'raise', 'out': None}, wrap = None

    def _wrapit(obj, method, *args, **kwds):
        try:
            wrap = obj.__array_wrap__
        except AttributeError:
            wrap = None
>       result = getattr(asarray(obj), method)(*args, **kwds)
E       IndexError: index 1 is out of bounds for axis 0 with size 1

@@ -517,7 +517,7 @@ def _set_reach_dist(core_distances_, reachability_, predecessor_,
# the same logic as neighbors, p is ignored if explicitly set
# in the dict params
_params['p'] = p
dists = pairwise_distances(P, np.take(X, unproc, axis=0),
dists = pairwise_distances(P, X[unproc],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.take doesn't support the sparse matrix

@@ -194,7 +194,7 @@ class OPTICS(BaseEstimator, ClusterMixin):
the Conference "Lernen, Wissen, Daten, Analysen" (LWDA) (2018): 318-329.
"""

def __init__(self, min_samples=5, max_eps=np.inf, metric='minkowski', p=2,
def __init__(self, min_samples=5, max_eps=np.inf, metric='euclidean', p=2,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this won't work if the user changes p. So I'd rather reinterpret the metric as euclidean below when using the metric, rather than changing the default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you able to add commits to resolve it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I will resolve this issue and add test cases this week.

@@ -19,6 +19,7 @@
from ..neighbors import NearestNeighbors
from ..base import BaseEstimator, ClusterMixin
from ..metrics import pairwise_distances
from ..metrics.pairwise import PAIRWISE_DISTANCE_FUNCTIONS
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import directly without exposing as an interface.

@huntzhan huntzhan changed the title [WIP] Make OPTICS support sparse input [MRG] Make OPTICS support sparse input Aug 26, 2019
@@ -233,7 +236,12 @@ def fit(self, X, y=None):
self : instance of OPTICS
The instance.
"""
X = check_array(X, dtype=np.float)
# TODO: Support the sparse input for metric = 'precopmuted'.
if self.metric != 'precomputed' \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will have a separated PR for this.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a complete review yet, but looks good.

@@ -222,7 +223,9 @@ def fit(self, X, y=None):
Parameters
----------
X : array, shape (n_samples, n_features), or (n_samples, n_samples) \
if metric=’precomputed’.
if metric=’precomputed’, or sparse matrix (n_samples, n_features) if metric
in ['cityblock', 'cosine', 'euclidean', 'haversine', 'l2', 'l1',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep the first line of the X's description short and as usual syntax, and put the information in the comments of the parameter.

# TODO: Support the sparse input for metric = 'precopmuted'.
if self.metric != 'precomputed' \
and self.metric in PAIRWISE_DISTANCE_FUNCTIONS:
X = check_array(X, accept_sparse='csr')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and dtype=np.float

clust = OPTICS(min_samples=3, min_cluster_size=2,
max_eps=20, cluster_method='xi',
xi=0.4, metric='euclidean').fit(sparse.lil_matrix(X))
assert_array_equal(clust.labels_, expected_labels)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can parameterize the test for these ones and refactor them a bit

@bittremieux
Copy link

It would be nice to get OPTICS to work with sparse precomputed distance matrices. The OPTICS documentation says:

Better suited for usage on large datasets than the current sklearn implementation of DBSCAN.

But if your distance matrix doesn't fit in memory as a dense matrix you can't even use OPTICS, whereas that nicely works for DBSCAN.

@yskybloue
Copy link
Contributor

I see this is stalled for quite a while; I'm continuing work on this on #20802

@cmarmo cmarmo added Superseded PR has been replace by a newer PR and removed Stalled help wanted labels Mar 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:cluster Superseded PR has been replace by a newer PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants