Skip to content

Commit 43a8770

Browse files
committed
ENH Fix processing order in OPTICS. See #12090
The current code may expand the wrong point, because it only considers the neighbors of the current point, not all currently unprocessed points (which may have a better reachability). Using the distance from the latest point as tiebreaker then does not work anymore, because it might not yet be computed for all unprocessed points when using an index. If we choose the first on ties, we get the same result same as R and ELKI. But the order of points in "unproc" is also unstable, so we cannot rely on the first smallest to have the smallest index. Instead of the cython quick_scan, we now use numpy masked arrays.
1 parent dd95eba commit 43a8770

File tree

5 files changed

+184
-282
lines changed

5 files changed

+184
-282
lines changed

doc/whats_new/v0.21.rst

+2-2
Original file line numberDiff line numberDiff line change
@@ -45,8 +45,8 @@ Support for Python 3.4 and below has been officially dropped.
4545

4646
- |MajorFeature| A new clustering algorithm: :class:`cluster.OPTICS`: an
4747
algoritm related to :class:`cluster.DBSCAN`, that has hyperparameters easier
48-
to set and that scales better, by :user:`Shane <espg>` and
49-
:user:`Adrin Jalali <adrinjalali>`.
48+
to set and that scales better, by :user:`Shane <espg>`,
49+
:user:`Adrin Jalali <adrinjalali>`, and :user:`Erich Schubert <kno10>`.
5050

5151
:mod:`sklearn.preprocessing`
5252
............................

sklearn/cluster/_optics_inner.pyx

-38
This file was deleted.

sklearn/cluster/optics_.py

+19-6
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,6 @@
2020
from ..neighbors import NearestNeighbors
2121
from ..base import BaseEstimator, ClusterMixin
2222
from ..metrics import pairwise_distances
23-
from ._optics_inner import quick_scan
2423

2524

2625
def optics(X, min_samples=5, max_eps=np.inf, metric='euclidean',
@@ -37,6 +36,13 @@ def optics(X, min_samples=5, max_eps=np.inf, metric='euclidean',
3736
neighborhood radius. Better suited for usage on large point datasets than
3837
the current sklearn implementation of DBSCAN.
3938
39+
This implementation deviates from the original OPTICS by first performing
40+
k-nearest-neighborhood searches on all points to identify core sizes, then
41+
computing only the distances to unprocessed points when constructing the
42+
cluster order. It also does not employ a heap to manage the expansion
43+
candiates, but rather uses numpy masked arrays. This can be potentially
44+
slower with some parameters (at the benefit from using fast numpy code).
45+
4046
Read more in the :ref:`User Guide <optics>`.
4147
4248
Parameters
@@ -190,6 +196,11 @@ class OPTICS(BaseEstimator, ClusterMixin):
190196
neighborhood radius. Better suited for usage on large point datasets than
191197
the current sklearn implementation of DBSCAN.
192198
199+
This implementation deviates from the original OPTICS by first performing
200+
k-nearest-neighborhood searches on all points to identify core sizes, then
201+
computing only the distances to unprocessed points when constructing the
202+
cluster order.
203+
193204
Read more in the :ref:`User Guide <optics>`.
194205
195206
Parameters
@@ -313,15 +324,15 @@ class OPTICS(BaseEstimator, ClusterMixin):
313324
``clust.reachability_[clust.ordering_]`` to access in cluster order.
314325
315326
ordering_ : array, shape (n_samples,)
316-
The cluster ordered list of sample indices
327+
The cluster ordered list of sample indices.
317328
318329
core_distances_ : array, shape (n_samples,)
319330
Distance at which each sample becomes a core point, indexed by object
320331
order. Points which will never be core have a distance of inf. Use
321332
``clust.core_distances_[clust.ordering_]`` to access in cluster order.
322333
323334
predecessor_ : array, shape (n_samples,)
324-
Point that a sample was reached from.
335+
Point that a sample was reached from, indexed by object order.
325336
Seed points have a predecessor of -1.
326337
327338
See also
@@ -516,9 +527,11 @@ def _set_reach_dist(self, point_index, processed, X, nbrs):
516527
self.reachability_[unproc[improved]] = rdists[improved]
517528
self.predecessor_[unproc[improved]] = point_index
518529

519-
# Define return order based on reachability distance
520-
return (unproc[quick_scan(np.take(self.reachability_, unproc),
521-
dists)])
530+
# Choose next based on smallest reachability distance
531+
# (And prefer smaller ids on ties).
532+
# All unprocessed points qualify, not just new neighbors ("unproc")
533+
return (np.ma.array(self.reachability_, mask=processed)
534+
.argmin(fill_value=np.inf))
522535

523536
def extract_dbscan(self, eps):
524537
"""Performs DBSCAN extraction for an arbitrary epsilon.

sklearn/cluster/setup.py

-4
Original file line numberDiff line numberDiff line change
@@ -23,10 +23,6 @@ def configuration(parent_package='', top_path=None):
2323
sources=['_dbscan_inner.pyx'],
2424
include_dirs=[numpy.get_include()],
2525
language="c++")
26-
config.add_extension('_optics_inner',
27-
sources=['_optics_inner.pyx'],
28-
include_dirs=[numpy.get_include()],
29-
libraries=libraries)
3026

3127
config.add_extension('_hierarchical',
3228
sources=['_hierarchical.pyx'],

0 commit comments

Comments
 (0)