Skip to content

MNT remove optics from clustering.rst #13381

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 4, 2019
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
111 changes: 0 additions & 111 deletions doc/modules/clustering.rst
Original file line number Diff line number Diff line change
Expand Up @@ -91,12 +91,6 @@ Overview of clustering methods
- Non-flat geometry, uneven cluster sizes
- Distances between nearest points

* - :ref:`OPTICS <optics>`
- minimum cluster membership
- Very large ``n_samples``, large ``n_clusters``
- Non-flat geometry, uneven cluster sizes, variable cluster density
- Distances between points

* - :ref:`Gaussian mixtures <mixture>`
- many
- Not scalable
Expand Down Expand Up @@ -802,11 +796,6 @@ by black points below.
be used (e.g. with sparse matrices). This matrix will consume n^2 floats.
A couple of mechanisms for getting around this are:

- Use :ref:`OPTICS <optics>` clustering in conjunction with the
`extract_dbscan` method. OPTICS clustering also calculates the full
pairwise matrix, but only keeps one row in memory at a time (memory
complexity n).

- A sparse radius neighborhood graph (where missing entries are presumed to
be out of eps) can be precomputed in a memory-efficient way and dbscan
can be run over this with ``metric='precomputed'``. See
Expand All @@ -825,106 +814,6 @@ by black points below.
In Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996

.. _optics:

OPTICS
======

The :class:`OPTICS` algorithm shares many similarities with the
:class:`DBSCAN` algorithm, and can be considered a generalization of
DBSCAN that relaxes the ``eps`` requirement from a single value to a value
range. The key difference between DBSCAN and OPTICS is that the OPTICS
algorithm builds a *reachability* graph, which assigns each sample both a
``reachability_`` distance, and a spot within the cluster ``ordering_``
attribute; these two attributes are assigned when the model is fitted, and are
used to determine cluster membership. If OPTICS is run with the default value
of *inf* set for ``max_eps``, then DBSCAN style cluster extraction can be
performed in linear time for any given ``eps`` value using the
``extract_dbscan`` method. Setting ``max_eps`` to a lower value will result
in shorter run times, and can be thought of as the maximum cluster object size
(in diameter) that OPTICS will be able to extract.

.. |optics_results| image:: ../auto_examples/cluster/images/sphx_glr_plot_optics_001.png
:target: ../auto_examples/cluster/plot_optics.html
:scale: 50

.. centered:: |optics_results|

The *reachability* distances generated by OPTICS allow for variable density
extraction of clusters within a single data set. As shown in the above plot,
combining *reachability* distances and data set ``ordering_`` produces a
*reachability plot*, where point density is represented on the Y-axis, and
points are ordered such that nearby points are adjacent. 'Cutting' the
reachability plot at a single value produces DBSCAN like results; all points
above the 'cut' are classified as noise, and each time that there is a break
when reading from left to right signifies a new cluster. The default cluster
extraction with OPTICS looks at changes in slope within the graph to guess at
natural clusters. There are also other possibilities for analysis on the graph
itself, such as generating hierarchical representations of the data through
reachability-plot dendrograms. The plot above has been color-coded so that
cluster colors in planar space match the linear segment clusters of the
reachability plot-- note that the blue and red clusters are adjacent in the
reachability plot, and can be hierarchically represented as children of a
larger parent cluster.

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_cluster_plot_optics.py`


.. topic:: Comparison with DBSCAN

The results from OPTICS ``extract_dbscan`` method and DBSCAN are not quite
identical. Specifically, while *core_samples* returned from both OPTICS
and DBSCAN are guaranteed to be identical, labeling of periphery and noise
points is not. This is in part because the first sample processed by
OPTICS will always have a reachability distance that is set to ``inf``,
and will thus generally be marked as noise rather than periphery. This
affects adjacent points when they are considered as candidates for being
marked as either periphery or noise. While this effect is quite local to
the starting point of the dataset and is unlikely to be noticed on even
moderately large datasets, it is worth also noting that non-core boundry
points may switch cluster labels on the rare occasion that they are
equidistant to a competeing cluster due to how the graph is read from left
to right when assigning labels.

Note that for any single value of ``eps``, DBSCAN will tend to have a
shorter run time than OPTICS; however, for repeated runs at varying ``eps``
values, a single run of OPTICS may require less cumulative runtime than
DBSCAN. It is also important to note that OPTICS output can be unstable at
``eps`` values very close to the initial ``max_eps`` value. OPTICS seems
to produce near identical results to DBSCAN provided that ``eps`` passed to
``extract_dbscan`` is a half order of magnitude less than the inital
``max_eps`` that was used to fit; using a value close to ``max_eps``
will throw a warning, and using a value larger will result in an exception.

.. topic:: Computational Complexity

Spatial indexing trees are used to avoid calculating the full distance
matrix, and allow for efficient memory usage on large sets of samples.
Different distance metrics can be supplied via the ``metric`` keyword.

For large datasets, similar (but not identical) results can be obtained via
`HDBSCAN <https://hdbscan.readthedocs.io>`_. The HDBSCAN implementation is
multithreaded, and has better algorithmic runtime complexity than OPTICS--
at the cost of worse memory scaling. For extremely large datasets that
exhaust system memory using HDBSCAN, OPTICS will maintain *n* (as opposed
to *n^2* memory scaling); however, tuning of the ``max_eps`` parameter
will likely need to be used to give a solution in a reasonable amount of
wall time.

.. topic:: References:

* "OPTICS: ordering points to identify the clustering structure."
Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander.
In ACM Sigmod Record, vol. 28, no. 2, pp. 49-60. ACM, 1999.

* "Automatic extraction of clusters from hierarchical clustering
representations."
Sander, Jörg, Xuejie Qin, Zhiyong Lu, Nan Niu, and Alex Kovarsky.
In Advances in knowledge discovery and data mining,
pp. 75-87. Springer Berlin Heidelberg, 2003.

.. _birch:

Birch
Expand Down