[WIP] fixes OPTICS split_points detected as NOISE and last point not detected as outlier. #11857


Closed · wants to merge 29 commits

Conversation

@adrinjalali (Member) commented Aug 19, 2018

Fixes #11677, i.e.

  • Count split points as members of the cluster if they're not far from the cluster
  • Detect the single outlier if it's the last point in ordering_

Adds an example to the docstring (See #3846)

Fixes the numerical warnings when there's more than one np.inf present in reachability_, which happens if max_bound is small (default is np.inf).
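For context, a minimal sketch of how more than one np.inf can end up in reachability_ when max_bound is small (an illustration only, using the max_bound name as it stands in this PR; the data and the expected count are assumptions, not taken from the PR's tests):

import numpy as np
from sklearn.cluster import OPTICS

# Two tight groups plus one isolated point; with max_bound=2 the groups are
# never density-reachable from one another, so reachability_ keeps np.inf for
# the first point reached in each group as well as for the isolated point.
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [50., 50.],
              [100., 100.], [100., 101.]])
clust = OPTICS(min_samples=2, max_bound=2).fit(X)
print(np.isinf(clust.reachability_).sum())  # expect more than one inf here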

@adrinjalali adrinjalali changed the title fixes OPTICS split_points detected as NOISE and last point not detected as outlier. [WIP] fixes OPTICS split_points detected as NOISE and last point not detected as outlier. Aug 19, 2018
@jnothman (Member) left a comment

Thanks for tackling this. It is very much appreciated. What kinds of tests would you add?


last_point = ordering[-1]
if (core_distances_plot[last_point] * 1.5
>= reachability_plot[last_point]):
Member:

Unfortunately, you'll need to indent this further for PEP 8.

Member Author:

I think the extra indentation is not a PEP 8 requirement; it says both styles are okay. I'm agnostic on that and will add the indentation. Please note that it won't be consistent with the other parts of the code in the same file: there are other multiline if conditions with a single level of indentation. (A minimal sketch of the two styles follows.)
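For reference, a minimal sketch of the two continuation styles PEP 8 accepts for a multi-line condition (the names and values below are made up for illustration):

core_distance, reachability = 1.0, 1.2  # made-up values for illustration

# Style 1: continuation line indented to the same level as the body.
if (core_distance * 1.5
    >= reachability):
    keep_in_cluster = True

# Style 2: extra indentation to distinguish the condition from the body
# (what was requested in this review).
if (core_distance * 1.5
        >= reachability):
    keep_in_cluster = True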

@@ -562,10 +573,17 @@ def _extract_optics(ordering, reachability, maxima_ratio=.75,
labels[index] = clustid
is_core[index] = 1
clustid += 1

last_point = ordering[-1]
if (core_distances_plot[last_point] * 1.5
Member:

Is this a magic number?

Member Author:

Yes, it's a magic number which I've used in two places, and I'm not comfortable with it, i.e. I have no basis for it. I will need to do more tests to find an empirically sound value. It could also be made an input parameter (which would most probably always be left at its default); a hypothetical sketch is below.
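A hypothetical sketch of what exposing the constant could look like (the name last_point_ratio and the helper below are assumptions for illustration, not code from this PR):

def last_point_in_cluster(core_distance, reachability, last_point_ratio=1.5):
    # Hypothetical knob replacing the hard-coded 1.5: the last point of the
    # ordering is kept in its cluster when its reachability is within
    # last_point_ratio times its core distance; otherwise it is noise.
    return core_distance * last_point_ratio >= reachability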

local_max_1 = []
local_max_2 = []

# wouldn't list expansion be faster? -> realloc
Member:

probably, but not by a lot relative to the overall algorithm!

@jnothman jnothman added this to the 0.20 milestone Aug 19, 2018
@adrinjalali (Member Author):

I've added some tests, but many more tests could be written for this class than what we have now.

Also, are you comfortable with the init parameters and their descriptions? I feel like for a good portion of them the user needs to read the code to understand what they do. Would it be a good idea to add an experimental tag to the class so we could change the API in a non-backward-compatible way more easily?

@jnothman (Member):

I would like to hope that most or all subsequent changes here are small bug fixes rather than substantial changes to the algorithm. We have tended in the past to not release things as experimental. I know we have done so with a couple of things in pretty new territory in this release, but I'm not enthusiastic about doing it here.

I'll give this another review soon.

@adrinjalali (Member Author) commented Aug 21, 2018

There are some edge case issues left. This is the latest I encountered:

>>> from sklearn.cluster import OPTICS
>>> import numpy as np
>>> 
>>> np.random.seed(0)
>>> n_points_per_cluster = 3
>>> 
>>> C1 = [-5, -2] + .8 * np.random.randn(n_points_per_cluster, 2)
>>> C2 = [4, -1] + .1 * np.random.randn(n_points_per_cluster, 2)
>>> X = np.vstack((C1, np.array([[100, 200]]), C2))
>>> #X = np.vstack((C1, ))
>>> 
>>> clust = OPTICS(min_samples=2, max_bound=2).fit(X)
/home/adrin/Projects/github.com/sklearn/.venv/lib/python3.7/site-packages/sklearn/cluster/optics_.py:748: RuntimeWarning: invalid value encountered in double_scalars
  (avg_reach2 / reachability_plot[s]) > maxima_ratio):
>>> clust.labels_
array([0, 0, 0, 1, 1, 1, 1])
>>> 
>>> c = clust
>>> 
>>> print(np.hstack((X[c.ordering_, :], clust.labels_[c.ordering_, np.newaxis])))
[[ -3.58875812  -1.67987423   0.        ]
 [ -3.50595361  -2.7818223    0.        ]
 [ -4.21700961  -0.20728544   0.        ]
 [100.         200.           1.        ]
 [  4.09500884  -1.01513572   1.        ]
 [  3.98967811  -0.95894015   1.        ]
 [  4.01440436  -0.85457265   1.        ]]
>>> 
>>> print(np.hstack((c.reachability_[c.ordering_].reshape(-1, 1),
...                  c.core_distances_[c.ordering_].reshape(-1, 1),
...                  c.labels_[c.ordering_].reshape(-1, 1))))
[[           inf 1.10505481e+00 0.00000000e+00]
 [1.10505481e+00 1.10505481e+00 0.00000000e+00]
 [1.60100521e+00 1.60100521e+00 0.00000000e+00]
 [           inf 2.22611307e+02 1.00000000e+00]
 [           inf 1.19383853e-01 1.00000000e+00]
 [1.19383853e-01 1.07256525e-01 1.00000000e+00]
 [1.07256525e-01 1.07256525e-01 1.00000000e+00]]

I've already fixed the first line of _extract_optics as:

normalization_factor = np.max(reachability[reachability < np.inf])

But the rest of the code doesn't handle situations where there is more than one inf in the reachability distances.

EDIT: fixed.
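For context, a minimal sketch of the masking idea (an illustration of the approach, not necessarily the code that ended up in the PR; the sample array is made up):

import numpy as np

reachability = np.array([np.inf, 1.1, 1.6, np.inf, np.inf, 0.12, 0.11])

# Normalize using finite values only, so a second (or third) inf cannot leak in.
normalization_factor = np.max(reachability[reachability < np.inf])

# inf / inf is nan and triggers "invalid value encountered" warnings, so ratios
# such as avg_reach / reachability_plot[s] need the inf entries excluded first.
finite = np.isfinite(reachability)
ratios = np.full_like(reachability, np.nan)
ratios[finite] = reachability[finite] / normalization_factor
print(normalization_factor, ratios)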

@amueller (Member):

I haven't looked into OPTICS at all, but right now there's a test failure on 32 bit on master: https://travis-ci.org/MacPython/scikit-learn-wheels/jobs/404543529#L1208

Any input appreciated.

@adrinjalali (Member Author):

If I'm not mistaken, @espg has worked on that part.

That test checks the reachability_ values calculated by OPTICS. It's basically a data -> get neighbors -> get core dists -> calculate reachability -> extract clusters pipeline, and this array is the reachability one.

Values are set on this line: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/optics_.py#L405

 self.reachability_[unproc] = new_reach

I could spend some time looking into it further if I had access to a 32-bit system, but I don't. Sorry, but that's as much as I know, @amueller!

@espg (Contributor) commented Aug 22, 2018

@amueller I don't have access to a 32-bit system either, but I'm curious if the issue is in _optics_inner, which is called by quick_sort:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/optics_.py#L411

That code expects 64-bit floats, mainly because I wasn't sure how to write Cython code that handled both data types and didn't want to clutter the main class with conditional checks on the distances. That's why OPTICS requires a float array (if you're using Manhattan distances or some other integer-based distance, they're converted to floats when the X array is passed in). NumPy defaults to 64-bit floats as well... but I don't know how NumPy behaves on 32-bit machines.

I'd have to look at the output of the array and look at the mismatch to be sure, but my guess is that the numbers in the array maybe aren't different, but are perhaps out of order in one or two places where the reachability distances are "small" -- i.e., you may have numbers that are different in 64-bit and sorted from "b a" to "a b" due to 'a' being marginally smaller than 'b'... but in 32-bit those numbers could be truncated so that 'a' == 'b', and the default behavior in quick_sort is that tied numbers are left in the original order, so "b a" would stay "b a". (A tiny illustration of this truncation effect follows.)

If this is in fact what is happening, the data in the test code could be changed to increase the magnitude of the reference point array. Multiplying by 100 might be enough to ensure that small distances still maintain their magnitude ordering; provided, of course, that this is the actual cause of the test failure.
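A tiny illustration of the truncation effect described above (the two values are chosen for illustration only):

import numpy as np

a, b = 1.00000012, 1.0000001               # distinct as 64-bit floats
print(a == b)                              # False
print(np.float32(a) == np.float32(b))      # True: both round to the same 32-bit
                                           # float, so a tie-preserving sort would
                                           # leave them in their original order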

@adrinjalali (Member Author):

The error we see on #11878 isn't from the Cython code, @espg. I moved that code into Python, and it's the same issue. To be clear, almost all values are identical up to the sixth digit of precision; there's just one that differs, like this:

0.4293346721668264, 0.427348, diff: 0.0019866721668264087

The difference stays the same when I take quick_scan out of Cython.

P.S. All the tests are run on a 32-bit Debian 9 VM, @amueller :P

P.P.S. still trying to figure out what the issue is.

@espg (Contributor) commented Aug 22, 2018

So there's one point out of 1500 that is off by 0.002, but only in 32-bit mode... that's really odd. Do you know if NumPy still represents the numbers as 64-bit when running on a 32-bit machine? I'm wondering if maybe this is an issue in pairwise_distances, although I don't know why only one data point would be impacted...

@adrinjalali (Member Author) commented Aug 22, 2018 via email

@adrinjalali (Member Author):

@jnothman could you please give me a review on this one?

I'll be working on the 32-bit issue, but that can kinda be treated independently, since it's coming from a different part of the code anyway.

After you review, I'll remove/change my notes accordingly. You'll see some questions/comments from me in the code of this PR, on which I'd appreciate your feedback.

@adrinjalali (Member Author):

@espg, the issue persists if we multiply all inputs by 100. Then we'll have this as a large change:

73.32993374225046 74.76657291338213 -1.4366391711316737

The issue seems to be that on a 32-bit system the points are processed in a different order. And if the data is multiplied by 100, they're still processed in a different order, just a different one than without the scaling.

So on this line:

point = self._set_reach_dist(point, X, nbrs)

at some point the algorithm on 32-bit diverges from the path taken on a 64-bit system. My debugging is happening here: https://github.com/adrinjalali/scikit-learn/tree/bug/optics32

@adrinjalali (Member Author):

So I finally managed to figure out why OPTICS diverges on 32-bit from 64-bit.

The issue is that at some point the reachability distances (new_reach, to be specific) start to differ, with the difference being on the order of 1e-13. Then, in quick_scan, those values are compared to each other (with == and <), and that 1e-13 makes a difference there.

To test this, I added a tolerance to all comparisons in quick_scan as done here, and the final reachability results on 32-bit and 64-bit are now almost identical.
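A minimal sketch of the kind of tolerance-aware comparison meant here (the helper names and the 1e-7 threshold are assumptions for illustration; the linked branch may use different values):

def approx_equal(a, b, rel_tol=1e-7):
    # Treat values as equal when they differ only by platform-level noise,
    # so a ~1e-13 discrepancy can no longer flip an ordering decision.
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1.0)

def definitely_less(a, b, rel_tol=1e-7):
    return a < b and not approx_equal(a, b, rel_tol)

# e.g. inside quick_scan, `new_reach == old_reach` would become
# approx_equal(new_reach, old_reach), and `<` would become definitely_less.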

@espg (Contributor) commented Aug 24, 2018

Great work, @adrinjalali -- thank you for tracking that down!

@jnothman (Member) commented Aug 25, 2018 via email

@adrinjalali (Member Author):

I'm away for the weekend. Most probably I won't be able to open that PR before Monday :(

@jnothman (Member) commented Aug 26, 2018 via email

@jnothman (Member) left a comment

Some wonderful investigative work, @adrinjalali. I am a little concerned about the arbitrary heuristics that are already in the code, as well as those being added. We generally try to rely on the publication and maturity of an algorithm and on its existing heuristics. What can we do to limit the effect of arbitrary heuristic choices?

@@ -634,7 +692,8 @@ def _find_local_maxima(reachability_plot, neighborhood_size):
# if the point is a local maxima on the reachability plot with
# regard to neighborhood_size, insert it into priority queue and
# maxima list
if (reachability_plot[i] > reachability_plot[i - 1] and
# all inf values are local maxima.
if (reachability_plot[i] >= reachability_plot[i - 1] and
Member:

this seems fine to me
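For context, a standalone illustration of why >= matters when the plot contains consecutive inf values (sample array made up):

import numpy as np

r = np.array([1.0, np.inf, np.inf, 1.0])   # reachability plot with adjacent infs
i = 2
print(r[i] > r[i - 1])    # False: inf > inf fails, so strict > misses this inf
print(r[i] >= r[i - 1])   # True: >= lets every inf count as a local maximum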

return

# only check a certain ratio of points in the child
# nodes formed to the left and right of the maxima
# ...should check_ratio be a user settable parameter?
# NOTE: based on the description of maxima_ratio I'd guess this value
# should be 1, why is it less than 1?
Member:

@espg, could you comment here, please?

Member Author:

@espg, we're still not sure why the code doesn't include all the points here.

Contributor:

Not sure, but I think it was to allow for outliers (i.e. if you have an inf or split point). The original code was hierarchical, so you could have nested clusters belonging to the same parent cluster. My guess (and it is a guess) is that the check ratio addresses that case for nested clusters.

Contributor:

When I wrote this code, I recall it was because I was dealing with many closely nested candidate clusters, where the higher-level candidates in the cluster tree were being rejected for the lower-level candidates that were more tightly formed. I added this bit so I could tune the algorithm to be more inclusive of slightly further away points in the leaf clusters.

For the general purpose of having this in the library, this portion could probably be removed, as I don't think it's part of the original algorithm (@adrinjalali can confirm). This was the only place where I made any additional checks as compared to the paper.

Member Author:

Yes, I read the paper yesterday (here for reference), and this is not a part of the main algorithm. However, I wouldn't be opposed to including this as a parameter, with 1 as the default value, as in the paper. We may need to alter the algorithm a bit for the split-points issue anyway, and I know the other implementations (at least in R, although not of this algorithm but of the original OPTICS) don't strictly follow the publication and may include some additions. But that's a different issue, and I'm still working out how best to articulate my thoughts on it.


clust = OPTICS(min_samples=3).fit(X)

assert_array_equal(clust.labels_, np.r_[[0]*4, [1]*4, -1])
Member:

We prefer spaces around * unless they are omitted to show arithmetic precedence, e.g. np.r_[[0] * 4, [1] * 4, -1].

NOTE: should we call this parameter min_neighbors, or n_neighbors,
to be more consistent with the NearestNeighbors API, which this parameter
is used for?
min_samples : int (default=5)
Member Author:

@jnothman sorry, I was talking about this parameter, which is different from min_cluster_size.

My comment here was:

should we call this parameter min_neighbors, or n_neighbors,
to be more consistent with the NearestNeighbors API, which this parameter
is used for?

Member:

No, call it min_samples for consistency with DBSCAN.

Member Author:

Yep! Hadn't considered DBSCAN's terminology!

pair of instances (rows) and the resulting value recorded. The callable
should take two arrays from X as input and return a value indicating
the distance between them. The default is "euclidean" which is
interpreted as squared euclidean distance.
Member Author:

Regarding metric, we can change it as suggested in my comment (which I took from t-SNE).

Member:

Huh. I didn't notice that we used squared euclidean here... DBSCAN uses plain euclidean for eps, doesn't it??

Member Author:

We don't; it's a mistake in the docstring. The default is actually plain euclidean. I'll open another issue/PR regarding the metric issues.
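A quick standalone check (illustration only) that scikit-learn's "euclidean" metric gives plain, not squared, distances:

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.array([[0.0, 0.0], [3.0, 4.0]])
print(pairwise_distances(X, metric="euclidean"))
# The off-diagonal entries are 5.0 (not 25.0), i.e. unsquared euclidean distance.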

if (core_distances_plot[-1] * 5 >= reachability_plot[-1]
and reachability_plot[-1] > reachability_plot[-2] * 5):
labels[last_point] = -1
is_core[last_point] = 0
Member Author:

This is regarding having an outlier in the last position of ordering_. This checks for that the same way it checks if a split point is an outlier, since it would be a split point if it wasn't the last point in the ordering_.

@@ -645,7 +704,7 @@ def _find_local_maxima(reachability_plot, neighborhood_size):


def _cluster_tree(node, parent_node, local_maxima_points,
reachability_plot, reachability_ordering,
reachability_plot, core_distances_plot, ordering,
Member Author:

Changed reachability_ordering to ordering since the same ordering applies (and is calculated) for both arrays (reachability and core_distances).

@adrinjalali (Member Author):

@jnothman I kinda need help figuring out how to proceed here.

@jnothman (Member) left a comment

I'm sorry I'm finding it hard to find time to thoroughly digest the nuances of the algorithm.

It seems to me that handling the multiple-inf case is obviously what we want. I think you also have some good improvements to the code and docs here in general (and maybe these would be more easily submitted as another PR). I like your tests too. I think the only thing I remain uncertain about is the use of that magic * 5 in determining whether the last point is noise.

It looks like this may be failing on one Travis setup.

Could you please remind me what other questions are outstanding? Thanks, and sorry for my scattered attention to it.

max_bound : float, optional
NOTE: max_radius (similar to NearestNeighbors API)
or max_eps (to follow paper's jargon)
max_bound : float, optional (default=np.inf)
Member:

I'm happy with max_eps. I thought I responded to this elsewhere.

@jnothman (Member) commented Sep 3, 2018

Maybe put separate changes like max_bound -> max_eps in a separate PR so this one remains tractable to review.

@qinhanmin2014 (Member):

+1 for max_bound -> max_eps; it seems more reasonable. (After the change we would use max_eps and eps, while R's dbscan uses eps and eps_cl.)
@adrinjalali Could you please put the rename in a separate PR? Thanks.

@espg (Contributor) commented Sep 4, 2018

@adrinjalali I've been trying to catch up on this thread... did you still have a pending question for me?
