[WIP] fixes OPTICS split_points detected as NOISE and last point not detected as outlier. #11857


Closed · wants to merge 29 commits

Conversation

@adrinjalali (Member) commented Aug 19, 2018

Fixes #11677, i.e.

  • Count split points as members of the cluster if they're not far from the cluster
  • Detect the single outlier if it's the last point in ordering_

Adds an example to the docstring (See #3846)

Fixes the numerical warnings when there's more than one np.inf present in reachability_, which happens if max_bound is small (default is np.inf).
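For context, a minimal sketch of how more than one np.inf can end up in reachability_ when max_bound is small (an illustration only, using the max_bound name as it stands in this PR; the data and the expected count are assumptions, not taken from the PR's tests):

import numpy as np
from sklearn.cluster import OPTICS

# Two tight groups plus one isolated point; with max_bound=2 the groups are
# never density-reachable from one another, so reachability_ keeps np.inf for
# the first point reached in each group as well as for the isolated point.
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [50., 50.],
              [100., 100.], [100., 101.]])
clust = OPTICS(min_samples=2, max_bound=2).fit(X)
print(np.isinf(clust.reachability_).sum())  # expect more than one inf here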

@adrinjalali adrinjalali changed the title fixes OPTICS split_points detected as NOISE and last point not detected as outlier. [WIP] fixes OPTICS split_points detected as NOISE and last point not detected as outlier. Aug 19, 2018
@jnothman (Member) left a comment

Thanks for tackling this. It is very much appreciated. What kinds of tests would you add?


last_point = ordering[-1]
if (core_distances_plot[last_point] * 1.5
>= reachability_plot[last_point]):
Member:

Unfortunately, you'll need to indent this further for PEP 8.

Member Author:

I think the extra indentation is not a PEP 8 requirement; it says both styles are okay. I'm agnostic on that and will add the indentation. Please note that it won't be consistent with the other parts of the code in the same file: there are other multiline if conditions with a single level of indentation. (A minimal sketch of the two styles follows.)
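For reference, a minimal sketch of the two continuation styles PEP 8 accepts for a multi-line condition (the names and values below are made up for illustration):

core_distance, reachability = 1.0, 1.2  # made-up values for illustration

# Style 1: continuation line indented to the same level as the body.
if (core_distance * 1.5
    >= reachability):
    keep_in_cluster = True

# Style 2: extra indentation to distinguish the condition from the body
# (what was requested in this review).
if (core_distance * 1.5
        >= reachability):
    keep_in_cluster = True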

@@ -562,10 +573,17 @@ def _extract_optics(ordering, reachability, maxima_ratio=.75,
labels[index] = clustid
is_core[index] = 1
clustid += 1

last_point = ordering[-1]
if (core_distances_plot[last_point] * 1.5
Member:

Is this a magic number?

Member Author:

Yes, it's a magic number which I've used in two places, and I'm not comfortable with it, i.e. I have no basis for it. I will need to do more tests to find an empirically sound value. It could also be made an input parameter (which would most probably always be left at its default); a hypothetical sketch is below.
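A hypothetical sketch of what exposing the constant could look like (the name last_point_ratio and the helper below are assumptions for illustration, not code from this PR):

def last_point_in_cluster(core_distance, reachability, last_point_ratio=1.5):
    # Hypothetical knob replacing the hard-coded 1.5: the last point of the
    # ordering is kept in its cluster when its reachability is within
    # last_point_ratio times its core distance; otherwise it is noise.
    return core_distance * last_point_ratio >= reachability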

local_max_1 = []
local_max_2 = []

# wouldn't list expansion be faster? -> realloc
Member:

probably, but not by a lot relative to the overall algorithm!

@jnothman jnothman added this to the 0.20 milestone Aug 19, 2018
@adrinjalali (Member Author):

I've added some tests, but many more tests could be written for this class than what we have now.

Also, are you comfortable with the init parameters and their descriptions? I feel like for a good portion of them the user needs to read the code to understand what they do. Would it be a good idea to add an experimental tag to the class so we could change the API in a non-backward-compatible way more easily?

@jnothman (Member):

I would like to hope that most or all subsequent changes here are small bug fixes rather than substantial changes to the algorithm. We have tended in the past to not release things as experimental. I know we have done so with a couple of things in pretty new territory in this release, but I'm not enthusiastic about doing it here.

I'll give this another review soon.

@adrinjalali (Member Author) commented Aug 21, 2018

There are some edge case issues left. This is the latest I encountered:

>>> from sklearn.cluster import OPTICS
>>> import numpy as np
>>> 
>>> np.random.seed(0)
>>> n_points_per_cluster = 3
>>> 
>>> C1 = [-5, -2] + .8 * np.random.randn(n_points_per_cluster, 2)
>>> C2 = [4, -1] + .1 * np.random.randn(n_points_per_cluster, 2)
>>> X = np.vstack((C1, np.array([[100, 200]]), C2))
>>> #X = np.vstack((C1, ))
>>> 
>>> clust = OPTICS(min_samples=2, max_bound=2).fit(X)
/home/adrin/Projects/github.com/sklearn/.venv/lib/python3.7/site-packages/sklearn/cluster/optics_.py:748: RuntimeWarning: invalid value encountered in double_scalars
  (avg_reach2 / reachability_plot[s]) > maxima_ratio):
>>> clust.labels_
array([0, 0, 0, 1, 1, 1, 1])
>>> 
>>> c = clust
>>> 
>>> print(np.hstack((X[c.ordering_, :], clust.labels_[c.ordering_, np.newaxis])))
[[ -3.58875812  -1.67987423   0.        ]
 [ -3.50595361  -2.7818223    0.        ]
 [ -4.21700961  -0.20728544   0.        ]
 [100.         200.           1.        ]
 [  4.09500884  -1.01513572   1.        ]
 [  3.98967811  -0.95894015   1.        ]
 [  4.01440436  -0.85457265   1.        ]]
>>> 
>>> print(np.hstack((c.reachability_[c.ordering_].reshape(-1, 1),
...                  c.core_distances_[c.ordering_].reshape(-1, 1),
...                  c.labels_[c.ordering_].reshape(-1, 1))))
[[           inf 1.10505481e+00 0.00000000e+00]
 [1.10505481e+00 1.10505481e+00 0.00000000e+00]
 [1.60100521e+00 1.60100521e+00 0.00000000e+00]
 [           inf 2.22611307e+02 1.00000000e+00]
 [           inf 1.19383853e-01 1.00000000e+00]
 [1.19383853e-01 1.07256525e-01 1.00000000e+00]
 [1.07256525e-01 1.07256525e-01 1.00000000e+00]]

I've already fixed the first line of _extract_optics as:

normalization_factor = np.max(reachability[reachability < np.inf])

But the rest of the code doesn't handle situations where there is more than one inf in the reachability distances.

EDIT: fixed.
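For context, a minimal sketch of the masking idea (an illustration of the approach, not necessarily the code that ended up in the PR; the sample array is made up):

import numpy as np

reachability = np.array([np.inf, 1.1, 1.6, np.inf, np.inf, 0.12, 0.11])

# Normalize using finite values only, so a second (or third) inf cannot leak in.
normalization_factor = np.max(reachability[reachability < np.inf])

# inf / inf is nan and triggers "invalid value encountered" warnings, so ratios
# such as avg_reach / reachability_plot[s] need the inf entries excluded first.
finite = np.isfinite(reachability)
ratios = np.full_like(reachability, np.nan)
ratios[finite] = reachability[finite] / normalization_factor
print(normalization_factor, ratios)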

@amueller (Member):

I haven't looked into OPTICS at all, but right now there's a test failure on 32 bit on master: https://travis-ci.org/MacPython/scikit-learn-wheels/jobs/404543529#L1208

Any input appreciated.

@adrinjalali (Member Author):

If I'm not mistaken, @espg has worked on that part.

That test checks the reachability_ values calculated by OPTICS. It's basically a data -> get neighbors -> get core dists -> calculate reachability -> extract clusters pipeline, and this array is the reachability one.

Values are set on this line: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/optics_.py#L405

 self.reachability_[unproc] = new_reach

I could spend some time looking into it further if I had access to a 32-bit system, but I don't. Sorry, but that's as much as I know, @amueller!

@espg (Contributor) commented Aug 22, 2018

@amueller I don't have access to a 32-bit system either, but I'm curious if the issue is in _optics_inner, which is called by quick_sort:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/optics_.py#L411

That code expects 64-bit floats, mainly because I wasn't sure how to write Cython code that handled both data types and didn't want to clutter the main class with conditional checks on the distances. That's why OPTICS requires a float array (if you're using Manhattan distances or some other integer-based distance, they're converted to floats when the X array is passed in). NumPy defaults to 64-bit floats as well... but I don't know how NumPy behaves on 32-bit machines.

I'd have to look at the output of the array and look at the mismatch to be sure, but my guess is that the numbers in the array maybe aren't different, but are perhaps out of order in one or two places where the reachability distances are "small" -- i.e., you may have numbers that are different in 64-bit and sorted from "b a" to "a b" due to 'a' being marginally smaller than 'b'... but in 32-bit those numbers could be truncated so that 'a' == 'b', and the default behavior in quick_sort is that tied numbers are left in the original order, so "b a" would stay "b a". (A tiny illustration of this truncation effect follows.)

If this is in fact what is happening, the data in the test code could be changed to increase the magnitude of the reference point array. Multiplying by 100 might be enough to ensure that small distances still maintain their magnitude ordering; provided, of course, that this is the actual cause of the test failure.
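A tiny illustration of the truncation effect described above (the two values are chosen for illustration only):

import numpy as np

a, b = 1.00000012, 1.0000001               # distinct as 64-bit floats
print(a == b)                              # False
print(np.float32(a) == np.float32(b))      # True: both round to the same 32-bit
                                           # float, so a tie-preserving sort would
                                           # leave them in their original order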

@adrinjalali (Member Author):

The error we see on #11878 isn't from the Cython code, @espg. I moved that code into Python, and it's the same issue. To be clear, almost all values are identical up to the sixth digit of precision; there's just one that differs, like this:

0.4293346721668264, 0.427348, diff: 0.0019866721668264087

The difference stays the same when I take quick_scan out of Cython.

P.S. All the tests are run on a 32-bit Debian 9 VM, @amueller :P

P.P.S. still trying to figure out what the issue is.

@espg (Contributor) commented Aug 22, 2018

So there's one point out of 1500 that is off by 0.002, but only in 32-bit mode... that's really odd. Do you know if NumPy still represents the numbers as 64-bit when running on a 32-bit machine? I'm wondering if maybe this is an issue in pairwise_distances, although I don't know why only one data point would be impacted...

@adrinjalali (Member Author) commented Aug 22, 2018 via email

@adrinjalali (Member Author):

@jnothman could you please give me a review on this one?

I'll be working on the 32-bit issue, but that can kinda be treated independently, since it's coming from a different part of the code anyway.

After you review, I'll remove/change my notes accordingly. You'll see some questions/comments from me in the code of this PR, on which I'd appreciate your feedback.

@adrinjalali (Member Author):

@espg, the issue persists if we multiply all inputs by 100. Then we'll have this as a large change:

73.32993374225046 74.76657291338213 -1.4366391711316737

The issue seems to be that on a 32-bit system the points are processed in a different order. And if the data is multiplied by 100, they're still processed in a different order, just a different one than without the scaling.

So on this line:

point = self._set_reach_dist(point, X, nbrs)

at some point the algorithm on 32-bit diverges from the path taken on a 64-bit system. My debugging is happening here: https://github.com/adrinjalali/scikit-learn/tree/bug/optics32

@adrinjalali (Member Author):

So I finally managed to figure out why OPTICS diverges on 32-bit from 64-bit.

The issue is that at some point the reachability distances (new_reach, to be specific) start to differ, with the difference being on the order of 1e-13. Then, in quick_scan, those values are compared to each other (with == and <), and that 1e-13 makes a difference there.

To test this, I added a tolerance to all comparisons in quick_scan as done here, and the final reachability results on 32-bit and 64-bit are now almost identical.
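A minimal sketch of the kind of tolerance-aware comparison meant here (the helper names and the 1e-7 threshold are assumptions for illustration; the linked branch may use different values):

def approx_equal(a, b, rel_tol=1e-7):
    # Treat values as equal when they differ only by platform-level noise,
    # so a ~1e-13 discrepancy can no longer flip an ordering decision.
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1.0)

def definitely_less(a, b, rel_tol=1e-7):
    return a < b and not approx_equal(a, b, rel_tol)

# e.g. inside quick_scan, `new_reach == old_reach` would become
# approx_equal(new_reach, old_reach), and `<` would become definitely_less.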

@espg (Contributor) commented Aug 24, 2018

Great work, @adrinjalali -- thank you for tracking that down!

@jnothman (Member) commented Aug 25, 2018 via email

@adrinjalali (Member Author):

I'm away for the weekend. Most probably I won't be able to open that PR before Monday :(

@jnothman (Member) commented Aug 26, 2018 via email

@jnothman (Member) left a comment

Some wonderful investigative work, @adrinjalali. I am a little concerned about the arbitrary heuristics that are already in the code, as well as those being added. We generally try to rely on the publication and maturity of an algorithm and on its existing heuristics. What can we do to limit the effect of arbitrary heuristic choices?

@@ -634,7 +692,8 @@ def _find_local_maxima(reachability_plot, neighborhood_size):
# if the point is a local maxima on the reachability plot with
# regard to neighborhood_size, insert it into priority queue and
# maxima list
if (reachability_plot[i] > reachability_plot[i - 1] and
# all inf values are local maxima.
if (reachability_plot[i] >= reachability_plot[i - 1] and
Member:

this seems fine to me
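For context, a standalone illustration of why >= matters when the plot contains consecutive inf values (sample array made up):

import numpy as np

r = np.array([1.0, np.inf, np.inf, 1.0])   # reachability plot with adjacent infs
i = 2
print(r[i] > r[i - 1])    # False: inf > inf fails, so strict > misses this inf
print(r[i] >= r[i - 1])   # True: >= lets every inf count as a local maximum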

return

# only check a certain ratio of points in the child
# nodes formed to the left and right of the maxima
# ...should check_ratio be a user settable parameter?
# NOTE: based on the description of maxima_ratio I'd guess this value
# should be 1, why is it less than 1?
Member:

@espg, could you comment here, please?

Member Author:

@espg, we're still not sure why the code doesn't include all the points here.

Contributor:

Not sure, but I think it was to allow for outliers (i.e. if you have an inf or split point). The original code was hierarchical, so you could have nested clusters belonging to the same parent cluster. My guess (and it is a guess) is that the check ratio addresses that case for nested clusters.

Contributor:

When I wrote this code, I recall it was because I was dealing with many closely nested candidate clusters, where the higher-level candidates in the cluster tree were being rejected for the lower-level candidates that were more tightly formed. I added this bit so I could tune the algorithm to be more inclusive of slightly further away points in the leaf clusters.

For the general purpose of having this in the library, this portion could probably be removed, as I don't think it's part of the original algorithm (@adrinjalali can confirm). This was the only place where I made any additional checks as compared to the paper.

Member Author:

Yes, I read the paper yesterday (here for reference), and this is not a part of the main algorithm. However, I wouldn't be opposed to including this as a parameter, with 1 as the default value, as in the paper. We may need to alter the algorithm a bit for the split-points issue anyway, and I know the other implementations (at least in R, although not of this algorithm but of the original OPTICS) don't strictly follow the publication and may include some additions. But that's a different issue, and I'm still working out how best to articulate my thoughts on it.


clust = OPTICS(min_samples=3).fit(X)

assert_array_equal(clust.labels_, np.r_[[0]*4, [1]*4, -1])
Member:

We prefer spaces around * unless they are omitted to show arithmetic precedence, e.g. np.r_[[0] * 4, [1] * 4, -1].

NOTE: should we call this parameter min_neighbors, or n_neighbors,
to be more consistent with the NearestNeighbors API, which this parameter
is used for?
min_samples : int (default=5)
Member Author:

@jnothman sorry, I was talking about this parameter, which is different from min_cluster_size.

My comment here was:

should we call this parameter min_neighbors, or n_neighbors,
to be more consistent with the NearestNeighbors API, which this parameter
is used for?

Member:

No, call it min_samples for consistency with DBSCAN.

Member Author:

Yep! Hadn't considered DBSCAN's terminology!

pair of instances (rows) and the resulting value recorded. The callable
should take two arrays from X as input and return a value indicating
the distance between them. The default is "euclidean" which is
interpreted as squared euclidean distance.
Member Author:

Regarding metric, we can change it as suggested in my comment (which I took from t-SNE).

Member:

Huh. I didn't notice that we used squared euclidean here... DBSCAN uses plain euclidean for eps, doesn't it??

Member Author:

We don't; it's a mistake in the docstring. The default is actually plain euclidean. I'll open another issue/PR regarding the metric issues.
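A quick standalone check (illustration only) that scikit-learn's "euclidean" metric gives plain, not squared, distances:

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.array([[0.0, 0.0], [3.0, 4.0]])
print(pairwise_distances(X, metric="euclidean"))
# The off-diagonal entries are 5.0 (not 25.0), i.e. unsquared euclidean distance.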

if (core_distances_plot[-1] * 5 >= reachability_plot[-1]
and reachability_plot[-1] > reachability_plot[-2] * 5):
labels[last_point] = -1
is_core[last_point] = 0
Member Author:

This is regarding having an outlier in the last position of ordering_. This checks for that the same way it checks if a split point is an outlier, since it would be a split point if it wasn't the last point in the ordering_.

@@ -645,7 +704,7 @@ def _find_local_maxima(reachability_plot, neighborhood_size):


def _cluster_tree(node, parent_node, local_maxima_points,
reachability_plot, reachability_ordering,
reachability_plot, core_distances_plot, ordering,
Member Author:

Changed reachability_ordering to ordering since the same ordering applies (and is calculated) for both arrays (reachability and core_distances).

@adrinjalali (Member Author):

@jnothman I kinda need help figuring out how to proceed here.

@jnothman (Member) left a comment

I'm sorry I'm finding it hard to find time to thoroughly digest the nuances of the algorithm.

It seems to me that handling the multiple-inf case is obviously what we want. I think you also have some good improvements to the code and docs here in general (and maybe these would be more easily submitted as another PR). I like your tests too. I think the only thing I remain uncertain about is the use of that magic * 5 in determining whether the last point is noise.

It looks like this may be failing on one Travis setup.

Could you please remind me what other questions are outstanding? Thanks, and sorry for my scattered attention to it.

max_bound : float, optional
NOTE: max_radius (similar to NearestNeighbors API)
or max_eps (to follow paper's jargon)
max_bound : float, optional (default=np.inf)
Member:

I'm happy with max_eps. I thought I responded to this elsewhere.

@jnothman (Member) commented Sep 3, 2018

Maybe put separate changes like max_bound -> max_eps in a separate PR so this one remains tractable to review.

@qinhanmin2014 (Member):

+1 for max_bound -> max_eps; it seems more reasonable. (After the change we would use max_eps and eps, while R's dbscan uses eps and eps_cl.)
@adrinjalali Could you please put the rename in a separate PR? Thanks.

@espg (Contributor) commented Sep 4, 2018

@adrinjalali I've been trying to catch up on this thread... did you still have a pending question for me?
