
[WIP] OPTICS include split points in clusters if not noise #12049


Closed

Conversation

adrinjalali
Member

Fixes #11677 (OPTICS detecting the wrong outlier)
See PR #11857 (related discussions)

Borrowing the concept of Xi-steep points and the corresponding parameter, this PR proposes including a split point in a cluster if it is a steep downward point. It also checks whether the last point of a cluster is noise.
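
A minimal sketch of the rule being proposed, assuming a 1-D reachability_plot array and the Xi parameter from the OPTICS paper (the helper name and values are illustrative, not the PR's actual code):

import numpy as np

def include_split_point(reachability_plot, s, xi=0.05):
    """Return True if point s is xi-steep downward, i.e. its reachability
    drops by at least a factor of (1 - xi) towards the next point."""
    return reachability_plot[s] * (1 - xi) >= reachability_plot[s + 1]

reachability_plot = np.array([2.0, 1.5, 0.4, 0.35, 0.3, 1.8])
print(include_split_point(reachability_plot, 1))  # True: split point joins the cluster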

@adrinjalali adrinjalali changed the title OPTICS include split points in clusters if not noise [MRG] OPTICS include split points in clusters if not noise Sep 10, 2018
@adrinjalali
Member Author

@kno10, with the current extraction method we also have the issue of not including the split points in the clusters. Here I'm proposing something similar to the xi-steep criterion to detect the split points which should be part of the current cluster, and using the same principle for the last point.

I'd really appreciate your feedback on this.

@espg, do you think you could add the predecessor while extracting reachability with reasonably little effort?

(#11677 (comment))

@adrinjalali The dbscan package states: "The OPTICSXi R implementation was directly ported from the ELKI framework's Java implementation". But the same technique can (and should) be used with any other plot-based extraction technique.
The article mentioned above http://ceur-ws.org/Vol-2191/paper37.pdf explains why the "last" point of a valley may or may not belong to the cluster, and how to use the predecessor information to distinguish these two cases. If you always include the last point, you'll get "spike" artifacts.

@espg
Contributor

espg commented Sep 12, 2018

@adrinjalali I think so. It's likely a pretty small change to store this, but I'll need to look through the referenced paper to make sure that I understand the definition of the predecessor. My guess is that it will take more time to do that than to actually code it... @kno10 can probably do this faster, or alternatively provide some guidance on reviewing the change.

@adrinjalali adrinjalali changed the title [MRG] OPTICS include split points in clusters if not noise [WIP] OPTICS include split points in clusters if not noise Sep 15, 2018
s + 1, node.end, node)

# check if s is xi-steep downward
if reachability_plot[s] * (1 - xi) >= reachability_plot[s + 1]:

adrinjalali (Member Author) commented on this diff:

use proper _steep_upward and/or _steep_downward instead.

@espg
Contributor

espg commented Sep 16, 2018

@adrinjalali I'm about halfway done reading the reference paper. We may be able to add predecessor as a post-processing step on the reachability graph itself, rather than in the main computation loop (i.e., by recursively searching backwards for the last instance of reachability that exceeds the current reachability value)... still wrapping my head around the exact formalism of the predecessor value, and the best (cleanest?) way to include it in the OPTICS output.

@adrinjalali
Member Author

Thanks @espg. At least to test the parts I'd be writing using the predecessor, I'd be fine starting with a slow version and then improving the performance if possible, as long as the results are consistent.

@kno10
Contributor

kno10 commented Sep 16, 2018

@espg you cannot just use the previous point with higher reachability on the plot as predecessor. You need to store the actual edge of the spanning tree used to distinguish these cases.
I haven't looked at the sklearn implementation (and, e.g., with DBSCAN the sklearn version is using a different, potentially much more memory intensive, algorithm instead of the original DBSCAN). But it should be easy if your code is close to the original OPTICS: just add another column to store the predecessor.
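
A hedged sketch of what "another column" could look like inside an OPTICS-style expansion loop; the names (reachability_, predecessor_, core_dist_p) are illustrative and not taken from the scikit-learn code:

import numpy as np

def update_reachability(reachability_, predecessor_, core_dist_p, p, neighbors, dists):
    """While expanding point p, every unprocessed neighbor whose reachability
    improves is now reached *from* p, so p is recorded as its predecessor."""
    new_reach = np.maximum(core_dist_p, dists)
    improved = new_reach < reachability_[neighbors]
    reachability_[neighbors[improved]] = new_reach[improved]
    predecessor_[neighbors[improved]] = p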

@kno10
Contributor

kno10 commented Sep 16, 2018

As a side note, I doubt this claim is correct: "Optimized for usage on large point datasets."

It will actually do twice as much work as necessary. First it computes the k-NN distance for each point, stores that, and then later again performs a radius query for each point. In particular, it will temporarily need O(N*k) memory, while the original OPTICS only needs O(N) memory, and only radius queries (so I would expect it to be faster, why query twice?).

I doubt this improves scalability...

Above claim should either be explained (how is this "optimized"?) or removed.

@espg
Contributor

espg commented Sep 16, 2018 via email

@jnothman
Member

jnothman commented Sep 16, 2018 via email

@kno10
Contributor

kno10 commented Sep 16, 2018

This will still need O(n*k) memory, won't it? min_samples may need to be 100+ for large data sets.
Plus it takes extra time to do this and the radius searches.

self.core_distances_[:] = nbrs.kneighbors(X,
                                           self.min_samples)[0][:, -1]

@espg
Contributor

espg commented Sep 17, 2018

@kno10 Most of your paper discusses how to use predecessor information, and there is very little information regarding how you actually extract it. There also isn't a formal definition of what predecessor is as related to the reference OPTICS algorithm. Predecessor to what? Predecessor to ExpandClusterOrder? Predecessor to OrderSeeds? You do reproduce an abbreviated version of the OPTICS pseudo code, but don't include any reference in the pseudo code to where predecessor is defined or extracted for your contribution.

I'd like to know why we can't extract it from the reachability plot directly. This is the conceptual diagram that seems to provide the closest formalism to what predecessor actually is:

[figure: illustration of the predecessor from the referenced paper]

This seems to imply that we can start at the end of the reachability plot, and search backwards for the last instance that exceeded the reachability distance of the current point; the index of that point would be the predecessor... repeating this moving backwards would yield predecessors for each point. This is a simple linear scan that isn't terribly hard to implement. Then we'd simply check if an end point on the reachability graph has a predecessor that belongs to the current cluster, or to a different cluster.
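
A rough, naive sketch of the backward-scan heuristic described here (purely illustrative; as kno10 replies below, this is not necessarily the true OPTICS predecessor, which comes from the spanning-tree edge actually used):

import numpy as np

def backward_scan_predecessors(reachability_plot):
    """For each position i in the cluster order, find the most recent earlier
    position whose reachability exceeds reachability_plot[i]."""
    n = len(reachability_plot)
    predecessor = np.full(n, -1)
    for i in range(1, n):
        for j in range(i - 1, -1, -1):
            if reachability_plot[j] > reachability_plot[i]:
                predecessor[i] = j
                break
    return predecessor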

@jnothman
Member

jnothman commented Sep 17, 2018 via email

@espg
Contributor

espg commented Sep 17, 2018

For what it's worth, this was considered at some point during the original PR. There is object overhead for calling the distance function in the python loop... it was far cheaper in terms of runtime to do the core_distances all at once than to do it in the loop. Also, doing it this way enables n_jobs to be set to use multiprocessing, parallelizing a portion of an algorithm that is generally not parallelizable. That said, it's probably worth looking at other solutions to this... but we should benchmark them to make sure that they actually improve the algorithm performance.

@jnothman
Member

jnothman commented Sep 17, 2018 via email

@espg
Contributor

espg commented Sep 17, 2018

I should be asleep right now, but there are things that have been mentioned in this thread that are distracting me...

I can see other redundant work in the code there (the pairwise_distances
call is unnecessary).

This isn't true.

indices = nbrs.radius_neighbors(P, radius=self.max_eps,
                                return_distance=False)[0]

The line above calls a neighborhood function without calculating distances or sorting results. It is much faster than calling radius_neighbors normally, and is a prime benefit of passing the data in a spatially indexed form. This line:

unproc = np.compress((~np.take(processed, indices)).ravel(),
                     indices, axis=0)

filters out the points that have already been processed. Finally, this line:

dists = pairwise_distances(P, np.take(X, unproc, axis=0),
                           self.metric, n_jobs=1).ravel()

calculates the distances. We have to filter out points that have already been processed -- it's part of the OPTICS algorithm. Calling radius_neighbors "normally" will automatically call pairwise distances to return the distances. We are suppressing the distance calculations until after the points have been pruned to only the unprocessed candidates. We are not performing redundant distance calculations. Assuming max_bound = inf, the three lines above cause us to calculate the upper triangle of the distance matrix; if we didn't have the pairwise call and just used radius_neighbors on its own, we would calculate the full distance matrix. We used to do this; it was changed and benchmarked, and found to be ~70 percent faster. Again, the removal of the processed points has to happen anyway -- we might as well not calculate the distances for the points that aren't candidates.
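
Pulled together, a hedged sketch of the three steps just described (assuming nbrs is a fitted NearestNeighbors index, processed is a boolean mask, and P is a single sample of shape (1, n_features); the helper name is illustrative):

import numpy as np
from sklearn.metrics import pairwise_distances

def distances_to_unprocessed(P, X, nbrs, processed, max_eps, metric):
    # 1) index-only range query: no distances computed, no sorting
    indices = nbrs.radius_neighbors(P, radius=max_eps, return_distance=False)[0]
    # 2) drop neighbors that OPTICS has already processed
    unproc = np.compress(~np.take(processed, indices), indices, axis=0)
    # 3) compute distances only for the remaining candidates
    dists = pairwise_distances(P, np.take(X, unproc, axis=0), metric=metric).ravel()
    return unproc, dists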

As for this point:

This will still need O(n*k) memory, won't it? min_samples may need to be 100+ for large data sets.
Plus it takes extra time to do this and the radius searches.

I don't really see how you can get around doing the radius searches and the k-neighbors searches. You don't know a priori what the epsilon / max bound value is going to be. Maybe epsilon is large enough so that the range query will include the core distance point, but we have no way of knowing ahead of time what the user is going to select for either min_points or max_bound. If you have a dataset that has 1,000,000 points, and you want to set min_points = 100, then you have to do a million k=100 nearest neighbor queries to find out the core distances. There isn't any way around that -- you need to select the k=100 neighbors, and then calculate the distance to all of them to sort and determine the 100th closest point's index and its distance. You can do this in the loop if you want, or you can do it en masse at the beginning of the algorithm.

If you do it in the loop, you reduce memory. Maybe that matters to the user, maybe it doesn't -- a million points queried en masse for the 100 nearest neighbors will use 8GB of RAM, which is likely fine for most users.

If you have enough RAM, it's substantially faster to do it en masse. Rather than calling the same python function a million times, you just call it once and avoid a massive amount of python object overhead. Plus, you can use multiprocessing. Using multiprocessing to calculate 100 million distances in the same call is great -- you can't do that in the loop, because you'd set up and tear down multiprocessing a million times to multiprocess 100 distance lookups.

So yes, I think that chunking the mass lookup of core_distances is a great idea. But I don't think that we should move it back inside the main loop given that we know it will be slower. And I don't think that you can merge the core_distance lookups with the radius query without a substantial number of assumptions that may not hold -- one of which would probably require calculating the full distance matrix, imposing the ~70% performance penalty again.

@jnothman
Member

Okay. I see what you're getting at, and I finally recall that we're going round in circles with respect to earlier analysis and conversations (maybe we should add some comments in the code summarising these conclusions to avoid repeated effort). Chunking that initial kneighbors lookup would be great to avoid asymptotic memory use. I agree that we may not benefit otherwise from optimisations here.

@kno10
Contributor

kno10 commented Sep 17, 2018

Hi,
The predecessor arises naturally from reachability. It is the point from which you reached the new point. That seemed obvious to me from the notation and definition of reachability, but yes, it is not formally defined in the paper... and no, do not use the illustrative figure as a definition.

You do not need k-nearest neighbors at all. Not in batch, not one at a time. You can get the k-distance for free from the epsilon search (or, if there are fewer than k neighbors within epsilon, it is actually defined to be undefined, and you can simply use infinity for this purpose). That is how the OPTICS algorithm is defined. (And if the implementation departs from the original logic, it should be well documented that this is not exactly OPTICS but a variation.)

I don't see why knnquery + epsilon query should be much cheaper than just an epsilon query if we need most of the distances afterwards anyway. In particular not if you consider different distances. It may hold for the k-d-tree (which works on coordinates, not distances), but not for the ball tree, for example.

@qinhanmin2014
Member

I can see other redundant work in the code there (the pairwise_distances
call is unnecessary).

This isn't true.

@espg Maybe you can add more comments.

@jnothman
Member

I think @espg argued well.

You do not need k-nearest neighbors at all. Not in batch, not one at a time. You can get the k-distance for free from the epsilon search.

No, you can't get the kth neighbor distance for free from radius neighbors without sorting / partitioning.
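
For illustration only, a sketch of what that "partitioning" could look like: given the distances returned by one epsilon-range query around a point, the core (min_samples-th smallest) distance can be recovered with a linear-time partial selection rather than a full sort (names and the undefined-case handling are assumptions, not scikit-learn code):

import numpy as np

def core_distance_from_radius_query(dists, min_samples):
    """dists: distances of all neighbors within eps of one point.
    Returns the min_samples-th smallest distance, or inf if too few
    neighbors fall inside eps (the core distance is then undefined)."""
    if dists.shape[0] < min_samples:
        return np.inf
    return np.partition(dists, min_samples - 1)[min_samples - 1]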

@espg
Contributor

espg commented Sep 19, 2018

@kno10 most of the following text concerns worst case complexity when max_bound / eps is set to inf; that said, both implementations will of course run faster and perform fewer computations when that parameter is set lower.

you can get the k-distance for free from the epsilon search

This is true if you are calculating the full N**2 distance matrix (which you will if max_bound / eps is set to inf)... but it's not that you are getting it for 'free'; you are paying for an N by N distance matrix calculation and rolling the cost of the k_neighbors lookup into that data structure.

I don't see why knnquery + epsilon query should be much cheaper than just an epsilon query if we need most of the distances afterwards anyway.

Calculating only the upper triangle of the distance matrix reduces the number of distances that need to be calculated by half... at the cost of still needing computation for the core_dists via k-nearest neighbors. So we have (N*N)/2 + (K*N) distance calculations. If 'K' is less than N*0.5, it's cheaper to use knnquery + filtered_pairwise(epsilon query). Since 'K' is set by min_points, and since min_points is almost always much, much less than 'N', we have better runtime complexity for the vast majority of use cases. A dataset of a million samples called with min_points = 500 will need fewer distance calculations in our implementation; the same dataset called with min_points = 500,001 will require more calculations in our implementation... but I seriously doubt that there is any utility to running OPTICS with min_points set that high, as you are basically asking for one cluster.
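
The arithmetic behind that comparison, as a quick illustrative check:

N = 1_000_000
for K in (500, 500_001):
    knn_plus_upper_triangle = N * N / 2 + K * N
    full_matrix = N * N
    print(K, knn_plus_upper_triangle < full_matrix)
# 500      -> True  (fewer distance calculations than the full matrix)
# 500_001  -> False (more, once K exceeds N / 2)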

In particular not if you consider different distances.

The assumption is that this line:

indices = nbrs.radius_neighbors(P, radius=self.max_eps,
                                return_distance=False)[0]

...is cheap to call since we aren't asking for distances or sorting, and that we are able to retrieve the indices of these points without calculating distances or sorting. That assumption is valid for all distance metrics that obey the triangle inequality -- which is all of the named distance metrics in sklearn. For custom distance metrics that don't obey the triangle inequality, things will be much slower -- both because the unoptimized python code will be used for the actual distance calculation, and because the neighbors call will have to calculate distances to return knn-query indices, removing our ability to filter candidate points prior to calling pairwise. All this is to say that OPTICS will still work with custom distances that don't obey the triangle inequality, but it will be much slower.

@jnothman I think you're right about adding comments to the code so that we don't rehash the computational complexity arguments. I'll add proper docstrings to _calculate_optics_order and _set_reach_dist that reference the runtime and memory cost of the functions so that we don't accidentally regress on the performance. That will include the k*n memory complexity of the non-chunked core_distance lookups, which we can update/replace after we merge a PR that does the chunking for that.

@TomDLT
Member

TomDLT commented Sep 19, 2018

Note that #12103 does the chunking for the computation of
self.core_distances_ = nbrs.kneighbors(X, self.min_samples)[0][:, -1].
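
A rough sketch of the chunking idea (illustrative only, not the code from #12103): computing core distances in batches keeps peak memory at O(chunk_size * min_samples) instead of O(n * min_samples).

import numpy as np
from sklearn.neighbors import NearestNeighbors

def chunked_core_distances(X, min_samples, chunk_size=5000):
    nbrs = NearestNeighbors().fit(X)
    core_distances = np.empty(X.shape[0])
    for start in range(0, X.shape[0], chunk_size):
        stop = start + chunk_size
        dist, _ = nbrs.kneighbors(X[start:stop], n_neighbors=min_samples)
        core_distances[start:stop] = dist[:, -1]  # distance to the k-th neighbor
    return core_distances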

@kno10
Contributor

kno10 commented Sep 20, 2018

@espg In my experience, the possible savings from not computing the distances during the radius search are fairly small at least for a small query radius. If you set the query radius huge (which you shouldn't), then of course this can pay off (up to a factor of 2 in your case here). And on the other side, the index query may compute a lot of distances - possibly all of them - and then you end up recomputing them a second time, and may be 2 times slower.

Without an index (i.e., with brute force), it does this:

results = pairwise_distances_chunked(
    X, self._fit_X, reduce_func=reduce_func,
    metric=self.effective_metric_, n_jobs=self.n_jobs,
    **kwds)
if return_distance:
    dist_chunks, neigh_ind_chunks = zip(*results)
    dist_list = sum(dist_chunks, [])
    neigh_ind_list = sum(neigh_ind_chunks, [])
    # See https://github.com/numpy/numpy/issues/5456
    # if you want to understand why this is initialized this way.
    dist = np.empty(len(dist_list), dtype='object')
    dist[:] = dist_list
    neigh_ind = np.empty(len(neigh_ind_list), dtype='object')
    neigh_ind[:] = neigh_ind_list
    results = dist, neigh_ind

the only thing you get to save is the square root for Euclidean distance. So that is the base case, isn't it? Now you are telling the code to discard all these distances.

With indexes, things of course get much more complicated.

If a leaf node is completely within the query radius (frequent, if you have a huge query radius), you do indeed save distance computations:

# Case 2: all node points are within distance r
# add all points to neighbors
elif dist_UB <= r:
    if count_only:
        count += (node_info.idx_end - node_info.idx_start)
    else:
        for i in range(node_info.idx_start, node_info.idx_end):
            if (count < 0) or (count >= self.data.shape[0]):
                return -1
            indices[count] = idx_array[i]
            if return_distance:
                distances[count] = self.dist(pt, (data + n_features
                                                  * idx_array[i]),
                                             n_features)

But whenever a leaf is not completely within the query radius, all distances are computed anyway:
# ------------------------------------------------------------
# Case 3: this is a leaf node. Go through all points to
# determine if they fall within radius
elif node_info.is_leaf:
    reduced_r = self.dist_metric._dist_to_rdist(r)
    for i in range(node_info.idx_start, node_info.idx_end):
        dist_pt = self.rdist(pt, (data + n_features * idx_array[i]),
                             n_features)
        if dist_pt <= reduced_r:
            if (count < 0) or (count >= self.data.shape[0]):
                return -1
            if count_only:
                pass
            else:
                indices[count] = idx_array[i]
                if return_distance:
                    distances[count] = \
                        self.dist_metric._rdist_to_dist(dist_pt)

and you are, again, only saving the "sqrt" for Euclidean distance.

So you may be overestimating how many distance computations can be saved here, in particular for good parameters and non-Euclidean distance. Not sure if these optimizations are desirable or overoptimizing for a particular case...

And of course, much of this work was already done before once more, for kNN search.

Sorting cost, again, really does not matter. You can easily use a quick select, too.

@espg
Contributor

espg commented Sep 20, 2018

Not sure if these optimizations are desirable or overoptimizing for a particular case...

I think that the optimizations are appropriate for the most common use case. The default is that max_bound is set to inf, and I expect that most users will keep it that way. Setting max_bound to inf gives you the complete dendrogram, and is an appropriate choice if you don't know much about your data. Most users will stick with the defaults, and only change max_bound if they have particularly large datasets.

@kno10 I agree that things get complicated with spatial indices; you have to consider the cost of building the tree/index, and how it scales as a function of both the distance metric and the number of feature dimensions. This implementation is optimized to have good 'worst case' performance; the performance when using max_bound to truncate spatial queries seems to be broadly O(N log N) in the empirical benchmarks that we have run. It is entirely possible that different design choices would reflect a better 'C' constant and lower runtime when using max_bound... but again, I think that the bound on worst case complexity is more valuable since this is the default that we set on the algorithm.

@kno10
Contributor

kno10 commented Sep 20, 2018

If you want to optimize the eps=infinity case, do that properly: without a query (which will return everything... and will be O(N) per query, i.e., O(N²) total).

Right now, I see an implementation with very heuristic modifications to the original, published, algorithm with no proven support, and the risk of also negatively impacting run-time in other cases...

@kno10
Contributor

kno10 commented Sep 23, 2018

Storing the predecessor is easy, see above pull request.
