
[WIP] OPTICS include split points in clusters if not noise #12049


Closed

Conversation

adrinjalali
Member

Fixes #11677 (OPTICS detecting the wrong outlier)
See PR #11857 (related discussions)

Borrowing the concept of Xi-steep points and the corresponding parameter, this PR proposes including a split point in a cluster if it is a steep downward point. It also checks whether the last point of a cluster is noise.
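
A minimal sketch of the rule being proposed, assuming a 1-D reachability_plot array and the Xi parameter from the OPTICS paper (the helper name and values are illustrative, not the PR's actual code):

import numpy as np

def include_split_point(reachability_plot, s, xi=0.05):
    """Return True if point s is xi-steep downward, i.e. its reachability
    drops by at least a factor of (1 - xi) towards the next point."""
    return reachability_plot[s] * (1 - xi) >= reachability_plot[s + 1]

reachability_plot = np.array([2.0, 1.5, 0.4, 0.35, 0.3, 1.8])
print(include_split_point(reachability_plot, 1))  # True: split point joins the cluster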

@adrinjalali adrinjalali changed the title OPTICS include split points in clusters if not noise [MRG] OPTICS include split points in clusters if not noise Sep 10, 2018
@adrinjalali
Member Author

@kno10, with the current extraction method we also have the issue of not including the split points in the clusters. Here I'm proposing something similar to the xi-steep criterion to detect the split points which should be part of the current cluster, and using the same principle for the last point.

I'd really appreciate your feedback on this.

@espg, do you think you could add the predecessor while extracting reachability with reasonably little effort?

(#11677 (comment))

@adrinjalali The dbscan package states: "The OPTICSXi R implementation was directly ported from the ELKI framework's Java implementation". But the same technique can (and should) be used with any other plot-based extraction technique.
The article mentioned above http://ceur-ws.org/Vol-2191/paper37.pdf explains why the "last" point of a valley may or may not belong to the cluster, and how to use the predecessor information to distinguish these two cases. If you always include the last point, you'll get "spike" artifacts.

@espg
Contributor

espg commented Sep 12, 2018

@adrinjalali I think so. It's likely a pretty small change to store this, but I'll need to look through the referenced paper to make sure that I understand the definition of the predecessor. My guess is that it will take more time to do that than to actually code it... @kno10 can probably do this faster, or alternatively provide some guidance on reviewing the change.

@adrinjalali adrinjalali changed the title [MRG] OPTICS include split points in clusters if not noise [WIP] OPTICS include split points in clusters if not noise Sep 15, 2018
s + 1, node.end, node)

# check if s is xi-steep downward
if reachability_plot[s] * (1 - xi) >= reachability_plot[s + 1]:

adrinjalali (Member Author) commented on this diff:

use proper _steep_upward and/or _steep_downward instead.

@espg
Contributor

espg commented Sep 16, 2018

@adrinjalali I'm about halfway done reading the reference paper. We may be able to add predecessor as a post-processing step on the reachability graph itself, rather than in the main computation loop (i.e., by recursively searching backwards for the last instance of reachability that exceeds the current reachability value)... still wrapping my head around the exact formalism of the predecessor value, and the best (cleanest?) way to include it in the OPTICS output.

@adrinjalali
Member Author

Thanks @espg. At least to test the parts I'd be writing using the predecessor, I'd be fine starting with a slow version and then improving the performance if possible, as long as the results are consistent.

@kno10
Contributor

kno10 commented Sep 16, 2018

@espg you cannot just use the previous point with higher reachability on the plot as predecessor. You need to store the actual edge of the spanning tree used to distinguish these cases.
I haven't looked at the sklearn implementation (and, e.g., with DBSCAN the sklearn version is using a different, potentially much more memory intensive, algorithm instead of the original DBSCAN). But it should be easy if your code is close to the original OPTICS: just add another column to store the predecessor.
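
A hedged sketch of what "another column" could look like inside an OPTICS-style expansion loop; the names (reachability_, predecessor_, core_dist_p) are illustrative and not taken from the scikit-learn code:

import numpy as np

def update_reachability(reachability_, predecessor_, core_dist_p, p, neighbors, dists):
    """While expanding point p, every unprocessed neighbor whose reachability
    improves is now reached *from* p, so p is recorded as its predecessor."""
    new_reach = np.maximum(core_dist_p, dists)
    improved = new_reach < reachability_[neighbors]
    reachability_[neighbors[improved]] = new_reach[improved]
    predecessor_[neighbors[improved]] = p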

@kno10
Contributor

kno10 commented Sep 16, 2018

As a side note, I doubt this claim is correct: "Optimized for usage on large point datasets."

It will actually do twice as much work as necessary. First it computes the k-NN distance for each point, stores that, and then later again performs a radius query for each point. In particular, it will temporarily need O(N*k) memory, while the original OPTICS only needs O(N) memory, and only radius queries (so I would expect it to be faster, why query twice?).

I doubt this improves scalability...

Above claim should either be explained (how is this "optimized"?) or removed.

@espg
Contributor

espg commented Sep 16, 2018 via email

@jnothman
Member

jnothman commented Sep 16, 2018 via email

@kno10
Contributor

kno10 commented Sep 16, 2018

This will still need O(n*k) memory, won't it? min_samples may need to be 100+ for large data sets.
Plus it takes extra time to do this and the radius searches.

self.core_distances_[:] = nbrs.kneighbors(X,
                                           self.min_samples)[0][:, -1]

@espg
Contributor

espg commented Sep 17, 2018

@kno10 Most of your paper discusses how to use predecessor information, and there is very little information regarding how you actually extract it. There also isn't a formal definition of what predecessor is as related to the reference OPTICS algorithm. Predecessor to what? Predecessor to ExpandClusterOrder? Predecessor to OrderSeeds? You do reproduce an abbreviated version of the OPTICS pseudo code, but don't include any reference in the pseudo code to where predecessor is defined or extracted for your contribution.

I'd like to know why we can't extract it from the reachability plot directly. This is the conceptual diagram that seems to provide the closest formalism to what predecessor actually is:

[figure: illustration of the predecessor from the referenced paper]

This seems to imply that we can start at the end of the reachability plot, and search backwards for the last instance that exceeded the reachability distance of the current point; the index of that point would be the predecessor... repeating this moving backwards would yield predecessors for each point. This is a simple linear scan that isn't terribly hard to implement. Then we'd simply check if an end point on the reachability graph has a predecessor that belongs to the current cluster, or to a different cluster.
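
A rough, naive sketch of the backward-scan heuristic described here (purely illustrative; as kno10 replies below, this is not necessarily the true OPTICS predecessor, which comes from the spanning-tree edge actually used):

import numpy as np

def backward_scan_predecessors(reachability_plot):
    """For each position i in the cluster order, find the most recent earlier
    position whose reachability exceeds reachability_plot[i]."""
    n = len(reachability_plot)
    predecessor = np.full(n, -1)
    for i in range(1, n):
        for j in range(i - 1, -1, -1):
            if reachability_plot[j] > reachability_plot[i]:
                predecessor[i] = j
                break
    return predecessor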

@jnothman
Member

jnothman commented Sep 17, 2018 via email

@espg
Contributor

espg commented Sep 17, 2018

For what it's worth, this was considered at some point during the original PR. There is object overhead for calling the distance function in the python loop... it was far cheaper in terms of runtime to do the core_distances all at once than to do it in the loop. Also, doing it this way enables n_jobs to be set to use multiprocessing, parallelizing a portion of an algorithm that is generally not parallelizable. That said, it's probably worth looking at other solutions to this... but we should benchmark them to make sure that they actually improve the algorithm performance.

@jnothman
Member

jnothman commented Sep 17, 2018 via email

@espg
Contributor

espg commented Sep 17, 2018

I should be asleep right now, but there are things that have been mentioned in this thread that are distracting me...

I can see other redundant work in the code there (the pairwise_distances
call is unnecessary).

This isn't true.

indices = nbrs.radius_neighbors(P, radius=self.max_eps,
                                return_distance=False)[0]

The line above calls a neighborhood function without calculating distances or sorting results. It is much faster than calling radius_neighbors normally, and is a prime benefit of passing the data in a spatially indexed form. This line:

unproc = np.compress((~np.take(processed, indices)).ravel(),
                     indices, axis=0)

filters out the points that have already been processed. Finally, this line:

dists = pairwise_distances(P, np.take(X, unproc, axis=0),
                           self.metric, n_jobs=1).ravel()

calculates the distances. We have to filter out points that have already been processed -- it's part of the OPTICS algorithm. Calling radius_neighbors "normally" will automatically call pairwise distances to return the distances. We are suppressing the distance calculations until after the points have been pruned to only the unprocessed candidates. We are not performing redundant distance calculations. Assuming max_bound = inf, the three lines above cause us to calculate the upper triangle of the distance matrix; if we didn't have the pairwise call and just used radius_neighbors on its own, we would calculate the full distance matrix. We used to do this; it was changed and benchmarked, and found to be ~70 percent faster. Again, the removal of the processed points has to happen anyway -- we might as well not calculate the distances for the points that aren't candidates.
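
Pulled together, a hedged sketch of the three steps just described (assuming nbrs is a fitted NearestNeighbors index, processed is a boolean mask, and P is a single sample of shape (1, n_features); the helper name is illustrative):

import numpy as np
from sklearn.metrics import pairwise_distances

def distances_to_unprocessed(P, X, nbrs, processed, max_eps, metric):
    # 1) index-only range query: no distances computed, no sorting
    indices = nbrs.radius_neighbors(P, radius=max_eps, return_distance=False)[0]
    # 2) drop neighbors that OPTICS has already processed
    unproc = np.compress(~np.take(processed, indices), indices, axis=0)
    # 3) compute distances only for the remaining candidates
    dists = pairwise_distances(P, np.take(X, unproc, axis=0), metric=metric).ravel()
    return unproc, dists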

As for this point:

This will still need O(n*k) memory, won't it? min_samples may need to be 100+ for large data sets.
Plus it takes extra time to do this and the radius searches.

I don't really see how you can get around doing the radius searches and the k-neighbors searches. You don't know a priori what the epsilon / max bound value is going to be. Maybe epsilon is large enough so that the range query will include the core distance point, but we have no way of knowing ahead of time what the user is going to select for either min_points or max_bound. If you have a dataset that has 1,000,000 points, and you want to set min_points = 100, then you have to do a million k=100 nearest neighbor queries to find out the core distances. There isn't any way around that -- you need to select the k=100 neighbors, and then calculate the distance to all of them to sort and determine the 100th closest point's index and its distance. You can do this in the loop if you want, or you can do it en masse at the beginning of the algorithm.

If you do it in the loop, you reduce memory. Maybe that matters to the user, maybe it doesn't -- a million points queried en masse for the 100 nearest neighbors will use 8GB of RAM, which is likely fine for most users.

If you have enough RAM, it's substantially faster to do it en masse. Rather than calling the same python function a million times, you just call it once and avoid a massive amount of python object overhead. Plus, you can use multiprocessing. Using multiprocessing to calculate 100 million distances in the same call is great -- you can't do that in the loop, because you'd set up and tear down multiprocessing a million times to multiprocess 100 distance lookups.

So yes, I think that chunking the mass lookup of core_distances is a great idea. But I don't think that we should move it back inside the main loop given that we know it will be slower. And I don't think that you can merge the core_distance lookups with the radius query without a substantial number of assumptions that may not hold -- one of which would probably require calculating the full distance matrix, imposing the ~70% performance penalty again.

@jnothman
Member

Okay. I see what you're getting at, and I finally recall that we're going round in circles with respect to earlier analysis and conversations (maybe we should add some comments in the code summarising these conclusions to avoid repeated effort). Chunking that initial kneighbors lookup would be great to avoid asymptotic memory use. I agree that we may not benefit otherwise from optimisations here.

@kno10
Contributor

kno10 commented Sep 17, 2018

Hi,
The predecessor arises naturally from reachability. It is the point from which you reached the new point. That seemed obvious to me from the notation and definition of reachability, but yes, it is not formally defined in the paper... and no, do not use the illustrative figure as a definition.

You do not need k-nearest neighbors at all. Not in batch, not one at a time. You can get the k-distance for free from the epsilon search (or, if there are fewer than k neighbors within epsilon, it is actually defined to be undefined, and you can simply use infinity for this purpose). That is how the OPTICS algorithm is defined. (And if the implementation departs from the original logic, it should be well documented that this is not exactly OPTICS but a variation.)

I don't see why knnquery + epsilon query should be much cheaper than just an epsilon query if we need most of the distances afterwards anyway. In particular not if you consider different distances. It may hold for the k-d-tree (which works on coordinates, not distances), but not for the ball tree, for example.

@qinhanmin2014
Member

I can see other redundant work in the code there (the pairwise_distances
call is unnecessary).

This isn't true.

@espg Maybe you can add more comments.

@jnothman
Member

I think @espg argued well.

You do not need k-nearest neighbors at all. Not in batch, not one at a time. You can get the k-distance for free from the epsilon search.

No, you can't get the kth neighbor distance for free from radius neighbors without sorting / partitioning.
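
For illustration only, a sketch of what that "partitioning" could look like: given the distances returned by one epsilon-range query around a point, the core (min_samples-th smallest) distance can be recovered with a linear-time partial selection rather than a full sort (names and the undefined-case handling are assumptions, not scikit-learn code):

import numpy as np

def core_distance_from_radius_query(dists, min_samples):
    """dists: distances of all neighbors within eps of one point.
    Returns the min_samples-th smallest distance, or inf if too few
    neighbors fall inside eps (the core distance is then undefined)."""
    if dists.shape[0] < min_samples:
        return np.inf
    return np.partition(dists, min_samples - 1)[min_samples - 1]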

@espg
Contributor

espg commented Sep 19, 2018

@kno10 most of the following text concerns worst case complexity when max_bound / eps is set to inf; that said, both implementations will of course run faster and perform fewer computations when that parameter is set lower.

you can get the k-distance for free from the epsilon search

This is true if you are calculating the full N**2 distance matrix (which you will if max_bound / eps is set to inf)... but it's not that you are getting it for 'free'; you are paying for an N by N distance matrix calculation and rolling the cost of the k_neighbors lookup into that data structure.

I don't see why knnquery + epsilon query should be much cheaper than just an epsilon query if we need most of the distances afterwards anyway.

Calculating only the upper triangle of the distance matrix reduces the number of distances that need to be calculated by half... at the cost of still needing computation for the core_dists via k-nearest neighbors. So we have (N*N)/2 + (K*N) distance calculations. If 'K' is less than N*0.5, it's cheaper to use knnquery + filtered_pairwise(epsilon query). Since 'K' is set by min_points, and since min_points is almost always much, much less than 'N', we have better runtime complexity for the vast majority of use cases. A dataset of a million samples called with min_points = 500 will need fewer distance calculations in our implementation; the same dataset called with min_points = 500,001 will require more calculations in our implementation... but I seriously doubt that there is any utility to running OPTICS with min_points set that high, as you are basically asking for one cluster.
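
The arithmetic behind that comparison, as a quick illustrative check:

N = 1_000_000
for K in (500, 500_001):
    knn_plus_upper_triangle = N * N / 2 + K * N
    full_matrix = N * N
    print(K, knn_plus_upper_triangle < full_matrix)
# 500      -> True  (fewer distance calculations than the full matrix)
# 500_001  -> False (more, once K exceeds N / 2)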

In particular not if you consider different distances.

The assumption is that this line:

indices = nbrs.radius_neighbors(P, radius=self.max_eps,
                                return_distance=False)[0]

...is cheap to call since we aren't asking for distances or sorting, and that we are able to retrieve the indices of these points without calculating distances or sorting. That assumption is valid for all distance metrics that obey the triangle inequality -- which is all of the named distance metrics in sklearn. For custom distance metrics that don't obey the triangle inequality, things will be much slower -- both because the unoptimized python code will be used for the actual distance calculation, and because the neighbors call will have to calculate distances to return knn-query indices, removing our ability to filter candidate points prior to calling pairwise. All this is to say that OPTICS will still work with custom distances that don't obey the triangle inequality, but it will be much slower.

@jnothman I think you're right about adding comments to the code so that we don't rehash the computational complexity arguments. I'll add proper docstrings to _calculate_optics_order and _set_reach_dist that reference the runtime and memory cost of the functions so that we don't accidentally regress on the performance. That will include the k*n memory complexity of the non-chunked core_distance lookups, which we can update/replace after we merge a PR that does the chunking for that.

@TomDLT
Member

TomDLT commented Sep 19, 2018

Note that #12103 does the chunking for the computation of
self.core_distances_ = nbrs.kneighbors(X, self.min_samples)[0][:, -1].
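
A rough sketch of the chunking idea (illustrative only, not the code from #12103): computing core distances in batches keeps peak memory at O(chunk_size * min_samples) instead of O(n * min_samples).

import numpy as np
from sklearn.neighbors import NearestNeighbors

def chunked_core_distances(X, min_samples, chunk_size=5000):
    nbrs = NearestNeighbors().fit(X)
    core_distances = np.empty(X.shape[0])
    for start in range(0, X.shape[0], chunk_size):
        stop = start + chunk_size
        dist, _ = nbrs.kneighbors(X[start:stop], n_neighbors=min_samples)
        core_distances[start:stop] = dist[:, -1]  # distance to the k-th neighbor
    return core_distances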

@kno10
Contributor

kno10 commented Sep 20, 2018

@espg In my experience, the possible savings from not computing the distances during the radius search are fairly small at least for a small query radius. If you set the query radius huge (which you shouldn't), then of course this can pay off (up to a factor of 2 in your case here). And on the other side, the index query may compute a lot of distances - possibly all of them - and then you end up recomputing them a second time, and may be 2 times slower.

Without an index (i.e., with brute force), it does this:

results = pairwise_distances_chunked(
    X, self._fit_X, reduce_func=reduce_func,
    metric=self.effective_metric_, n_jobs=self.n_jobs,
    **kwds)
if return_distance:
    dist_chunks, neigh_ind_chunks = zip(*results)
    dist_list = sum(dist_chunks, [])
    neigh_ind_list = sum(neigh_ind_chunks, [])
    # See https://github.com/numpy/numpy/issues/5456
    # if you want to understand why this is initialized this way.
    dist = np.empty(len(dist_list), dtype='object')
    dist[:] = dist_list
    neigh_ind = np.empty(len(neigh_ind_list), dtype='object')
    neigh_ind[:] = neigh_ind_list
    results = dist, neigh_ind

the only thing you get to save is the square root for Euclidean distance. So that is the base case, isn't it? Now you are telling the code to discard all these distances.

With indexes, things of course get much more complicated.

If a leaf node is completely within the query radius (frequent, if you have a huge query radius), you do indeed save distance computations:

# Case 2: all node points are within distance r
# add all points to neighbors
elif dist_UB <= r:
    if count_only:
        count += (node_info.idx_end - node_info.idx_start)
    else:
        for i in range(node_info.idx_start, node_info.idx_end):
            if (count < 0) or (count >= self.data.shape[0]):
                return -1
            indices[count] = idx_array[i]
            if return_distance:
                distances[count] = self.dist(pt, (data + n_features
                                                  * idx_array[i]),
                                             n_features)

But whenever a leaf is not completely within the query radius, all distances are computed anyway:
# ------------------------------------------------------------
# Case 3: this is a leaf node. Go through all points to
# determine if they fall within radius
elif node_info.is_leaf:
    reduced_r = self.dist_metric._dist_to_rdist(r)
    for i in range(node_info.idx_start, node_info.idx_end):
        dist_pt = self.rdist(pt, (data + n_features * idx_array[i]),
                             n_features)
        if dist_pt <= reduced_r:
            if (count < 0) or (count >= self.data.shape[0]):
                return -1
            if count_only:
                pass
            else:
                indices[count] = idx_array[i]
                if return_distance:
                    distances[count] = \
                        self.dist_metric._rdist_to_dist(dist_pt)

and you are, again, only saving the "sqrt" for Euclidean distance.

So you may be overestimating how many distance computations can be saved here, in particular for good parameters and non-Euclidean distance. Not sure if these optimizations are desirable or overoptimizing for a particular case...

And of course, much of this work was already done before once more, for kNN search.

Sorting cost, again, really does not matter. You can easily use a quick select, too.

@espg
Contributor

espg commented Sep 20, 2018

Not sure if these optimizations are desirable or overoptimizing for a particular case...

I think that the optimizations are appropriate for the most common use case. The default is that max_bound is set to inf, and I expect that most users will keep it that way. Setting max_bound to inf gives you the complete dendrogram, and is an appropriate choice if you don't know much about your data. Most users will stick with the defaults, and only change max_bound if they have particularly large datasets.

@kno10 I agree that things get complicated with spatial indices; you have to consider the cost of building the tree/index, and how it scales as a function of both the distance metric and the number of feature dimensions. This implementation is optimized to have good 'worst case' performance; the performance when using max_bound to truncate spatial queries seems to be broadly O(N log N) in the empirical benchmarks that we have run. It is entirely possible that different design choices would reflect a better 'C' constant and lower runtime when using max_bound... but again, I think that the bound on worst case complexity is more valuable since this is the default that we set on the algorithm.

@kno10
Contributor

kno10 commented Sep 20, 2018

If you want to optimize the eps=infinity case, do that properly: without a query (which will return everything... and will be O(N) per query, i.e., O(N²) total).

Right now, I see an implementation with very heuristic modifications to the original, published, algorithm with no proven support, and the risk of also negatively impacting run-time in other cases...

@kno10
Contributor

kno10 commented Sep 23, 2018

Storing the predecessor is easy, see above pull request.
