[WIP] OPTICS include split points in clusters if not noise #12049
Conversation
@kno10, for the current extraction method, we also have the issue of not including the split points in the clusters as it stands. Here I'm proposing something similar to the xi-steep approach, to detect the split points which should be part of the current cluster, and to use the same principle for the last point. I'd really appreciate your feedback on this. @espg, do you think you could add the
|
@adrinjalali I think so. It's likely a pretty small change to store this, but I'll need to look through the referenced paper to make sure that I understand the definition of the predecessor. My guess is that it will take more time to do that than to actually code it... @kno10 can probably do this faster, or alternatively provide some guidance on reviewing the change. |
s + 1, node.end, node)

# check if s is xi-steep downward
if reachability_plot[s] * (1 - xi) >= reachability_plot[s + 1]:
use proper `_steep_upward` and/or `_steep_downward` instead.
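For concreteness, a minimal sketch of what such helpers might look like (the helper names come from the comment above; the steepness conditions follow the xi-steep definition from the OPTICS paper, and `reachability_plot` / `xi` are assumed to be as in the surrounding code):

```python
def _steep_downward(reachability_plot, p, xi):
    # Point p is xi-steep downward if its successor's reachability
    # is lower by at least a factor of (1 - xi).
    return reachability_plot[p] * (1 - xi) >= reachability_plot[p + 1]


def _steep_upward(reachability_plot, p, xi):
    # Point p is xi-steep upward if its successor's reachability
    # is higher by at least a factor of 1 / (1 - xi).
    return reachability_plot[p] <= reachability_plot[p + 1] * (1 - xi)
```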
@adrinjalali I'm about halfway done reading the reference paper. We may be able to add |
Thanks @espg, At least to test the parts I'd be writing using the predecessor, I'd be fine starting with the slow version, and then improving the performance if possible, as long as they're consistent. |
@espg you cannot just use the previous point with higher reachability on the plot as predecessor. You need to store the actual edge of the spanning tree used to distinguish these cases. |
As a side note, I doubt this claim is correct: "Optimized for usage on large point datasets." It will actually do twice as much work as necessary. First it computes the k-NN distance for each point, stores that, and then later again performs a radius query for each point. In particular, it will temporarily need O(N*k) memory, while the original OPTICS only needs O(N) memory, and only radius queries (so I would expect it to be faster, why query twice?). I doubt this improves scalability... Above claim should either be explained (how is this "optimized"?) or removed. |
It does not compute all pairwise distances at once. If it did, I would not
be able to run 100 million point datasets on my laptop. DBSCAN usually
crashes at sets above 40k because of memory usage. Yes, with max_bound set
to inf, you will calculate all pairwise distances, that is true-- of any
OPTICS implementation-- but the distance matrix is done one row at a time
instead of all at once, so memory usage is O(n), not n-squared.
As I said, I have not finished reading your paper, but it is not clear to
me where predecessor comes from. There is the main OPTICS loop, and we
could record the point index every time that expand cluster order (also
called expand seeds) is called, but this wouldn't assign predecessor to
every point...although we could of course repeat the predecessor value for
every subsequent point until expand cluster order is called again.
Alternatively, we could assign the predecessor as the point index prior to
sorting the reachability candidates by their reachability distance. I'm not
sure which is correct yet, although if you have some insight on this, I'd
certainly appreciate it.
|
As @espg said, we are not computing all neighborhoods at once. Very
intentionally this has departed from the runtime-optimised DBSCAN
implementation here.
|
This will still need O(n*k) memory, won't it? min_samples may need to be 100+ for large data sets. scikit-learn/sklearn/cluster/optics_.py Lines 381 to 382 in 1957020
|
@kno10 Most of your paper discusses how to use predecessor information, and there is very little information regarding how you actually extract it. There also isn't a formal definition of what predecessor is as related to the reference OPTICS algorithm. Predecessor to what? Predecessor to ExpandClusterOrder? Predecessor to OrderSeeds? You do reproduce an abbreviated version of the OPTICS pseudo code, but don't include any reference in the pseudo code to where predecessor is defined or extracted for your contribution. I'd like to know why we can't extract it from the reachability plot directly. Your conceptual diagram seems to provide the closest formalism to what predecessor actually is: it seems to imply that we can start at the end of the reachability plot, and search backwards for the last instance that exceeded the reachability distance of the current point; the index of that point would be the predecessor... repeating this moving backwards would yield predecessors for each point. This is a simple scan that isn't terribly hard to implement. Then we'd simply check if an end point on the reachability graph has a predecessor that belongs to the current cluster, or to a different cluster. |
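To make the backward-scan idea above concrete, a rough sketch (a hypothetical helper, not code from this PR; note @kno10's earlier point that a plot-based guess is not necessarily the same as the spanning-tree predecessor):

```python
import numpy as np

def predecessors_from_plot(reachability_plot):
    # For each position in the cluster ordering, walk backwards and take
    # the most recent earlier position whose reachability exceeds the
    # current point's reachability; -1 means no predecessor was found.
    n = len(reachability_plot)
    predecessor = np.full(n, -1, dtype=int)
    for i in range(1, n):
        for j in range(i - 1, -1, -1):
            if reachability_plot[j] > reachability_plot[i]:
                predecessor[i] = j
                break
    return predecessor
```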
This will still need O(n*k) memory, won't it? min_samples may need to be 100+ for large data sets. Plus it takes extra time to do this *and* the radius searches.
You're quite right. That kneighbors call looks entirely wasted. We should
just be calling radius_neighbors on each point (or a chunk of points at a
time), which will either return at least min_samples results, from which
the core distance and the reachability distance can be set, or will return
fewer points.
I can see other redundant work in the code there (the pairwise_distances
call is unnecessary). Annoyed at myself for missing this.
|
For what it's worth, this was considered at some point during the original PR. There is object overhead for calling the distance function in the python loop... it was far cheaper in terms of runtime to do the core_distances all at once than to do it in the loop. Also, doing it this way enables |
I suppose that if many points do not have min_samples neighbors within
max_eps then you are right that we will spend extra time getting their core
distance. Otherwise (and acknowledging that max_eps = np.inf by default) I
can't see how precomputing core distances can help much. But if it can, we
should chunk it to avoid O(n * k) memory. The only advantage I can see is
that running radius_neighbors will not find the (min_samples -1)th element,
as it does not sort its results.
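A rough sketch of the chunking idea (illustrative only; `chunk_size` and the helper name are made up here), which keeps peak memory at O(chunk_size * min_samples) instead of O(n * min_samples):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def chunked_core_distances(X, min_samples, chunk_size=10000):
    nbrs = NearestNeighbors(n_neighbors=min_samples).fit(X)
    core_distances = np.empty(X.shape[0])
    for start in range(0, X.shape[0], chunk_size):
        stop = min(start + chunk_size, X.shape[0])
        # kneighbors returns sorted distances; the last column is the
        # distance to the min_samples-th neighbor, i.e. the core distance.
        dist, _ = nbrs.kneighbors(X[start:stop])
        core_distances[start:stop] = dist[:, -1]
    return core_distances
```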
|
I should be asleep right now, but there are things that have been mentioned in this thread that are distracting me...
This isn't true. scikit-learn/sklearn/cluster/optics_.py Lines 423 to 424 in 1957020
The line above calls a neighborhood function without calculating distances or sorting results. It is much faster then calling radius neighbors normally, and is prime benefit of passing the data in a spatially indexed form. This line: scikit-learn/sklearn/cluster/optics_.py Lines 427 to 428 in 1957020
Filters out the points that have been already processed. Finally, this line: scikit-learn/sklearn/cluster/optics_.py Lines 434 to 435 in 1957020
Calculates the distances. We have to filter out points that have already been processed-- it's part of the OPTICS algorithm. Calling radius neighbors "normally" will automatically call pairwise distances to return the distances. We are suppressing the distance calculations until after the points have been pruned to only unprocessed candidates. We are not performing redundant distance calculations. Assuming max bound = inf, the three lines above cause us to calculate the upper triangle of the distance matrix; if we didn't have the pairwise call and just used radius neighbors on its own, we would calculate the full distance matrix. We used to do this; it was changed and benchmarked, and found to be ~70 percent faster. Again, the removal of the processed points has to happen anyway-- we might as well not calculate the distances for the points that aren't candidates. As for this point:
I don't really see how you can get around doing the radius searches and the k-neighbors searches. You don't know a priori what the epsilon / max bound value is going to be. Maybe epsilon is large enough so that the range query will include the core distance point, but we have no way of knowing ahead of time what the user is going to select for either min_points or max_bound. If you have a dataset that has 1,000,000 points, and you want to set min_points = 100, then you have to do a million k=100 nearest neighbor queries to find out the core distances. There isn't any way around that-- you need to select the k=100 neighbors, and then calculate the distance of all of them to sort and then determine the 100th closest point index and its distance. You can do this in the loop if you want, or you can do it en masse at the beginning of the algorithm. If you do it in the loop, you reduce memory. Maybe that matters to the user, maybe it doesn't-- a million points queried en masse for the 100 nearest neighbors will use 8GB of ram, which is likely fine for most users. If you have enough ram, it's substantially faster to do it en masse. Rather than calling the same python function a million times, you just call it once and avoid a massive amount of python object overhead. Plus, you can use multiprocessing. Using multiprocessing to calculate 100 million distances in the same call is great-- you can't do that in the loop, because you'd set up and tear down multiprocessing a million times to multiprocess 100 distance lookups. So yes, I think that chunking the mass lookup of core_distances is a great idea. But I don't think that we should move it back inside the main loop given that we know that it will be slower. And I don't think that you can merge the core_distance lookups with the radius query without a substantial amount of assumptions that may not be true. One of which would probably require calculating distances for the full distance matrix and already imposing a 70% performance penalty. |
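For readers following along, a simplified sketch of the three-step pattern described above (a hypothetical helper, not the actual optics_.py code; it assumes a scikit-learn KDTree built on X and a boolean `processed` mask):

```python
import numpy as np
from sklearn.neighbors import KDTree
from sklearn.metrics import pairwise_distances

def unprocessed_neighbor_distances(tree, X, point_index, radius, processed):
    # 1) Range query returning only indices: no distances, no sorting.
    ind = tree.query_radius(X[point_index:point_index + 1], r=radius,
                            return_distance=False)[0]
    # 2) Drop points that OPTICS has already processed.
    candidates = ind[~processed[ind]]
    # 3) Compute distances only for the surviving candidates.
    dists = pairwise_distances(X[point_index:point_index + 1],
                               X[candidates]).ravel()
    return candidates, dists

# Example wiring:
# tree = KDTree(X); processed = np.zeros(len(X), dtype=bool)
```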
Okay. I see what you're getting at, and I finally recall that we're going round in circles with respect to earlier analysis and conversations (maybe we should add some comments in the code summarising these conclusions to avoid repeated effort). Chunking that initial kneighbors lookup would be great to avoid asymptotic memory use. I agree that we may not benefit otherwise from optimisations here. |
Hi, You do not need k-nearest neighbors at all. Not in batch, not one at a time. You can get the k-distance for free from the epsilon search (or it is actually defined to be undefined; and you can simply use infinity for this purpose). That is how the OPTICS algorithm is defined. (And if the implementation departs from the original logic, it should be well documented that this is not exactly OPTICS but a variation). I don't see why knnquery + epsilon query should be much cheaper than just an epsilon query if we need most of the distances afterwards anyway. In particular not if you consider different distances. It may hold for the k-d-tree (which works on coordinates, not distances), but not for the ball tree, for example. |
@espg Maybe you can add more comments. |
I think @espg argued well.
No you can't get the kth neighbor distance for free from radius neighbors without sorting / partitioning. |
@kno10 most of the following text concerns worst case complexity when max_bound / eps is set to inf; that said, both implementations will of course run faster and do fewer computations when that parameter is set lower.
This is true if you are calculating the full N**2 distance matrix (which you will if max_bound / eps is set to inf)... but it's not that you are getting it for 'free'; you are paying for an N by N distance matrix calculation and rolling the cost of the k_neighbors into that data structure.
Calculating only the upper triangle of the distance matrix reduces the number of distances that need to be calculated by half... at the cost of still needing computation for the core_dists via k-nearest neighbors. So we have [O(N*N) / 2] + (K*N) distance calculations. If 'K' is less than N*0.5, it's cheaper to use knnquery + filtered_pairwise(epsilon query). Since 'K' is set by min_points, and since min_points is almost always much, much less than 'N', we have better runtime complexity for the vast majority of use cases. A dataset of a million samples called with min_points = 500 will need fewer distance calculations in our implementation; the same dataset called with min_points = 500,001 will require more calculations in our implementation... but I seriously doubt that there is any utility to running OPTICS with min_points set that high, as you are basically asking for one cluster.
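As a quick back-of-the-envelope check of the counts above (plain arithmetic with the numbers from this comment):

```python
N = 1_000_000   # samples
K = 500         # min_points

full_matrix = N * N                       # radius queries that also compute distances, eps = inf
upper_triangle_plus_knn = N * N // 2 + K * N

print(f"{full_matrix:.3e}")               # 1.000e+12 distance computations
print(f"{upper_triangle_plus_knn:.3e}")   # 5.005e+11, i.e. roughly half
```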
The assumption is that this line: scikit-learn/sklearn/cluster/optics_.py Lines 423 to 424 in 1957020
...is cheap to call since we aren't asking for distances or sorting, and that we are able to retrieve the indices of these points without calculating distances or sorting. That assumption is valid for all distance metrics that obey the triangle inequality-- which is all of the named distance metrics in sklearn. For custom distance metrics that don't obey the triangle inequality, things will be much slower-- both because the unoptimized python code will be used for the actual distance calculation, and because the neighbors call will have to calculate distances to return knn-query indices, removing our ability to filter candidate points prior to calling pairwise. All this is to say that OPTICS will still work with custom distances that don't obey the triangle inequality, but it will be much slower. @jnothman I think you're right about adding comments into the code so that we don't rehash the computational complexity arguments. I'll add proper doc strings to |
Note that #12103 does the chunking for the computation of |
@espg In my experience, the possible savings from not computing the distances during the radius search are fairly small at least for a small query radius. If you set the query radius huge (which you shouldn't), then of course this can pay off (up to a factor of 2 in your case here). And on the other side, the index query may compute a lot of distances - possibly all of them - and then you end up recomputing them a second time, and may be 2 times slower. Without scikit-learn/sklearn/neighbors/base.py Lines 689 to 703 in 1049fb1
the only thing you get to save is the square root for Euclidean distance. So that is the base case, isn't it? Now you are telling the code to discard all these distances. With indexes, things of course get much more complicated. If a leaf node is completely within the query radius (frequent, if you have a huge query radius), you do indeed save distance computations: scikit-learn/sklearn/neighbors/binary_tree.pxi Lines 2081 to 2094 in 3a3adab
But whenever a leaf is not completely within the query radius, all distances are computed anyway: scikit-learn/sklearn/neighbors/binary_tree.pxi Lines 2097 to 2115 in 3a3adab
and you are, again, only saving the "sqrt" for Euclidean distance. So you may be overestimating how many distance computations can be saved here, in particular for good parameters and non-Euclidean distance. Not sure if these optimizations are desirable or overoptimizing for a particular case... And of course, much of this work was already done once before, for the kNN search. Sorting cost, again, really does not matter. You can easily use a quick select, too. |
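A small sketch of the quick-select point at the end (a hypothetical helper): `np.partition` gives the min_samples-th smallest distance from an unsorted radius-query result in linear time, without a full sort.

```python
import numpy as np

def core_distance_from_radius_result(dists, min_samples):
    # `dists` is assumed to be the unsorted distances returned by a radius
    # query around one point (including the point itself). If fewer than
    # min_samples neighbors were found, the core distance is undefined.
    if len(dists) < min_samples:
        return np.inf
    # np.partition performs a linear-time selection of the k-th element.
    return np.partition(dists, min_samples - 1)[min_samples - 1]
```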
I think that the optimizations are appropriate for the most common use case. The default is that max_bound is set to inf, and I expect that most users will keep it that way. Setting max_bound to inf gives you the complete dendrogram, and is an appropriate choice if you don't know much about your data. Most users will stick with the defaults, and only change max_bound if they have particularly large datasets. @kno10 I agree that things get complicated with spatial indices; you have to consider the cost of building the tree/index, and how it scales as a function of both the distance metric and the number of feature dimensions. This implementation is optimized to have good worst-case performance; the performance when using max_bound to truncate spatial queries seems to be broadly O(N log N) in the empirical benchmarks that we have run. It is entirely possible that different design choices would reflect a better 'C' constant and lower runtime when using max_bound... but again, I think that the bound on worst-case complexity is more valuable since this is the default that we set on the algorithm. |
If you want to optimize the eps=infinity case, do that properly: without a query (that will return everything... and will be in O(N) per query, i.e., O(n²) total.) Right now, I see an implementation with very heuristic modifications to the original, published, algorithm with no proven support, and the risk of also negatively impacting run-time in other cases... |
Storing the predecessor is easy, see above pull request. |
Fixes #11677 (OPTICS detecting the wrong outlier)
See PR #11857 (related discussions)
Borrowing the concept of Xi-steep points and the corresponding parameter, this PR suggests including a split point in a cluster if it is a steep downward point. It also checks whether the last point of a cluster is noise.