-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG+1] OPTICS fit uses the selected extract_method
parameter
#12087
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+1] OPTICS fit uses the selected extract_method
parameter
#12087
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests, please
1e8e963
to
58913d8
Compare
I've added two tests, but we could add more edge case tests if you wish. I feel like, either there should be a public |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sqlnk is not referenced by that name anywhere in the documentation.
I suppose this is a reasonable design. Perhaps let's make the extract_dbscan
method into a public function extract_optics_dbscan
and similar for sqlnk, xi, ...? That way they are simplified to not need the if blah is None: blah = self.blah
logic and to explicitly account for ordering
and reachability
as the sufficient statistics.
I guess we could have all those functions public as you say, and let the user either pass whatever's in the instance or explicitly set the values. My intention with the construct (i.e. having the default as None and not our default values) was to make it convenient for users to set only one or few parameters and the rest taken from the instance, which could be set beforehand and something other than the default values. But I guess it's not necessary. |
Oh, |
extract_method
parameterextract_method
parameter
Oh, extract_dbscan actually checks weather the given eps has a reasonable
compared to max_eps, which we won't be giving to extract_optics_dbscan if
made a separate public method. Should we then keep them in the class?
Shrug. We could warn in the class but not in the function.
…On Tue, 18 Sep 2018 at 03:04, Adrin Jalali ***@***.***> wrote:
Oh, extract_dbscan actually checks weather the given eps has a reasonable
compared to max_eps, which we won't be giving to extract_optics_dbscan if
made a separate public method. Should we then keep them in the class?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#12087 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz60VGSFmkdfErR7TcdaCXN3NSJgsyks5ub9YlgaJpZM4WqVdm>
.
|
@jnothman, gentle review ping :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope to look at this properly soon...
sklearn/cluster/optics_.py
Outdated
extract_method : string, optional (default='sqlnk') | ||
The extraction method used to extract clusters using the calculated | ||
reachability and ordering. Possible values are "dbscan" | ||
and "sqlnk". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sqlnk is not mentioned in the repo. The user has little way to identify its meaning.
We'd need to improve the user guides once we're done with the implementations anyway, but for now I added a reference to the class's description for the method. |
FYI @adrinjalali All the known bugs of CD/RD/order/predecessor have been fixed. You can try Joel's solution here. Meanwhile, we need to re-evaluate the OPTICS-related issues you raised and figure out whether these bugs still exist. |
@jnothman I thought we decided not to have those methods and only have the public functions. Wanna go back to that discussion? :p |
extract_method
parameterextract_method
parameter
I think we probably decided we'd leave the methods discussion until later.
But I think it's a good point that the dbscan functionality is useless in
the estimator without memory of some kind.
|
We can tackle that on the separate yet_to_come |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from a cursory look: I haven't check the mathematical validity of all lines.
@@ -116,8 +116,6 @@ | |||
n_clusters=params['n_clusters'], eigen_solver='arpack', | |||
affinity="nearest_neighbors") | |||
dbscan = cluster.DBSCAN(eps=params['eps']) | |||
optics = cluster.OPTICS(min_samples=30, maxima_ratio=.8, | |||
rejection_ratio=.4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is optics being removed from this example?
I find this example very useful to have a big picture view of clustering in scikit-learn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example will be added when we add the xi
method (the method from the original paper) in #12077. This moves the OPTICS to an intermediate stage for us to focus on the other PR, and then better examples, and putting OPTICS back in examples.
|
||
if self.cluster_method not in ['dbscan']: | ||
raise ValueError("cluster_method should be one of" | ||
" 'dbscan', but is %s" % |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds a little bit like multiple-personality disorder. If there is only one option, then the phrasing my be disturbing to newcomers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is not putting OPTICS in a stable and usable state, it's applying most decisions we made durint the sprint, documented also here: #12044 (comment)
return_distance=False)[0] | ||
|
||
# Getting indices of neighbors that have not been processed | ||
unproc = np.compress((~np.take(processed, indices)).ravel(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty crowded set of operations. Consider splitting into 2 lines for better readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is not touching those lines. These parts are from other PRs and I don't want to touch them here. They're only in the changelog cause they've been moved around.
The example will be added when we add the xi method (the method from the original paper)
Sounds right! Thanks
|
@GaelVaroquaux BTW, the math behind calculating the reachability and core distances were checked thoroughly before in other PRs, this PR is just moving them around. So I'd say that part is safe. It'd be nice to either merge this or have more reviews on it. |
Point taken. It's hard to have big picture on this PR. This next one will help. Merging! |
Hi, I'm not sure this is the correct place to ask this, but I guess that this PR has something to do with my issues. Thanks, |
@xluccianox yes, after this PR, OPTICS on master is almost merely a fancy DBSCAN. #12077 is the one implementing the rest of OPTICS! |
Sorry Luciano, we have left master in a poor state and should probably be
doing this in another branch. Optics you were using in master was doing
non-standards and perhaps over-complicated things. But it was "working"
while the current version in master is not.
|
ping @adrinjalali we support precomputed distance matrix? Am I wrong? |
Yes @qinhanmin2014 , we added it here: #12028 |
So the doc is wrong? @adrinjalali |
Yes it is, seems like we forgot to add |
Hi everybody, first thanks for your work ! I have a little question about this topic. If I take into account the fact that extract_method='xi' is not yet implemented and so I can't use min_cluster_size parameter (that's right ?). I don't understand why when I run : OPTICS(eps=5, max_eps=10, min_samples=5, algorithm='auto',metric='precomputed').fit(dist) there are clusters with less than min_samples points. If you have less than min_samples in a cluster, why this points are not considered as noise points ? |
The min_samples in this context is just to find core samples, and doesn't have too much to do with the cluster size really. Here there's a simple threshold on the reachability plot is applied (5 in the case of your example), and then the clusters are reported. |
@adrinjalali I think so that I don't understand well the role of min_samples. When you say "just to find core samples" these samples are not precisely the base of our clusters ? So if I have to define a min_size for my clusters (as in DB_scan when I define Min_Pts, I have not cluster with less than Min_Pts) it is possible ? |
@vegasmagician it has the same meaning as minPts in DBSCAN:
DBSCAN can also produce clusters with fewer than minPts points, but that is a rare corner case. In OPTICS this is common, because of the hierarchical nature. A cluster can contain two subclusters and a single additional point connecting the two clusters. A naive approach using labels will see this as a cluster with a single point only. |
Thanks for your answer, I understand better. |
@vegasmagician you would be able to get the hierarchy after #12077 is merged. |
…it-learn#12087) * add extract method parameter * change fancy to sqlnk * test for invalid extract_method value * test easy dbscan * pep8 * add extract_sqlnk * mention which parameter is used for which extact method * add tests for extract methods with no params given * Add the reference for the SQLNK method. * make extract_optics_dbscan and extract_optics_sqlnk public * fix references, sync docstrings * fix more morege conflicts * reorganize the code and get it closer to what we want * pep8 * remove sqlnk, make compute_optics_graph public * pep8 * remove unused reference * make tests pass * fix docstrings * fix removal of removed parameter * add back min_cluster size, xi needs it * pep8 * fix docstring issue * address Hanmin's comments * remove core_sample_indices_ * fix comment * pep8 * apply comments, remove example * add public functions to classes.rst and __init__ * address Joel's comments
…er (scikit-learn#12087)" This reverts commit 2216111.
…er (scikit-learn#12087)" This reverts commit 2216111.
…it-learn#12087) * add extract method parameter * change fancy to sqlnk * test for invalid extract_method value * test easy dbscan * pep8 * add extract_sqlnk * mention which parameter is used for which extact method * add tests for extract methods with no params given * Add the reference for the SQLNK method. * make extract_optics_dbscan and extract_optics_sqlnk public * fix references, sync docstrings * fix more morege conflicts * reorganize the code and get it closer to what we want * pep8 * remove sqlnk, make compute_optics_graph public * pep8 * remove unused reference * make tests pass * fix docstrings * fix removal of removed parameter * add back min_cluster size, xi needs it * pep8 * fix docstring issue * address Hanmin's comments * remove core_sample_indices_ * fix comment * pep8 * apply comments, remove example * add public functions to classes.rst and __init__ * address Joel's comments
Fixes #12036, related to #12044
Fixes #12375
Closes #12049
Closes #11857
Add
extract_method
parameter toOPTICS
init, and make fit use the appropriate method.