Skip to content

[MRG+1] OPTICS fit uses the selected extract_method parameter #12087

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

adrinjalali
Copy link
Member

@adrinjalali adrinjalali commented Sep 15, 2018

Fixes #12036, related to #12044
Fixes #12375
Closes #12049
Closes #11857

Add extract_method parameter to OPTICS init, and make fit use the appropriate method.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests, please

@adrinjalali adrinjalali force-pushed the optics/choose_extractor branch from 1e8e963 to 58913d8 Compare September 16, 2018 09:31
@adrinjalali
Copy link
Member Author

I've added two tests, but we could add more edge case tests if you wish.

I feel like, either there should be a public extract_ method for all extract methods, or all should be private. I'm happy to have them all public with good docstrings to better explain the parameters and their relation to the extract methods.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sqlnk is not referenced by that name anywhere in the documentation.

I suppose this is a reasonable design. Perhaps let's make the extract_dbscan method into a public function extract_optics_dbscan and similar for sqlnk, xi, ...? That way they are simplified to not need the if blah is None: blah = self.blah logic and to explicitly account for ordering and reachability as the sufficient statistics.

@adrinjalali
Copy link
Member Author

I guess we could have all those functions public as you say, and let the user either pass whatever's in the instance or explicitly set the values. My intention with the construct (i.e. having the default as None and not our default values) was to make it convenient for users to set only one or few parameters and the rest taken from the instance, which could be set beforehand and something other than the default values. But I guess it's not necessary.

@adrinjalali
Copy link
Member Author

Oh, extract_dbscan actually checks weather the given eps has a reasonable compared to max_eps, which we won't be giving to extract_optics_dbscan if made a separate public method. Should we then keep them in the class?

@adrinjalali adrinjalali changed the title [WIP] OPTICS fit uses the selected extract_method parameter [MRG] OPTICS fit uses the selected extract_method parameter Sep 20, 2018
@jnothman
Copy link
Member

jnothman commented Sep 22, 2018 via email

@adrinjalali
Copy link
Member Author

@jnothman, gentle review ping :)

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope to look at this properly soon...

extract_method : string, optional (default='sqlnk')
The extraction method used to extract clusters using the calculated
reachability and ordering. Possible values are "dbscan"
and "sqlnk".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sqlnk is not mentioned in the repo. The user has little way to identify its meaning.

@adrinjalali
Copy link
Member Author

We'd need to improve the user guides once we're done with the implementations anyway, but for now I added a reference to the class's description for the method.

@qinhanmin2014
Copy link
Member

FYI @adrinjalali All the known bugs of CD/RD/order/predecessor have been fixed. You can try Joel's solution here. Meanwhile, we need to re-evaluate the OPTICS-related issues you raised and figure out whether these bugs still exist.

@adrinjalali
Copy link
Member Author

@jnothman I thought we decided not to have those methods and only have the public functions. Wanna go back to that discussion? :p

@adrinjalali adrinjalali changed the title [MRG] OPTICS fit uses the selected extract_method parameter [MRG+1] OPTICS fit uses the selected extract_method parameter Mar 1, 2019
@jnothman
Copy link
Member

jnothman commented Mar 1, 2019 via email

@adrinjalali
Copy link
Member Author

We can tackle that on the separate yet_to_come warm_start PR.

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from a cursory look: I haven't check the mathematical validity of all lines.

@@ -116,8 +116,6 @@
n_clusters=params['n_clusters'], eigen_solver='arpack',
affinity="nearest_neighbors")
dbscan = cluster.DBSCAN(eps=params['eps'])
optics = cluster.OPTICS(min_samples=30, maxima_ratio=.8,
rejection_ratio=.4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is optics being removed from this example?

I find this example very useful to have a big picture view of clustering in scikit-learn.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example will be added when we add the xi method (the method from the original paper) in #12077. This moves the OPTICS to an intermediate stage for us to focus on the other PR, and then better examples, and putting OPTICS back in examples.


if self.cluster_method not in ['dbscan']:
raise ValueError("cluster_method should be one of"
" 'dbscan', but is %s" %
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds a little bit like multiple-personality disorder. If there is only one option, then the phrasing my be disturbing to newcomers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is not putting OPTICS in a stable and usable state, it's applying most decisions we made durint the sprint, documented also here: #12044 (comment)

return_distance=False)[0]

# Getting indices of neighbors that have not been processed
unproc = np.compress((~np.take(processed, indices)).ravel(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty crowded set of operations. Consider splitting into 2 lines for better readability

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is not touching those lines. These parts are from other PRs and I don't want to touch them here. They're only in the changelog cause they've been moved around.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Mar 2, 2019 via email

@adrinjalali
Copy link
Member Author

@GaelVaroquaux BTW, the math behind calculating the reachability and core distances were checked thoroughly before in other PRs, this PR is just moving them around. So I'd say that part is safe.

It'd be nice to either merge this or have more reviews on it.

@GaelVaroquaux
Copy link
Member

Point taken.

It's hard to have big picture on this PR. This next one will help.

Merging!

@GaelVaroquaux GaelVaroquaux merged commit 40441e8 into scikit-learn:master Mar 3, 2019
@adrinjalali adrinjalali deleted the optics/choose_extractor branch March 4, 2019 10:13
@righel
Copy link

righel commented Mar 15, 2019

Hi, I'm not sure this is the correct place to ask this, but I guess that this PR has something to do with my issues.
I'm using this OPTICS implementation and I noticed a considerable change in the results after this PR was merged to master. Any summary on what changed? and why the results with same input data and clustering parameters differ so much?

Thanks,
Luciano.

@adrinjalali
Copy link
Member Author

@xluccianox yes, after this PR, OPTICS on master is almost merely a fancy DBSCAN. #12077 is the one implementing the rest of OPTICS!

@jnothman
Copy link
Member

jnothman commented Mar 16, 2019 via email

@qinhanmin2014
Copy link
Member

Distance matrices are not supported.

ping @adrinjalali we support precomputed distance matrix? Am I wrong?

@adrinjalali
Copy link
Member Author

Yes @qinhanmin2014 , we added it here: #12028

@qinhanmin2014
Copy link
Member

Yes @qinhanmin2014 , we added it here: #12028

So the doc is wrong? @adrinjalali

@adrinjalali
Copy link
Member Author

Yes it is, seems like we forgot to add 'precomputed' to the docstring in that PR.

@vegasmagician
Copy link

vegasmagician commented Apr 9, 2019

Hi everybody, first thanks for your work ! I have a little question about this topic.

If I take into account the fact that extract_method='xi' is not yet implemented and so I can't use min_cluster_size parameter (that's right ?). I don't understand why when I run :

OPTICS(eps=5, max_eps=10, min_samples=5, algorithm='auto',metric='precomputed').fit(dist)

there are clusters with less than min_samples points. If you have less than min_samples in a cluster, why this points are not considered as noise points ?

@adrinjalali
Copy link
Member Author

The min_samples in this context is just to find core samples, and doesn't have too much to do with the cluster size really. Here there's a simple threshold on the reachability plot is applied (5 in the case of your example), and then the clusters are reported.

@vegasmagician
Copy link

vegasmagician commented Apr 10, 2019

@adrinjalali I think so that I don't understand well the role of min_samples. When you say "just to find core samples" these samples are not precisely the base of our clusters ? So if I have to define a min_size for my clusters (as in DB_scan when I define Min_Pts, I have not cluster with less than Min_Pts) it is possible ?

@kno10
Copy link
Contributor

kno10 commented Apr 12, 2019

@vegasmagician it has the same meaning as minPts in DBSCAN:

The number of samples in a neighborhood for a point to be considered as a core point.
But given large enough max_eps every point eventually becomes a core point. So instead of a "core point property", you have a "core distance" for every point, the distance when it becomes a core point.

DBSCAN can also produce clusters with fewer than minPts points, but that is a rare corner case.

In OPTICS this is common, because of the hierarchical nature. A cluster can contain two subclusters and a single additional point connecting the two clusters. A naive approach using labels will see this as a cluster with a single point only.

@vegasmagician
Copy link

Thanks for your answer, I understand better.
So using your module, how can I show the this hierarchy of clusters ?

@adrinjalali
Copy link
Member Author

@vegasmagician you would be able to get the hierarchy after #12077 is merged.

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
…it-learn#12087)

* add extract method parameter

* change fancy to sqlnk

* test for invalid extract_method value

* test easy dbscan

* pep8

* add extract_sqlnk

* mention which parameter is used for which extact method

* add tests for extract methods with no params given

* Add the reference for the SQLNK method.

* make extract_optics_dbscan and extract_optics_sqlnk public

* fix references, sync docstrings

* fix more morege conflicts

* reorganize the code and get it closer to what we want

* pep8

* remove sqlnk, make compute_optics_graph public

* pep8

* remove unused reference

* make tests pass

* fix docstrings

* fix removal of removed parameter

* add back min_cluster size, xi needs it

* pep8

* fix docstring issue

* address Hanmin's comments

* remove core_sample_indices_

* fix comment

* pep8

* apply comments, remove example

* add public functions to classes.rst and __init__

* address Joel's comments
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
…it-learn#12087)

* add extract method parameter

* change fancy to sqlnk

* test for invalid extract_method value

* test easy dbscan

* pep8

* add extract_sqlnk

* mention which parameter is used for which extact method

* add tests for extract methods with no params given

* Add the reference for the SQLNK method.

* make extract_optics_dbscan and extract_optics_sqlnk public

* fix references, sync docstrings

* fix more morege conflicts

* reorganize the code and get it closer to what we want

* pep8

* remove sqlnk, make compute_optics_graph public

* pep8

* remove unused reference

* make tests pass

* fix docstrings

* fix removal of removed parameter

* add back min_cluster size, xi needs it

* pep8

* fix docstring issue

* address Hanmin's comments

* remove core_sample_indices_

* fix comment

* pep8

* apply comments, remove example

* add public functions to classes.rst and __init__

* address Joel's comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OPTICS Reconsider whether there're unnecessary parameters OPTICS extraction methods
9 participants