
Add a more useful example to the cluster comparison #2890


Closed · naught101 opened this issue Feb 24, 2014 · 8 comments
Labels: Documentation, Easy, Enhancement, good first issue, help wanted

Comments

@naught101

The clustering comparison at http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html is somewhat misleading in that the data are totally unlike anything that would be seen in 99% of cases. I realise that they're toy examples, but it would also be good to get something more realistic for comparison.

Here is a simple dataset that has one wide Gaussian distribution, with two smaller Gaussian distributions overlapping it to different extents:

[Figure: clustering comparison on the overlapping-Gaussians dataset]

This shows the performance of the various models on more realistic data. It especially shows that DBSCAN isn't perfect ;)

Here's the dataset:

import numpy as np

# One wide Gaussian plus two smaller overlapping Gaussians; the trailing None
# matches the (data, labels) tuple format of the other datasets in the example.
gaussians = np.concatenate([
    np.multiply(np.random.randn(500, 2), 10),
    np.add(np.random.randn(500, 2), (10, 10)),
    np.add(np.random.randn(500, 2), (-5, -5)),
]), None

The parameters are all arbitrary, and the performance of the various algorithms changes a fair bit with different parameters, but since this is just meant to give a rough idea of relative performance, I don't think that matters much.
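
As a rough sketch of how such a dataset could be compared across a couple of the algorithms (KMeans and DBSCAN picked for illustration; the seed, eps, and n_clusters values are arbitrary assumptions, not taken from the comparison example):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler

# Regenerate the dataset above with a fixed seed (seed and offsets are arbitrary).
rng = np.random.RandomState(0)
X = np.concatenate([
    rng.randn(500, 2) * 10,        # one wide Gaussian
    rng.randn(500, 2) + (10, 10),  # small Gaussian overlapping the edge
    rng.randn(500, 2) + (-5, -5),  # small Gaussian inside the wide one
])
X = StandardScaler().fit_transform(X)  # the comparison example also standardises

# Two of the algorithms from the comparison; eps and n_clusters are guesses.
algorithms = [("KMeans", KMeans(n_clusters=3, n_init=10)),
              ("DBSCAN", DBSCAN(eps=0.15))]
fig, axes = plt.subplots(1, len(algorithms), figsize=(8, 4))
for ax, (name, algo) in zip(axes, algorithms):
    labels = algo.fit_predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=5)
    ax.set_title(name)
plt.show()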

@naught101
Author

Also, it would be really cool to list the complexity of each algorithm (and perhaps their sub-algorithms where relevant) in Big O notation, for both the number of samples and the number of variables. This would be more informative than the time-taken numbers for the simple toy datasets.
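
As a sketch of the point, an empirical scaling check for a single algorithm (KMeans picked arbitrarily; the sizes and random data are assumptions) shows the kind of information a stated Big O would convey directly:

import time
import numpy as np
from sklearn.cluster import KMeans

# Empirical scaling in n_samples for one algorithm on random 2-D data.
# A documented cost (roughly O(n_samples * n_clusters * n_iter) per KMeans
# run with Lloyd's algorithm) would convey this without benchmarking.
rng = np.random.RandomState(0)
for n_samples in (1_000, 10_000, 100_000):
    X = rng.randn(n_samples, 2)
    start = time.perf_counter()
    KMeans(n_clusters=3, n_init=10).fit(X)
    print(f"n_samples={n_samples:>7}: {time.perf_counter() - start:.2f}s")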

@GaelVaroquaux
Member

Pull request welcomed for both additions.

@vincentpham1991
Contributor

Can I take this up?

@naught101
Author

naught101 commented Jan 27, 2017 via email

@vincentpham1991
Contributor

I have a PR for the plot, but it does not cover the second suggestion (adding the complexities), since I am not confident about the complexity of each algorithm.

@amueller added the Easy, good first issue, and help wanted labels Aug 5, 2019
@andreanr
Contributor

This issue can be closed by MRG #6305, which adds GMM using sklearn.datasets.make_blobs.

@amueller
Member

This issue is about adding another dataset, not a method. I actually quite like the dataset here: https://hdbscan.readthedocs.io/en/latest/advanced_hdbscan.html#getting-more-information-about-a-clustering, but something like the one in the original issue is also good.

@amueller
Member

oh actually fixed, sorry @andreanr
