Add a more useful example to the cluster comparison #2890
Comments
Also, it would be really cool to list the complexity of each algorithm (and perhaps their sub-algorithms where relevant) in Big O notation, for both the number of samples and the number of variables. This would be more informative than the time-taken numbers for the simple toy datasets.
Pull request welcome for both additions.
Can I take this up?
Sure, go for it. Can't remember why I dropped the ball on this... Let me know if I can be of use.
I have a PR for the plot, but it does not cover the second suggestion (listing algorithm complexities), since I am not confident of the complexity of each algorithm.
This issue can be closed by the MRG PR #6305, which adds a GMM example using sklearn.datasets.make_blobs.
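For reference, `make_blobs` can produce exactly this kind of data: several isotropic Gaussian clusters whose spreads can differ per cluster. The parameter values below are illustrative, not taken from the PR:

```python
from sklearn.datasets import make_blobs

# Three 2-D Gaussian clusters with different spreads
# (cluster_std accepts one value per center).
# All parameter values here are illustrative guesses.
X, y = make_blobs(
    n_samples=300,
    centers=3,
    cluster_std=[0.5, 1.5, 3.0],  # one tight, one medium, one wide cluster
    random_state=0,
)
print(X.shape, y.shape)  # (300, 2) (300,)
```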
This issue is about adding another dataset, not a method. I actually quite like the dataset here: https://hdbscan.readthedocs.io/en/latest/advanced_hdbscan.html#getting-more-information-about-a-clustering
oh actually fixed, sorry @andreanr |
The clustering comparison at http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html is somewhat misleading in that the data are totally unlike anything that would be seen in 99% of cases. I realise that they're toy examples, but it would also be good to get something more realistic for comparison.
Here is a simple dataset that has one wide Gaussian distribution, with two smaller Gaussian distributions overlapping it to different extents:
This shows the performance of the various models on more realistic data. It especially shows that DBSCAN isn't perfect ;)
Here's the dataset:
The parameters are all arbitrary, and the performances for the various algorithms change a fair bit given different parameters, but since this is just to give a rough idea of the relative performances, I don't think that matters that much.
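A dataset matching the description above could be generated along these lines. All sample counts, means, covariances, and the DBSCAN parameters are illustrative guesses, not the original values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)

# One wide "background" Gaussian ...
wide = rng.multivariate_normal(mean=[0, 0], cov=[[9, 0], [0, 9]], size=500)
# ... and two smaller Gaussians overlapping it to different extents.
heavy_overlap = rng.multivariate_normal(mean=[1, 1], cov=[[0.3, 0], [0, 0.3]], size=150)
light_overlap = rng.multivariate_normal(mean=[5, 5], cov=[[0.3, 0], [0, 0.3]], size=150)

X = np.vstack([wide, heavy_overlap, light_overlap])
y_true = np.repeat([0, 1, 2], [500, 150, 150])

# Overlapping clusters with very different densities are hard to separate
# with a single global eps, which is why DBSCAN struggles on data like this.
labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(X)
print(X.shape, labels.shape)  # (800, 2) (800,)
```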