[MRG+1] [DOC] Adding GMM to plot_cluster_comparison.py #6305
Conversation
Adding GMM to plot_cluster_comparison.py and changing number of components in all algos to 3.
Looks great in general. The idea of setting the number of clusters to 2 was to explore the failure mode of clustering. But I have no strong feeling in this direction. If we keep it at two, we should at least explain in the docstring why we set it this way. WDYT?
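For context, the comparison treats GaussianMixture like any other clusterer: fit, then predict labels. A minimal sketch of that usage (not the PR's exact code, and assuming scikit-learn >= 0.18, where `sklearn.mixture.GaussianMixture` replaced the deprecated `GMM` class; the toy data here is illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Three well-separated blobs as a toy stand-in for the example's datasets
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# KMeans exposes labels_ after fitting
km = KMeans(n_clusters=3, random_state=0).fit(X)
km_labels = km.labels_

# GaussianMixture has no labels_ attribute; call predict() instead
gmm = GaussianMixture(n_components=3, covariance_type='full',
                      random_state=0).fit(X)
gmm_labels = gmm.predict(X)

print(len(np.unique(km_labels)), len(np.unique(gmm_labels)))
```

The `n_clusters` / `n_components` value is exactly the parameter whose mis-tuning (2 vs. 3) is being debated in this thread.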
@GaelVaroquaux I don't feel super strongly about it either. It just seems that all of these algos have at least one parameter that needs to be specified, and by intentionally mis-tuning that parameter for the algos that require it, we make them look worse than they are. Another possibility is to set the number of clusters higher than three to see the failure mode in that direction. Or, we could just make two or three plots: with n=2, n=3, and n=4 (this might be the best option). But yes, either way, I think an explanation would be good. Let me know what you think and I am happy to modify the PR.
Well, with 3 the first two plots are a bit more weird. And having three times as many plots doesn't seem very helpful to me. I feel that having the 2nd or 3rd of these datasets would be more interesting: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html For the current datasets, there is not that much of a difference between k-means and GMMs.
@amueller Good suggestion. Adding those two datasets definitely paints a broader picture of how each algo performs. Let me know what you think and I can push those changes.
+1 on adding additional datasets
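The two extra datasets from plot_kmeans_assumptions are the blobs with varied variances and the anisotropically transformed blobs. A hedged sketch of how they can be generated; the names `varied` and `aniso` match the snippet quoted later in this review, but the exact parameters here are illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs

n_samples = 1500

# Blobs with a different standard deviation per cluster
varied = make_blobs(n_samples=n_samples,
                    cluster_std=[1.0, 2.5, 0.5],
                    random_state=170)

# Anisotropic blobs: linearly transform standard isotropic blobs
X, y = make_blobs(n_samples=n_samples, random_state=170)
transformation = [[0.6, -0.6], [-0.4, 0.8]]
aniso = (np.dot(X, transformation), y)
```

Both are `(X, y)` tuples, so they can be appended to the example's `datasets` list alongside `noisy_circles`, `noisy_moons`, etc.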
@jakevdp @amueller @GaelVaroquaux I pushed the changes. The built output looks like this:
@tguillemot: I'll take a look.
Adding GMM to plot_cluster_comparison.py and changing number of components in all algos to 3.
@tguillemot I made the modification. It is ready to merge.
@jakevdp @amueller @GaelVaroquaux let me know if there is anything else I need to do to make this mergeable.
datasets = [noisy_circles, noisy_moons, blobs, no_structure]
# noisy_circles, noisy_moons, blobs, no_structure, varied
datasets = [noisy_circles, noisy_moons, blobs, no_structure, varied,aniso] |
Add a space before aniso
It's not a big deal, but I think it's better to put no_structure at the end. It's just to easily compare the results between the Gaussian datasets.
@gte620v Thanks for the modification. It's really a good idea to add GMM to that comparison. Thx
fixing lint errors; changing order of datasets in the columns so that no_structure is at the end.
@tguillemot Thanks for the review! I made the lint fixes and reorganized the order of the datasets. I am unsure about the remaining warnings. I can fix these, but it would significantly convolute this PR if I start mucking around with the core k-means code. I think it would be best to leave that to another PR. Let me know if you see another way of doing this.
Yes, it's caused by a deprecated function in numpy.
I am not sure what is causing those warnings; they also appear in this example as it currently stands.
I am afraid I do not know enough about the inner workings of the algos to fix this. Let me know if you have suggestions.
Thank you for trying to solve that.
Great; thanks.
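One way to keep the example output clean without touching the core code, as the "adding warning supression" commit does, is to silence the known warnings locally. An illustrative sketch; the specific warning categories here are assumptions, not necessarily the exact ones this example raises:

```python
import warnings
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, random_state=0)

# Suppress known-benign warnings only around the fit call,
# leaving the global warning filters untouched
with warnings.catch_warnings():
    warnings.filterwarnings('ignore', category=DeprecationWarning)
    warnings.filterwarnings('ignore', category=UserWarning)
    labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

print(labels.shape)
```

Scoping the filter with `catch_warnings()` means readers who run the example with other estimators still see any new warnings those raise.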
Thanks for looking! @tguillemot I made that change using islice. @amueller islice seems to work on 2.7, 3.4, and 3.5.
for i_dataset, (dataset, params) in enumerate(datasets):
    # update parameters with dataset-specific values
    defaults = default_base.copy()
    defaults.update(params)
Rather than creating a new defaults var you can use estimator.set_params, where estimator is your estimator.
In thinking about this, I can see how we might move the estimators out of the loop and then update the parameters with set_params, but I don't see how we get rid of the defaults variable without replacing it with something more verbose. Since not all keywords need to be changed for each dataset, I would have to add another bit of code that checks whether a keyword we need is in params for that dataset; if it isn't, I'd have to fall back to the default_base values. I think updating the keyword dict defaults, as we have, is the simplest way to have parameters that change per dataset-estimator pair.

The other option is to have params contain both the default_base and the dataset-specific values enumerated for each dataset, but then it is much harder to see where the dataset parameter excursions are.

Let me know if I am missing your point.
@tguillemot any thoughts?
I think it is ugly naming it defaults below when you mean params. Either use params.get('key', default_val) or call it params within the loop.
LGTM after you correct my comments.
@tguillemot I didn't follow your suggestion above. Can you provide clarification? Otherwise I think it is good to go. If we don't want to merge, that is cool too; we can close the PR.
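To make the pattern under discussion concrete: per-dataset overrides merged onto shared defaults. The names `default_base`, `params`, and `defaults` mirror the snippet quoted above; the keys and values are illustrative, not the example's real parameters:

```python
# Shared defaults for every dataset
default_base = {'n_clusters': 3, 'damping': 0.9}

# Each dataset carries only the keys it overrides
datasets = [
    ('noisy_circles', {'n_clusters': 2}),  # override one key
    ('blobs', {}),                         # keep all defaults
]

for name, params in datasets:
    # merge overrides onto a fresh copy of the defaults
    defaults = default_base.copy()
    defaults.update(params)

    # the reviewer's alternative: per-key lookup with a fallback
    n_clusters = params.get('n_clusters', default_base['n_clusters'])
    assert n_clusters == defaults['n_clusters']
    print(name, defaults)
```

The two approaches are equivalent here; the copy-and-update version keeps all resolved parameters in one dict, while `params.get(...)` avoids the extra variable at the cost of one lookup per key.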
create many clusters. Thus in this example its two parameters
(damping and per-point preference) were set to mitigate this
behavior.
but still in 2D. With the expection of the last dataset, |
*exception
Fixed
plt.xlim(-2, 2)
plt.ylim(-2, 2)
colors = np.array(list(islice(cycle('bgrcmyk'), |
Use ['navy', 'yellowgreen', 'gold', 'cornflowerblue'] to be color-blind compatible.
The latest commit makes the above plot.
Here is another option for color-blind-friendly colors: @TomDLT your color cycle seems like it would be problematic for blue–yellow color blindness (this is what is currently in the PR). What do you think? Here is the original for reference:
I have no strong feeling about it. The colors I suggested are used in other examples in scikit-learn, but if you think this one is better I am fine with it.
@TomDLT Ok, great. I picked the first colormap for the latest commit. I think all is good to merge now.
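The color handling in the snippet quoted above boils down to cycling a finite palette out to the number of predicted labels. A sketch using the palette suggested in this thread; `max_label` is a stand-in for the actual label count in the example:

```python
from itertools import cycle, islice

import numpy as np

# Color-blind-friendlier palette suggested in this review
palette = ['navy', 'yellowgreen', 'gold', 'cornflowerblue']

# Stand-in for e.g. int(max(y_pred)) + 1 in the real example;
# islice requires an integer stop, hence the int() cast mentioned
# in the commit "fixing islice stop to be an int"
max_label = 6

# Repeat the palette as needed so every label gets a color
colors = np.array(list(islice(cycle(palette), int(max_label))))
print(colors)
```

This yields one color string per label, wrapping around when there are more labels than palette entries, so it works for clusterers like DBSCAN whose label count is not fixed in advance.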
@jnothman @TomDLT @tguillemot
This looks good to me, but I'm going to let one of the people who have been involved in the discussion for longer be the one to merge it. Thanks for the contribution and patience!
Thanks for the patience @gte620v !
Np. Thanks!!
Thanks @gte620v
…6305)
* Adding GMM to plot_cluster_comparison.py and changing number of components in all algos to 3.
* adding two datasets to clustering comparison example
* Adding GMM to plot_cluster_comparison.py and changing number of components in all algos to 3.
* adding two datasets to clustering comparison example
* GMM example using GaussianMixture
* fixing lint errors; changing order of datasets in the columns so that no_structure is at the end
* adding warning suppression
* fixing warning suppression
* hand-tuned cluster parameters
* moved list of algo names; cleaning up color cycling
* fixing islice stop to be an int
* change default to params, make plot color-blind compatible, fix spelling error
* new color palette that is more color-blind friendly
Changes
Output
Once the docs are built, the example plot looks like this:
