[MRG+1] [DOC] Adding GMM to plot_cluster_comparison.py #6305


Merged: 16 commits into scikit-learn:master on Mar 21, 2017

Conversation

@gte620v (Contributor) commented Feb 8, 2016

Changes

  • Added GMM to plot_cluster_comparison.py (see the sketch below)
  • Changed the number of clusters in all algorithms to three, since one of the datasets clearly has three clusters. As it stands, the example makes it look like several of the clustering algorithms are unable to identify clearly distinct clusters.
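
A minimal sketch of how the mixture model slots into the example's fit/predict loop, assuming the loop structure of plot_cluster_comparison.py; the variable names (clustering_algorithms, X) and the use of the newer GaussianMixture class (which the thread settles on later) are illustrative, not the exact merged code:

from sklearn import cluster, datasets, mixture

# Toy data standing in for the example's datasets
X, _ = datasets.make_blobs(n_samples=1500, random_state=8)

two_means = cluster.MiniBatchKMeans(n_clusters=3)
gmm = mixture.GaussianMixture(n_components=3, covariance_type='full')

clustering_algorithms = [
    ('MiniBatchKMeans', two_means),
    ('GaussianMixture', gmm),
]

for name, algorithm in clustering_algorithms:
    algorithm.fit(X)
    # GaussianMixture has no labels_ attribute, so fall back to predict
    if hasattr(algorithm, 'labels_'):
        y_pred = algorithm.labels_.astype(int)
    else:
        y_pred = algorithm.predict(X)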

Output

Once the docs are built, the example plot looks like this:
[image]

@GaelVaroquaux (Member)

Looks great in general.

The idea of setting the number of clusters to 2 was to explore the failure mode of clustering. But I have no strong feeling in this direction.

If we keep it at two, we should at least explain in the docstring why we set it this way. WDYT?

@gte620v (Contributor, Author) commented Feb 8, 2016

@GaelVaroquaux I don't feel super strongly about it either. It just seems that all of these algos have at least one parameter that needs to be specified; by intentionally mis-tuning that parameter for the algos that require n_clusters as an input, while leaving the other algos' parameters well tuned, we somewhat misrepresent the relative performance.

Another possibility is to set the number of clusters higher than three to see the failure mode in that direction. Or we could just make multiple plots, with n=2, n=3, and n=4 (this might be the best option).

But yes, either way, I think an explanation would be good. Something like

The results could be improved by tweaking the parameters for each clustering strategy, for instance setting the number of clusters for the methods that need this parameter specified.

to

The results could be improved by tweaking the parameters for each clustering strategy, for instance setting the number of clusters for the methods that need this parameter specified. In these examples the number of clusters for the algorithms that require n_clusters as a parameter has been set to two, which is clearly a mismatch with the datasets in the bottom two rows.

Let me know what you think and I am happy to modify the PR.

@amueller (Member) commented Feb 9, 2016

Well, with 3 the first two plots look a bit more weird, and having three times as many plots seems not very helpful to me.

I feel that having the 2nd or 3rd of these datasets would be more interesting: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html

For the current datasets, there is not that much of a difference between k-means and GMMs.

@gte620v (Contributor, Author) commented Feb 11, 2016

@amueller Good suggestion.

Adding those two datasets definitely paints a broader picture of how each algo performs.

Let me know what you think and I can push those changes.

[image]
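
For reference, a hedged sketch of how those two extra datasets from plot_kmeans_assumptions can be generated; the names varied and aniso match the diff reviewed later in this thread, but the exact constants are assumptions:

import numpy as np
from sklearn import datasets

n_samples = 1500

# Blobs with varied per-cluster variances
varied = datasets.make_blobs(n_samples=n_samples,
                             cluster_std=[1.0, 2.5, 0.5],
                             random_state=170)

# Anisotropically distributed blobs: stretch standard blobs
# with a linear transformation
X, y = datasets.make_blobs(n_samples=n_samples, random_state=170)
transformation = [[0.6, -0.6], [-0.4, 0.8]]
aniso = (np.dot(X, transformation), y)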

@jakevdp (Member) commented Feb 18, 2016

+1 on adding additional datasets

@gte620v (Contributor, Author) commented Feb 26, 2016

@jakevdp @amueller @GaelVaroquaux

I pushed the changes. The built output looks like this:

[image]

@tguillemot (Contributor)

The new GaussianMixture class has been merged (#6666).
Could you make the changes?

@gte620v (Contributor, Author) commented Apr 22, 2016

@tguillemot: I'll take a look.

@gte620v (Contributor, Author) commented Apr 22, 2016

@tguillemot I made the modification. It is ready to merge.

@gte620v (Contributor, Author) commented Apr 25, 2016

@jakevdp @amueller @GaelVaroquaux let me know if there is anything else I need to do to make this mergeable.


datasets = [noisy_circles, noisy_moons, blobs, no_structure]
# noisy_circles, noisy_moons, blobs, no_structure, varied
datasets = [noisy_circles, noisy_moons, blobs, no_structure, varied,aniso]
Review comment (Contributor):

Add a space before aniso

Review comment (Contributor):

It's not a big deal, but I think it's better to put no_structure at the end, just to make it easier to compare the results between the Gaussian datasets.

@tguillemot (Contributor)

@gte620v Thanks for the modification.
I know it was not you who introduced them, but there are some UserWarnings during execution.
Could you try to remove them?

It's really a good idea to add GMM to that comparison. Thanks!

@gte620v (Contributor, Author) commented Apr 25, 2016

@tguillemot Thanks for the review!

I made the lint fixes and reorganized the order of the datasets. I am unsure about removing the print(__doc__) line as all the other examples have that line.

The warnings are DeprecationWarnings caused by a change in how numpy wants random integers to be generated. For example:

/Users/bob/code/scikit-learn/sklearn/cluster/k_means_.py:1279: DeprecationWarning: This function is deprecated. Please call randint(0, 1499 + 1) instead
  0, n_samples - 1, init_size)
/Users/bob/code/scikit-learn/sklearn/cluster/k_means_.py:630: DeprecationWarning: This function is deprecated. Please call randint(0, 1499 + 1) instead
  0, n_samples - 1, init_size)
/Users/bob/code/scikit-learn/sklearn/cluster/k_means_.py:630: DeprecationWarning: This function is deprecated. Please call randint(0, 1499 + 1) instead
  0, n_samples - 1, init_size)
/Users/bob/code/scikit-learn/sklearn/cluster/k_means_.py:630: DeprecationWarning: This function is deprecated. Please call randint(0, 1499 + 1) instead
  0, n_samples - 1, init_size)
/Users/bob/code/scikit-learn/sklearn/cluster/k_means_.py:1328: DeprecationWarning: This function is deprecated. Please call randint(0, 1499 + 1) instead
  0, n_samples - 1, self.batch_size)
/Users/bob/code/scikit-learn/sklearn/cluster/k_means_.py:1328: DeprecationWarning: This function is deprecated. Please call randint(0, 1499 + 1) instead
  0, n_samples - 1, self.batch_size)
/Users/bob/code/scikit-learn/sklearn/cluster/k_means_.py:1328: DeprecationWarning: This function is deprecated. Please call randint(0, 1499 + 1) instead
  0, n_samples - 1, self.batch_size)
/Users/bob/code/scikit-learn/sklearn/cluster/k_means_.py:1328: DeprecationWarning: This function is deprecated. Please call randint(0, 1499 + 1) instead

See
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/k_means_.py#L629-L630
and
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/k_means_.py#L1327-L1328

I can fix these, but it would significantly convolute this PR if I start mucking around with the core k-means code. I think it would be best to leave that to another PR. Let me know if you see another way of doing it.
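
For context, a minimal sketch of the numpy deprecation behind these messages; the variable names are illustrative. RandomState.random_integers samples the closed interval [low, high], while randint samples the half-open [low, high), which is why the warning suggests randint(0, 1499 + 1) for n_samples = 1500:

import numpy as np

rng = np.random.RandomState(0)
n_samples, init_size = 1500, 300

# Deprecated call (emits the DeprecationWarning quoted above):
# idx = rng.random_integers(0, n_samples - 1, init_size)

# Equivalent replacement using a half-open interval:
idx = rng.randint(0, n_samples, init_size)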

@tguillemot (Contributor)

Yes, it's caused by a deprecated function in numpy.
I've sent #6712 to correct the randint problem.
But there are some other warnings caused by the code:

spectral_embedding_.py:229: UserWarning: Graph is not fully connected, spectral embedding may not work as expected.
hierarchical.py:194: UserWarning: the number of connected components of the connectivity matrix is 2 > 1. Completing it to avoid stopping the tree early.
hierarchical.py:419: UserWarning: the number of connected components of the connectivity matrix is 2 > 1. Completing it to avoid stopping the tree early.

@gte620v (Contributor, Author) commented Apr 27, 2016

I am not sure what is causing those warnings. They also appear in this example on master and seem to come up in several issues/PRs, e.g.:

#4313
#5327

From master:

scikit-learn/sklearn/manifold/spectral_embedding_.py:229: UserWarning: Graph is not fully connected, spectral embedding may not work as expected.
  warnings.warn("Graph is not fully connected, spectral embedding"
scikit-learn/sklearn/cluster/hierarchical.py:194: UserWarning: the number of connected components of the connectivity matrix is 2 > 1. Completing it to avoid stopping the tree early.
  connectivity, n_components = _fix_connectivity(X, connectivity)
scikit-learn/sklearn/cluster/hierarchical.py:419: UserWarning: the number of connected components of the connectivity matrix is 2 > 1. Completing it to avoid stopping the tree early.
  connectivity, n_components = _fix_connectivity(X, connectivity)

I am afraid I do not know enough about the inner workings of the algos to fix this. Let me know if you have suggestions.
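
One pragmatic option, which the PR's later commits adopt for the example ("adding warning suppression" in the commit log), is to filter the known, expected warnings around the fit call; the exact filter patterns below are assumptions:

import warnings

with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        message="Graph is not fully connected",
        category=UserWarning)
    warnings.filterwarnings(
        "ignore",
        message="the number of connected components",
        category=UserWarning)
    algorithm.fit(X)  # `algorithm` and `X` come from the example's loop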

@tguillemot (Contributor)

Thank you for trying to solve that.
The same problem indeed exists in master, and I think it would be good to solve it at the same time.
I will investigate the problem next week.

@gte620v (Contributor, Author) commented Apr 29, 2016

Great; thanks.

@gte620v (Contributor, Author) commented Jul 18, 2016

for i_dataset, (dataset, params) in enumerate(datasets):
    # update parameters with dataset-specific values
    defaults = default_base.copy()
    defaults.update(params)
Review comment (Contributor):

Rather than creating a new defaults variable, you can use estimator.set_params, where estimator is your estimator.

Review reply (Contributor, Author):

In thinking about this, I can see how we might move the estimators out of the loop and then update the parameters with set_params, but I don't see how we get rid of the defaults variable without replacing it with something more verbose.

Since not all keywords need to be changed for each dataset, I would have to add another bit of code that checks whether a keyword we need is in params for that dataset and, if it isn't, falls back to the default_base value. I think updating the keyword dict defaults, as we have, is the simplest way to have parameters that vary per dataset and estimator.

The other option is to enumerate both the default_base and the dataset-specific values in params for each dataset, but then it is much harder to see where the per-dataset parameter excursions are.

Let me know if I am missing your point.

Review reply (Contributor, Author):

@tguillemot any thoughts?

Review comment (Member):

I think it is ugly naming it defaults below when you mean params. Either use params.get('key', default_val) or call it params within the loop.
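
A minimal sketch of the two options being weighed; the names default_base, datasets, and algo_params follow the example's structure, and the stub values are assumptions:

default_base = {'n_clusters': 3}
datasets = [(None, {}), (None, {'n_clusters': 2})]

for i_dataset, (dataset, algo_params) in enumerate(datasets):
    # Option 1: copy the defaults and overlay dataset-specific values,
    # naming the merged dict `params` as suggested
    params = default_base.copy()
    params.update(algo_params)

    # Option 2: per-key lookup with a fallback, no merged dict
    n_clusters = algo_params.get('n_clusters',
                                 default_base['n_clusters'])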

@tguillemot (Contributor)

LGTM once you address my comments.

@gte620v (Contributor, Author) commented Mar 11, 2017

@tguillemot I didn't follow your suggestion above. Can you provide clarification? Otherwise I think it is good to go. If we don't want to merge, that is cool too; we can close the PR.

@tguillemot (Contributor) commented Mar 13, 2017

@gte620v Sorry for the late answer. What I said was not important but your examples are.

LGTM.

@raghavrv or @TomDLT Can you review this PR and merge it ?

create many clusters. Thus in this example its two parameters
(damping and per-point preference) were set to mitigate this
behavior.
but still in 2D. With the expection of the last dataset,
Review comment (Member):

*exception

Review reply (Contributor, Author):

Fixed

plt.xlim(-2, 2)
plt.ylim(-2, 2)

colors = np.array(list(islice(cycle('bgrcmyk'),
Review comment (Member):

Use ['navy', 'yellowgreen', 'gold', 'cornflowerblue'] to be color-blind compatible.
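
A hedged sketch of the suggested palette dropped into the example's existing cycle/islice pattern; the stand-in data and labels below are assumptions, with y_pred playing the role of the integer cluster labels from the loop:

from itertools import cycle, islice

import matplotlib.pyplot as plt
import numpy as np

# Stand-ins for the example's data and predicted labels (assumptions)
X = np.random.RandomState(0).randn(100, 2)
y_pred = np.random.RandomState(1).randint(0, 4, 100)

palette = ['navy', 'yellowgreen', 'gold', 'cornflowerblue']
# Repeat the palette until there is one color per cluster label
colors = np.array(list(islice(cycle(palette),
                              int(max(y_pred) + 1))))
plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[y_pred])
plt.show()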

Review reply (Contributor, Author):

That looks like this:

[image]

cool?

Review reply (Contributor, Author):

The latest commit makes the above plot.

@gte620v (Contributor, Author) commented Mar 13, 2017

Here is another option for color-blind-friendly colors:

[image]

@TomDLT your color cycle seems like it would be problematic for blue–yellow color blindness (this is what is currently in the PR). What do you think?

[image]

Here is the original for reference:

[image]

@TomDLT (Member) commented Mar 13, 2017

@TomDLT your color cycle seems like it would be problematic for blue–yellow color blindness (this is what is currently in the PR). What do you think?

I have no strong feelings about it. The colors I suggested are used in other scikit-learn examples, but if you think this palette is better, I am fine with it.

@gte620v (Contributor, Author) commented Mar 13, 2017

@TomDLT Ok, great. I picked the first colormap for the latest commit.

I think all is good to merge now.

@TomDLT changed the title from [MRG] [DOC] Adding GMM to plot_cluster_comparison.py to [MRG+1] [DOC] Adding GMM to plot_cluster_comparison.py on Mar 13, 2017
@gte620v (Contributor, Author) commented Mar 20, 2017

@jnothman @TomDLT @tguillemot
Are any other changes needed?

@jmschrei (Member)

This looks good to me, but I'm going to let one of the people who have been involved in the discussion for longer be the one to merge it. Thanks for the contribution and patience!

@TomDLT merged commit db37017 into scikit-learn:master on Mar 21, 2017
@TomDLT (Member) commented Mar 21, 2017

Thanks for the patience @gte620v !

@gte620v (Contributor, Author) commented Mar 21, 2017

Np. Thanks!!

@tguillemot (Contributor)

Thanks @gte620v

herilalaina pushed a commit to herilalaina/scikit-learn that referenced this pull request Mar 26, 2017
[MRG+1] [DOC] Adding GMM to plot_cluster_comparison.py (#6305)

* Adding GMM to plot_cluster_comparison.py and changing number of components in all algos to 3.

* adding two datasets to clustering comparision example

* Adding GMM to plot_cluster_comparison.py and changing number of components in all algos to 3.

* adding two datasets to clustering comparision example

* GMM example using GaussianMixture

* fixing lint errors; changing order of datasets in the columns so that no_structure is at the end.

* adding warning supression.

* fixing warning supression.

* hand-tuned cluster parameters

* moved list of algo names; cleaning up color cycling

* fixing islice stop to be an int

* change default to params, make plot color-blind compatible, fix spelling error

* new color palette that is more color-blind friendly
massich (Apr 26, 2017), Sundrique (Jun 14, 2017), NelleV (Aug 11, 2017), paulha (Aug 19, 2017), maskani-moh (Nov 15, 2017), and jwjohnson314 (Dec 18, 2017) also pushed commits to their forks referencing this pull request, with the same commit list as above.
8 participants