[MRG+1] [DOC] Adding GMM to plot_cluster_comparison.py #6305
Conversation
Adding GMM to plot_cluster_comparison.py and changing number of components in all algos to 3.
Looks great in general. The idea of setting the number of clusters to 2 was to explore the failure mode of clustering. But I have no strong feeling in this direction. If we keep it at two, we should at least explain in the docstring why we set it this way. WDYT?
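For context, the comparison treats GaussianMixture like any other clusterer: fit, then predict labels. A minimal sketch of that usage (not the PR's exact code, and assuming scikit-learn >= 0.18, where `sklearn.mixture.GaussianMixture` replaced the deprecated `GMM` class; the toy data here is illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Three well-separated blobs as a toy stand-in for the example's datasets
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# KMeans exposes labels_ after fitting
km = KMeans(n_clusters=3, random_state=0).fit(X)
km_labels = km.labels_

# GaussianMixture has no labels_ attribute; call predict() instead
gmm = GaussianMixture(n_components=3, covariance_type='full',
                      random_state=0).fit(X)
gmm_labels = gmm.predict(X)

print(len(np.unique(km_labels)), len(np.unique(gmm_labels)))
```

The `n_clusters` / `n_components` value is exactly the parameter whose mis-tuning (2 vs. 3) is being debated in this thread.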
@GaelVaroquaux I don't feel super strongly about it either. It just seems that all of these algos have at least one parameter that needs to be specified, and by intentionally mis-tuning that parameter for the algos that require it, we make them look worse than they are. Another possibility is to set the number of clusters higher than three to see the failure mode in that direction. Or, we could just make two or three plots: with n=2, n=3, and n=4 (this might be the best option). But yes, either way, I think an explanation would be good. Let me know what you think and I am happy to modify the PR.
Well, with 3 the first two plots are a bit more weird. And having three times as many plots doesn't seem very helpful to me. I feel that having the 2nd or 3rd of these datasets would be more interesting: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html For the current datasets, there is not that much of a difference between k-means and GMMs.
@amueller Good suggestion. Adding those two datasets definitely paints a broader picture of how each algo performs. Let me know what you think and I can push those changes.
+1 on adding additional datasets
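The two extra datasets from plot_kmeans_assumptions are the blobs with varied variances and the anisotropically transformed blobs. A hedged sketch of how they can be generated; the names `varied` and `aniso` match the snippet quoted later in this review, but the exact parameters here are illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs

n_samples = 1500

# Blobs with a different standard deviation per cluster
varied = make_blobs(n_samples=n_samples,
                    cluster_std=[1.0, 2.5, 0.5],
                    random_state=170)

# Anisotropic blobs: linearly transform standard isotropic blobs
X, y = make_blobs(n_samples=n_samples, random_state=170)
transformation = [[0.6, -0.6], [-0.4, 0.8]]
aniso = (np.dot(X, transformation), y)
```

Both are `(X, y)` tuples, so they can be appended to the example's `datasets` list alongside `noisy_circles`, `noisy_moons`, etc.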
@jakevdp @amueller @GaelVaroquaux I pushed the changes. The built output looks like this:
@tguillemot: I'll take a look.
Adding GMM to plot_cluster_comparison.py and changing number of components in all algos to 3.
@tguillemot I made the modification. It is ready to merge.
@jakevdp @amueller @GaelVaroquaux let me know if there is anything else I need to do to make this mergeable.
datasets = [noisy_circles, noisy_moons, blobs, no_structure]
# noisy_circles, noisy_moons, blobs, no_structure, varied
datasets = [noisy_circles, noisy_moons, blobs, no_structure, varied,aniso] |
Add a space before aniso
It's not a big deal, but I think it's better to put no_structure at the end. It's just to easily compare the results between the Gaussian datasets.
@gte620v Thanks for the modification. It's really a good idea to add GMM to that comparison. Thx
fixing lint errors; changing order of datasets in the columns so that no_structure is at the end.
@tguillemot Thanks for the review! I made the lint fixes and reorganized the order of the datasets. I am unsure about the remaining warnings. I can fix these, but it would significantly convolute this PR if I start mucking around with the core k-means code. I think it would be best to leave that to another PR. Let me know if you see another way of doing this.
Yes, it's caused by a deprecated function in numpy.
I am not sure what is causing those warnings; they also appear in this example as it currently stands.
I am afraid I do not know enough about the inner workings of the algos to fix this. Let me know if you have suggestions.
Thank you for trying to solve that.
Great; thanks.
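One way to keep the example output clean without touching the core code, as the "adding warning supression" commit does, is to silence the known warnings locally. An illustrative sketch; the specific warning categories here are assumptions, not necessarily the exact ones this example raises:

```python
import warnings
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, random_state=0)

# Suppress known-benign warnings only around the fit call,
# leaving the global warning filters untouched
with warnings.catch_warnings():
    warnings.filterwarnings('ignore', category=DeprecationWarning)
    warnings.filterwarnings('ignore', category=UserWarning)
    labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

print(labels.shape)
```

Scoping the filter with `catch_warnings()` means readers who run the example with other estimators still see any new warnings those raise.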
Thanks for looking! @tguillemot I made that change using islice. @amueller islice seems to work on 2.7, 3.4, and 3.5.
for i_dataset, (dataset, params) in enumerate(datasets):
    # update parameters with dataset-specific values
    defaults = default_base.copy()
    defaults.update(params)
Rather than creating a new defaults var you can use estimator.set_params, where estimator is your estimator.
In thinking about this, I can see how we might move the estimators out of the loop and then update the parameters with set_params, but I don't see how we get rid of the defaults variable without replacing it with something more verbose. Since not all keywords need to be changed for each dataset, I would have to add another bit of code that checks whether a keyword we need is in params for that dataset; if it isn't, I'd have to fall back to the default_base values. I think updating the keyword dict defaults, as we have, is the simplest way to have parameters that change per dataset-estimator pair.

The other option is to have params contain both the default_base and the dataset-specific values enumerated for each dataset, but then it is much harder to see where the dataset parameter excursions are.

Let me know if I am missing your point.
@tguillemot any thoughts?
I think it is ugly naming it defaults below when you mean params. Either use params.get('key', default_val) or call it params within the loop.
LGTM after you correct my comments.
@tguillemot I didn't follow your suggestion above. Can you provide clarification? Otherwise I think it is good to go. If we don't want to merge, that is cool too; we can close the PR.
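To make the pattern under discussion concrete: per-dataset overrides merged onto shared defaults. The names `default_base`, `params`, and `defaults` mirror the snippet quoted above; the keys and values are illustrative, not the example's real parameters:

```python
# Shared defaults for every dataset
default_base = {'n_clusters': 3, 'damping': 0.9}

# Each dataset carries only the keys it overrides
datasets = [
    ('noisy_circles', {'n_clusters': 2}),  # override one key
    ('blobs', {}),                         # keep all defaults
]

for name, params in datasets:
    # merge overrides onto a fresh copy of the defaults
    defaults = default_base.copy()
    defaults.update(params)

    # the reviewer's alternative: per-key lookup with a fallback
    n_clusters = params.get('n_clusters', default_base['n_clusters'])
    assert n_clusters == defaults['n_clusters']
    print(name, defaults)
```

The two approaches are equivalent here; the copy-and-update version keeps all resolved parameters in one dict, while `params.get(...)` avoids the extra variable at the cost of one lookup per key.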
create many clusters. Thus in this example its two parameters
(damping and per-point preference) were set to mitigate this
behavior.
but still in 2D. With the expection of the last dataset, |
*exception
Fixed
plt.xlim(-2, 2)
plt.ylim(-2, 2)
colors = np.array(list(islice(cycle('bgrcmyk'), |
Use ['navy', 'yellowgreen', 'gold', 'cornflowerblue'] to be color-blind compatible.
The latest commit makes the above plot.
Here is another option for color-blind-friendly colors: @TomDLT your color cycle seems like it would be problematic for blue–yellow color blindness (this is what is currently in the PR). What do you think? Here is the original for reference:
I have no strong feeling about it. The colors I suggested are used in other examples in scikit-learn, but if you think this one is better I am fine with it.
@TomDLT Ok, great. I picked the first colormap for the latest commit. I think all is good to merge now.
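The color handling in the snippet quoted above boils down to cycling a finite palette out to the number of predicted labels. A sketch using the palette suggested in this thread; `max_label` is a stand-in for the actual label count in the example:

```python
from itertools import cycle, islice

import numpy as np

# Color-blind-friendlier palette suggested in this review
palette = ['navy', 'yellowgreen', 'gold', 'cornflowerblue']

# Stand-in for e.g. int(max(y_pred)) + 1 in the real example;
# islice requires an integer stop, hence the int() cast mentioned
# in the commit "fixing islice stop to be an int"
max_label = 6

# Repeat the palette as needed so every label gets a color
colors = np.array(list(islice(cycle(palette), int(max_label))))
print(colors)
```

This yields one color string per label, wrapping around when there are more labels than palette entries, so it works for clusterers like DBSCAN whose label count is not fixed in advance.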
@jnothman @TomDLT @tguillemot
This looks good to me, but I'm going to let one of the people who have been involved in the discussion for longer be the one to merge it. Thanks for the contribution and patience!
Thanks for the patience @gte620v !
Np. Thanks!!
Thanks @gte620v
…6305)
* Adding GMM to plot_cluster_comparison.py and changing number of components in all algos to 3.
* adding two datasets to clustering comparison example
* Adding GMM to plot_cluster_comparison.py and changing number of components in all algos to 3.
* adding two datasets to clustering comparison example
* GMM example using GaussianMixture
* fixing lint errors; changing order of datasets in the columns so that no_structure is at the end
* adding warning suppression
* fixing warning suppression
* hand-tuned cluster parameters
* moved list of algo names; cleaning up color cycling
* fixing islice stop to be an int
* change default to params, make plot color-blind compatible, fix spelling error
* new color palette that is more color-blind friendly
Changes
Output
Once the docs are built, the example plot looks like this:
