[MRG] Add an example of inductive clustering #10852
Conversation
ping @jnothman
Please try to make Circle CI green first (you might refer to the Circle CI log and to other examples).
""" | ||
============================================== | ||
Inductive Clustering | ||
============================================== |
A blank line here I guess?
Yeap, it might be that.
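If the failure is indeed a missing blank line, a minimal sketch of the expected sphinx-gallery docstring header might look like this (the description text is a placeholder):

"""
==============================================
Inductive Clustering
==============================================

A short description of the example goes here; note the blank line
between the title block and the description.
"""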
Thanks for this example.
I just have some nitpicks, especially concerning the bad plot rendering.
Once the nitpicks are addressed, I am +1 to merge this.
    self.classifier_.fit(X, y)
    return self

@if_delegate_has_method(delegate='classifier')
This should use an underscore: @if_delegate_has_method(delegate='classifier_')
Done
def predict(self, X):
    return self.classifier_.predict(X)

@if_delegate_has_method(delegate='classifier')
same here
Done
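For reference, a minimal sketch of the corrected delegation, assuming fit (not shown) stores the trained classifier as classifier_:

from sklearn.utils.metaestimators import if_delegate_has_method


class InductiveClusterer:
    ...

    @if_delegate_has_method(delegate='classifier_')
    def predict(self, X):
        # Delegate prediction to the fitted classifier
        return self.classifier_.predict(X)

    @if_delegate_has_method(delegate='classifier_')
    def decision_function(self, X):
        # Expose the underlying classifier's decision function
        return self.classifier_.decision_function(X)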
    return self.classifier_.decision_function(X)


def plot_scatter(X, color, alpha=0.5):
remove double whitespace
That was on purpose. Shouldn't we have two blank lines between two functions, as per PEP 8?
plt.subplot(133)
plot_scatter(X, cluster_labels)
plot_scatter(X_new, probable_clusters)
plt.title("Inductive inference on cluster membership \n"
The title is too long for the plot.
Please make sure the final plot is readable with the chosen figure size.
I shortened the titles.
clusterer = AgglomerativeClustering(n_clusters=3)
cluster_labels = clusterer.fit_predict(X)

plt.subplot(131)
Please add plt.figure(figsize=(12, 4)) before this line to specify a figure size (and make sure the size is good).
It did the trick!
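For context, a sketch of the suggested change; the size values are the ones proposed above:

import matplotlib.pyplot as plt

# Set the figure size before creating the first subplot
plt.figure(figsize=(12, 4))
plt.subplot(131)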
# Declare the inductive learning model that will be used to
# predict cluster membership for unknown instances
classifier = RandomForestClassifier(random_state=RANDOM_STATE)
inductiveLearner = InductiveClusterer(clusterer, classifier).fit(X)
Please use underscores, not camelCase, for local variables.
Of course
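For reference, the local variable renamed to snake_case (clusterer, X, and InductiveClusterer are assumed to come from the example; the RANDOM_STATE value is an assumption):

from sklearn.ensemble import RandomForestClassifier

RANDOM_STATE = 42  # assumed value for the example's constant

classifier = RandomForestClassifier(random_state=RANDOM_STATE)
# snake_case instead of camelCase for the local variable
inductive_learner = InductiveClusterer(clusterer, classifier).fit(X)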
# Generate new samples and plot them along with the original dataset
X_new, y_new = make_blobs(n_samples=10,
Hmm... Should we be drawing samples from a completely different distribution, rather than drawing a test set from the same generation procedure (or even real-world data)?
I don't have a strong opinion about that. I think the intention of the example comes across clearly. If you want something more specific, I am all ears.
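One way to draw the test set from the same generating procedure, as a sketch, would be to fix the blob centers and vary only the seed (the center coordinates and sample counts here are assumptions, not necessarily the example's values):

import numpy as np
from sklearn.datasets import make_blobs

# Fixed centers so both draws come from the same distribution
centers = np.array([[0, 0], [5, 5], [0, 5]])
X, y = make_blobs(n_samples=500, centers=centers, cluster_std=1.0,
                  random_state=42)
X_new, y_new = make_blobs(n_samples=10, centers=centers, cluster_std=1.0,
                          random_state=7)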
Clustering is expensive, especially when our dataset contains millions of
datapoints. Recomputing the clusters every time we receive some new data
is thus, in many cases, intractable. With more data, there is also the
possibility of degrading the previous clustering.
I think there is less of an issue with degrading than with identifying the clusters across two clusterings.
For that reason and others, this kind of technique is interesting regardless of the size of the dataset. An algorithm like agglomerative clustering or dbscan makes no hypothesis about how to divide the data in terms of features. Learning a classifier may also help us make inferences about the nature of the clustering. For this reason, I think we should aim to plot the decision boundary in the plot below
For that reason and others, this kind of technique is interesting regardless of the size of the dataset.
@jnothman I agree. I kept the docstring from the original PR.
An algorithm like agglomerative clustering or dbscan makes no hypothesis about how to divide the data in terms of features. Learning a classifier may also help us make inferences about the nature of the clustering. For this reason, I think we should aim to plot the decision boundary in the plot below
I agree again. I have plotted the decision regions in the third plot. What do you think?
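For the record, a rough sketch of how such decision regions can be drawn behind the third panel, assuming inductive_learner is the fitted meta-estimator and X is the training data (the grid step and alpha are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

# Evaluate the induced classifier on a grid covering the data range
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = inductive_learner.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Shade the decision regions behind the scatter points
plt.contourf(xx, yy, Z, alpha=0.4)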
The decision regions are helpful. Please also update the description to better match real use cases.
Broadly, I like this example; I just wish it were clearer on the inferential value of such an approach, not merely its application to new data.
I've decided this is a nice example of both the technique and meta-estimator design, and would like to merge it.
@jnothman great!
Thanks @chkoar! Nice to clear out some cobwebs.
This reverts commit 534090c.
Reference Issues/PRs
Resolves #4587. Continues #6478.
What does this implement/fix? Explain your changes.
This PR adds an example of how to implement and perform inductive inference on cluster membership, using a classifier trained on the cluster labels.
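As a rough end-to-end sketch of the idea (class and parameter details may differ from the merged example):

from sklearn.base import BaseEstimator, clone
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier


class InductiveClusterer(BaseEstimator):
    def __init__(self, clusterer, classifier):
        self.clusterer = clusterer
        self.classifier = classifier

    def fit(self, X, y=None):
        # Cluster the training data, then fit a classifier on the labels
        self.clusterer_ = clone(self.clusterer)
        self.classifier_ = clone(self.classifier)
        labels = self.clusterer_.fit_predict(X)
        self.classifier_.fit(X, labels)
        return self

    def predict(self, X):
        # Induce cluster membership for unseen points
        return self.classifier_.predict(X)


X, _ = make_blobs(n_samples=500, centers=3, random_state=42)
model = InductiveClusterer(
    AgglomerativeClustering(n_clusters=3),
    RandomForestClassifier(random_state=42),
).fit(X)
X_new, _ = make_blobs(n_samples=10, centers=3, random_state=7)
print(model.predict(X_new))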