[MRG] Add an example of inductive clustering #10852


Merged 6 commits on Jan 17, 2019
Conversation

@chkoar (Contributor) commented Mar 21, 2018:

Reference Issues/PRs

Resolves #4587. Continues #6478.

What does this implement/fix? Explain your changes.

This PR adds an example showing how to perform inductive inference on cluster membership by training a classifier on the cluster labels produced by a clusterer.
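For context, the technique can be sketched without scikit-learn. The `InductiveClusterer` shape mirrors the meta-estimator in the example; `ThresholdClusterer` and `NearestMeanClassifier` are illustrative toy stand-ins invented here, not part of the PR:

```python
from copy import deepcopy


class InductiveClusterer:
    """Meta-estimator: fit a clusterer, then train a classifier on the
    resulting cluster labels so new points can be assigned to clusters
    without re-clustering the whole dataset."""

    def __init__(self, clusterer, classifier):
        self.clusterer = clusterer
        self.classifier = classifier

    def fit(self, X):
        # Cluster the training data, then learn to predict the labels.
        self.clusterer_ = deepcopy(self.clusterer)
        self.classifier_ = deepcopy(self.classifier)
        labels = self.clusterer_.fit_predict(X)
        self.classifier_.fit(X, labels)
        return self

    def predict(self, X):
        # Inductive step: classify unseen points into existing clusters.
        return self.classifier_.predict(X)


class ThresholdClusterer:
    """Toy clusterer: 1-D points below 0 form cluster 0, the rest cluster 1."""
    def fit_predict(self, X):
        return [0 if x < 0 else 1 for x in X]


class NearestMeanClassifier:
    """Toy classifier: remember per-label means, predict the nearest mean."""
    def fit(self, X, y):
        sums, counts = {}, {}
        for x, label in zip(X, y):
            sums[label] = sums.get(label, 0.0) + x
            counts[label] = counts.get(label, 0) + 1
        self.means_ = {l: sums[l] / counts[l] for l in sums}
        return self

    def predict(self, X):
        return [min(self.means_, key=lambda l: abs(x - self.means_[l]))
                for x in X]


model = InductiveClusterer(ThresholdClusterer(), NearestMeanClassifier())
model.fit([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
print(model.predict([-2.5, 2.5]))  # new points assigned without re-clustering
```

In the actual example the stand-ins are `AgglomerativeClustering` and `RandomForestClassifier`; the wrapper's design is the same.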

@chkoar (Contributor, Author) commented Mar 21, 2018:

ping @jnothman

@qinhanmin2014 (Member) left a comment:

Please try to make Circle CI green first (you might refer to the Circle CI log and other examples).

"""
==============================================
Inductive Clustering
==============================================
Member:

A blank line here I guess?

Contributor Author:

Yeap, it might be that.

@chkoar chkoar changed the title Add an example of inductive clustering [MRG] Add an example of inductive clustering Apr 2, 2018
@TomDLT (Member) left a comment:

Thanks for this example.
I just have some nitpicks, especially concerning the bad plot rendering.

Once nitpicks are addressed, I am +1 to merge this

self.classifier_.fit(X, y)
return self

@if_delegate_has_method(delegate='classifier')
Member:

should be with an underscore: @if_delegate_has_method(delegate='classifier_')

Contributor Author:

Done

def predict(self, X):
return self.classifier_.predict(X)

@if_delegate_has_method(delegate='classifier')
Member:

same here

Contributor Author:

Done
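Why the trailing underscore matters: the decorator gates the method on an attribute that only exists after `fit`, so naming the unfitted `classifier` would check the wrong object. A stdlib-only sketch of the idea; this simplified decorator is a stand-in for scikit-learn's utility, not its actual implementation:

```python
def if_delegate_has_method(delegate):
    """Simplified stand-in: let the call through only when the named
    delegate attribute exists and provides a same-named method."""
    def decorator(fn):
        def wrapper(self, *args, **kwargs):
            sub = getattr(self, delegate)  # AttributeError before fit
            getattr(sub, fn.__name__)      # AttributeError if method missing
            return fn(self, *args, **kwargs)
        return wrapper
    return decorator


class ToyClassifier:
    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]


class Wrapper:
    def fit(self, X, y):
        # The fitted sub-estimator gets the conventional trailing underscore.
        self.classifier_ = ToyClassifier().fit(X, y)
        return self

    @if_delegate_has_method(delegate='classifier_')
    def predict(self, X):
        return self.classifier_.predict(X)


w = Wrapper()
try:
    w.predict([1])          # not fitted yet: no classifier_ attribute
except AttributeError:
    print("not fitted")
print(w.fit([1, 2], [0, 0]).predict([1]))
```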

return self.classifier_.decision_function(X)


def plot_scatter(X, color, alpha=0.5):
Member:

remove double whitespace

Contributor Author:

That was on purpose. Shouldn't we have two blank lines between top-level functions as per PEP 8?

plt.subplot(133)
plot_scatter(X, cluster_labels)
plot_scatter(X_new, probable_clusters)
plt.title("Inductive inference on cluster membership \n"
Member:

The title is too long for the plot.
Please make sure the final plot is readable with the chosen figure size.

Contributor Author:

I shortened the titles.

clusterer = AgglomerativeClustering(n_clusters=3)
cluster_labels = clusterer.fit_predict(X)

plt.subplot(131)
Member:

Please add before this line plt.figure(figsize=(12, 4)) to specify a figure size.
(and make sure the size is good)

Contributor Author:

It did the trick!

@chkoar (Contributor, Author) commented Apr 15, 2018:

@TomDLT thanks for the feedback. I believe the plot rendering looks OK now. The plot titles could be better, though.

# Declare the inductive learning model that will be used to
# predict cluster membership for unknown instances
classifier = RandomForestClassifier(random_state=RANDOM_STATE)
inductiveLearner = InductiveClusterer(clusterer, classifier).fit(X)
Member:

Please use underscores, not camelCase for local variables

Contributor Author:

Of course



# Generate new samples and plot them along with the original dataset
X_new, y_new = make_blobs(n_samples=10,
Member:

Hmm... Should we be drawing samples from a completely different distribution, rather than drawing a test set from the same generation procedure (or even real-world data)?

Contributor Author:

I don't have a strong opinion about that. I think the intent of the example comes through clearly. If you want something more specific, I am all ears.

Clustering is expensive, especially when our dataset contains millions of
datapoints. Recomputing the clusters every time we receive new data
is thus, in many cases, intractable. With more data, there is also the
possibility of degrading the previous clustering.
Member:

I think there is less of an issue with degrading, than with identifying the clusters across two clusterings.

For that reason and others, this kind of technique is interesting regardless of the size of the dataset. An algorithm like agglomerative clustering or dbscan makes no hypothesis about how to divide the data in terms of features. Learning a classifier may also help us make inferences about the nature of the clustering. For this reason, I think we should aim to plot the decision boundary in the plot below

Contributor Author:

> For that reason and others, this kind of technique is interesting regardless of the size of the dataset.

@jnothman I agree. I kept the docstring from the original PR.

> An algorithm like agglomerative clustering or dbscan makes no hypothesis about how to divide the data in terms of features. Learning a classifier may also help us make inferences about the nature of the clustering. For this reason, I think we should aim to plot the decision boundary in the plot below

I agree again. I have plotted the decision regions in the third plot. What do you think?
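For reference, decision regions are typically drawn by evaluating the fitted classifier on a dense grid and coloring each cell by its predicted label (which is what `plt.contourf` does in the example). A plotting-free sketch of that idea, with a hypothetical nearest-centroid classifier standing in for the example's `RandomForestClassifier`:

```python
# Hypothetical 2-D nearest-centroid classifier standing in for the
# RandomForestClassifier trained on cluster labels in the example.
centroids = {0: (-2.0, 0.0), 1: (2.0, 0.0)}


def predict(point):
    px, py = point
    return min(centroids,
               key=lambda c: (px - centroids[c][0]) ** 2
                             + (py - centroids[c][1]) ** 2)


# Evaluate the classifier on a coarse grid: each cell gets the label of
# the region it falls in -- the same idea plt.contourf visualizes.
xs = [-3.0 + 0.5 * i for i in range(13)]   # x coordinates from -3 to 3
region_row = [predict((x, 0.0)) for x in xs]
print(region_row)
```

In the real example the grid is dense and 2-D, and the labels come from the fitted inductive model rather than hand-set centroids.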

Member:

The decision regions are helpful. Please also update the description to better match real use cases.

@jnothman (Member):

One problem we have here is that if_delegate_has_method is a bit obscure, and currently is not hyperlinked to its documentation because no such documentation page exists. If we use if_delegate_has_method in the example, we must also make sure it's present in doc/modules/classes.rst.

Broadly, I like this example, I just wish it was clearer on the inferential value of such an approach, not merely its application to new data.

@jnothman (Member):

I've decided this is a nice example of both the technique and meta-estimator design, and would like to merge it.

@chkoar (Contributor, Author) commented Jan 17, 2019:

@jnothman great!

@jnothman jnothman merged commit d2a77d7 into scikit-learn:master Jan 17, 2019
@jnothman (Member):

Thanks @chkoar! Nice to clear out some cobwebs.

thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 7, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Successfully merging this pull request may close these issues.

Add example of inductive clustering
4 participants