[MRG] Add an example of inductive clustering #10852
Changes from all commits: 2ff7a38, 3bdaacb, a7e0f36, d9507b9, 7ae89c8, 090bbfe
@@ -0,0 +1,118 @@
""" | ||
============================================== | ||
Inductive Clustering | ||
============================================== | ||
|
||
Clustering can be expensive, especially when our dataset contains millions | ||
of datapoints. Many clustering algorithms are not :term:`inductive` and so | ||
cannot be directly applied to new data samples without recomputing the | ||
clustering, which may be intractable. Instead, we can use clustering to then | ||
learn an inductive model with a classifier, which has several benefits: | ||
|
||
- it allows the clusters to scale and apply to new data | ||
- unlike re-fitting the clusters to new samples, it makes sure the labelling | ||
procedure is consistent over time | ||
- it allows us to use the inferential capabilities of the classifier to | ||
describe or explain the clusters | ||
|
||
This example illustrates a generic implementation of a meta-estimator which | ||
extends clustering by inducing a classifier from the cluster labels. | ||
""" | ||
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, clone
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.metaestimators import if_delegate_has_method


N_SAMPLES = 5000
RANDOM_STATE = 42
class InductiveClusterer(BaseEstimator):
    def __init__(self, clusterer, classifier):
        self.clusterer = clusterer
        self.classifier = classifier

    def fit(self, X, y=None):
        # Clone the estimators so that repeated calls to fit do not
        # modify the objects passed to __init__
        self.clusterer_ = clone(self.clusterer)
        self.classifier_ = clone(self.classifier)
        y = self.clusterer_.fit_predict(X)
        self.classifier_.fit(X, y)
        return self

    # predict and decision_function are only exposed when the wrapped
    # classifier provides the corresponding method
    @if_delegate_has_method(delegate='classifier_')
    def predict(self, X):
        return self.classifier_.predict(X)

    @if_delegate_has_method(delegate='classifier_')
    def decision_function(self, X):
        return self.classifier_.decision_function(X)


def plot_scatter(X, color, alpha=0.5):
    return plt.scatter(X[:, 0],
                       X[:, 1],
                       c=color,
                       alpha=alpha,
                       edgecolor='k')
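The meta-estimator is generic: any clusterer with fit_predict and any
classifier can be plugged in. A hypothetical pairing (not part of the PR)
showing why the conditional delegation matters, since LogisticRegression
exposes decision_function while RandomForestClassifier does not:

# Hypothetical pairing: KMeans labels induced into a LogisticRegression.
# decision_function surfaces here because the classifier provides it.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X_demo, _ = make_blobs(n_samples=100, centers=3, random_state=0)
demo = InductiveClusterer(KMeans(n_clusters=3, random_state=0),
                          LogisticRegression()).fit(X_demo)
print(demo.predict(X_demo[:5]))
print(demo.decision_function(X_demo[:5]).shape)  # (5, 3)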
# Generate some training data from clustering
X, y = make_blobs(n_samples=N_SAMPLES,
                  cluster_std=[1.0, 1.0, 0.5],
                  centers=[(-5, -5), (0, 0), (5, 5)],
                  random_state=RANDOM_STATE)


# Train a clustering algorithm on the training data and get the cluster labels
clusterer = AgglomerativeClustering(n_clusters=3)
cluster_labels = clusterer.fit_predict(X)

plt.figure(figsize=(12, 4))

plt.subplot(131)
Review comment: Please add before this line
Reply: It did the trick!
plot_scatter(X, cluster_labels)
plt.title("Ward Linkage")
# Generate new samples and plot them along with the original dataset
X_new, y_new = make_blobs(n_samples=10,
                          centers=[(-7, -1), (-2, 4), (3, 6)],
                          random_state=RANDOM_STATE)

Review comment: Hmm... Should we be drawing samples from a completely different distribution, rather than drawing a test set from the same generation procedure (or even real-world data)?
Reply: I don't have a strong opinion about that. I think the intention of the example is clearly conveyed. If you want something more specific, I am all ears.
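A sketch of the reviewer's alternative, drawing the held-out points from the
same generative procedure as the training data (hypothetical variant, not
part of the PR; only the seed changes):

# Hypothetical variant: sample the test set from the same blob centers as
# the training data, with a different random seed.
X_same, y_same = make_blobs(n_samples=10,
                            cluster_std=[1.0, 1.0, 0.5],
                            centers=[(-5, -5), (0, 0), (5, 5)],
                            random_state=RANDOM_STATE + 1)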
plt.subplot(132)
plot_scatter(X, cluster_labels)
plot_scatter(X_new, 'black', 1)
plt.title("Unknown instances")
# Declare the inductive learning model that will be used to
# predict cluster membership for unknown instances
classifier = RandomForestClassifier(random_state=RANDOM_STATE)
inductive_learner = InductiveClusterer(clusterer, classifier).fit(X)

probable_clusters = inductive_learner.predict(X_new)
plt.subplot(133)
plot_scatter(X, cluster_labels)
plot_scatter(X_new, probable_clusters)

# Plotting decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

Z = inductive_learner.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.4)
plt.title("Classify unknown instances")

plt.show()
Review comment: remove double whitespace
Reply: That was on purpose. Shouldn't we have two blank lines between top-level functions, as per PEP 8?
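For readers on newer scikit-learn: if_delegate_has_method was later
deprecated in favour of available_if. A sketch of the same conditional
delegation under that API (assumes scikit-learn >= 1.0; the helper name
_classifier_has is illustrative, not part of this PR):

# Sketch assuming scikit-learn >= 1.0, where available_if replaces
# if_delegate_has_method. The check runs on attribute access.
from sklearn.base import BaseEstimator, clone
from sklearn.utils.metaestimators import available_if


def _classifier_has(attr):
    # Illustrative helper: delegate only if the fitted classifier
    # actually provides the requested method.
    def check(self):
        return hasattr(self, "classifier_") and hasattr(self.classifier_, attr)
    return check


class InductiveClusterer(BaseEstimator):
    def __init__(self, clusterer, classifier):
        self.clusterer = clusterer
        self.classifier = classifier

    def fit(self, X, y=None):
        self.clusterer_ = clone(self.clusterer)
        self.classifier_ = clone(self.classifier)
        self.classifier_.fit(X, self.clusterer_.fit_predict(X))
        return self

    @available_if(_classifier_has("predict"))
    def predict(self, X):
        return self.classifier_.predict(X)

    @available_if(_classifier_has("decision_function"))
    def decision_function(self, X):
        return self.classifier_.decision_function(X)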