
Pipeline on labels_ instead of just transforms #4543


Closed
selwyth opened this issue Apr 8, 2015 · 15 comments

Comments

@selwyth

selwyth commented Apr 8, 2015

My current use case is to cluster a bunch of data to generate labels, then use a classifier to draw decision boundaries and make label predictions on new data.

I love using the pipeline and grid-search packages in tandem, but can't use them here because clustering does not have a transform method for X. Is it worthwhile to make pipeline components chainable by the labels_ attribute in addition to transform?

@jnothman
Member

jnothman commented Apr 8, 2015

Some clusterers can be used as transformers, e.g. centroid distance in KMeans. Can you be more explicit as to what the transformed feature space looks like? In any case, this functionality can be provided by inheriting from an existing clusterer and a mixin that provides transform.


@amueller
Member

amueller commented Apr 8, 2015

I'm not entirely sure I understand the question. Do you want labels_ to be the X or the y? You want it to be the labels, right? Then this is a duplicate of #4143, right?

@selwyth
Author

selwyth commented Apr 8, 2015

Yes, I want labels_ to be y. It's very similar to #4143 -- that one proposes transforms to an existing y; I'm thinking of generating a y where there wasn't one previously, then sending it down the pipeline.

To be more explicit: say I want to classify the iris dataset but don't have labels to train on. I would like to cluster the data (say using KMeans), then use the labels_ as the y for a classifier (say kNN) to predict on out-of-sample data. I would like to put these in a pipeline so I can tune the hyperparameters of both steps (e.g. both n_clusters and n_neighbors). Basically, I'm wondering whether it's worthwhile to be able to pipeline the conversion of an unsupervised learning problem into a supervised one. Hope that helps.
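
For concreteness, here's the manual two-step version of what I'd like the pipeline to express (a rough sketch; the dataset, estimators and parameter values are just illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X = load_iris().data  # pretend the labels don't exist

km = KMeans(n_clusters=3).fit(X)  # step 1: generate labels_
knn = KNeighborsClassifier(n_neighbors=5).fit(X, km.labels_)  # step 2: learn decision boundaries

knn.predict(X[:5])  # in practice this would be new, out-of-sample data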

@amueller
Member

amueller commented Apr 8, 2015

I can see what you're trying to do, and it will probably be possible soonish.
I don't know if it makes sense, though: if you evaluate on the final classification, your result will be best when the clustering puts all points in the same class, since that makes the classifier's task trivial.

@jnothman
Member

jnothman commented Apr 9, 2015

Right. I get what you're doing now. It's not hard to construct a meta-estimator that does this, and I think it's a fairly common way of making clusterers inductive.

I've not tested it, but:

from sklearn.base import clone, BaseEstimator
from sklearn.utils.metaestimators import if_delegate_has_method

class InductiveClusterer(BaseEstimator):
    def __init__(self, clusterer, classifier):
        self.clusterer = clusterer
        self.classifier = classifier

    def fit(self, X, y=None):
        self.clusterer_ = clone(self.clusterer)
        self.classifier_ = clone(self.classifier)
        # use the cluster assignments as pseudo-labels and train the
        # classifier on them, making the clustering inductive
        y = self.clusterer_.fit_predict(X)
        self.classifier_.fit(X, y)
        return self

    # delegate prediction methods only when the wrapped classifier
    # actually provides them
    @if_delegate_has_method(delegate='classifier_')
    def predict(self, X):
        return self.classifier_.predict(X)

    @if_delegate_has_method(delegate='classifier_')
    def decision_function(self, X):
        return self.classifier_.decision_function(X)

    # etc...
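
Usage would then be along the lines of (also untested; the estimators, and X / X_new standing for training and new data, are just examples):

from sklearn.cluster import KMeans
from sklearn.svm import SVC

ind = InductiveClusterer(KMeans(n_clusters=8), SVC()).fit(X)
ind.predict(X_new)            # cluster labels for unseen points
ind.decision_function(X_new)  # delegated because SVC provides it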

@amueller

> If you evaluate on the final classification, your result will be best if you clustered all points in the same class, as that makes the task for the classifier trivial.

Not if it's a clustering evaluation...?

@jnothman
Member

jnothman commented Apr 9, 2015

PS: with clusterer=DBSCAN() and classifier=KNeighborsClassifier() you can set parameters such as classifier__n_neighbors=3 and clusterer__eps=.5.
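
For instance (values purely illustrative):

from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

ind = InductiveClusterer(DBSCAN(), KNeighborsClassifier())
ind.set_params(clusterer__eps=.5, classifier__n_neighbors=3)

These are the same names a grid search would use in its param_grid.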

@jnothman
Member

jnothman commented Apr 9, 2015

Also, @amueller, this could be used to learn a transformed feature space, an L1 feature selector, etc., informed by the clustering, rather than merely for the purpose of clustering.

@jnothman
Member

jnothman commented Apr 9, 2015

@selwyth, if you find the above code snippet useful, I think it would make a helpful example for sklearn's documentation. Feel free to flesh out an illustrative example of the technique using a shared dataset and submit a PR.

@selwyth
Author

selwyth commented Apr 9, 2015

@amueller That's a good point. Perhaps we'd have to warn the user to limit searches to higher n_clusters, or use inertia_ (at least in the case of KMeans) as a penalty term to discourage low numbers of clusters that trivialize the classifier's task.

@jnothman Thank you for the snippet. I understand it; I'll study it some more, think through the other methods in the 'etc', and test it before submitting a PR documenting this example. I forgot to mention that the feature space differs between the two steps in my example, but I can see how to work that into your meta-estimator.

@amueller
Member

amueller commented Apr 9, 2015

Ah, I misunderstood the setting a bit. If you evaluate with an unsupervised clustering metric in the end, it makes sense.
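
For example, a sketch of such a scorer, using silhouette as the unsupervised metric (the function name is made up; the signature is the callable form GridSearchCV accepts for scoring):

from sklearn.metrics import silhouette_score

def unsupervised_score(estimator, X, y=None):
    # score the induced labelling on held-out data; no ground truth needed
    labels = estimator.predict(X)
    return silhouette_score(X, labels)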

@amueller
Member

amueller commented Apr 9, 2015

I guess the point is then to add a predict to a clustering algorithm that otherwise wouldn't have one. That makes sense.

@jnothman
Member

> Forgot to mention the feature space is different for both steps in my example, but I see how to easily work that into your meta-estimator.

Well, for the feature space to be different, you'll almost certainly need custom code. You could, however, achieve it with both the clusterer and the classifier being a Pipeline incorporating feature extraction / transformation. Currently, though, Pipeline doesn't support fit_predict; new issue at #4572.
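
Once that's in, I imagine something like this sketch (the particular transformers are only examples):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.feature_selection import SelectKBest
from sklearn.neighbors import KNeighborsClassifier

ind = InductiveClusterer(
    clusterer=Pipeline([('scale', StandardScaler()), ('cluster', DBSCAN())]),
    classifier=Pipeline([('select', SelectKBest(k=2)),
                         ('knn', KNeighborsClassifier())]),
)

That way each component learns in its own feature space.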

@jnothman
Member

This issue helped identify #4572, which has since been fixed! Otherwise, I don't think we need heavy internal changes to support this case; rather, we should add the likes of InductiveClusterer either to the project or as an example. If you agree, @amueller, we should open specific issues for whichever of those we deem appropriate, and close this one.

@jnothman
Member

Ah, rereading, I see that @selwyth intends to submit a PR. I look forward to it. Regarding "think through the other methods in the 'etc'", see the set of methods delegated by Pipeline or BaseSearchCV.

@amueller
Member

Feel free to close.
