
Pipeline on labels_ instead of just transforms #4543


Closed
selwyth opened this issue Apr 8, 2015 · 15 comments

Comments

@selwyth

selwyth commented Apr 8, 2015

My current use case is to cluster a bunch of data to generate labels, then use a classifier to draw decision boundaries and make label predictions on new data.

I love using the pipeline and grid-search packages in tandem, but can't use them here because clustering does not have a transform method for X. Is it worthwhile to make pipeline components chainable by the labels_ attribute in addition to transform?

@jnothman
Member

jnothman commented Apr 8, 2015

Some clusterers can be used as transformers, e.g. centroid distance in KMeans. Can you be more explicit as to what the transformed feature space looks like? In any case, this functionality can be provided by inheriting from an existing clusterer and a mixin that provides transform.


@amueller
Member

amueller commented Apr 8, 2015

I'm not entirely sure I understand the question. Do you want labels_ to be the X or the y? You want it to be the labels, right? Then this is a duplicate of #4143, right?

@selwyth
Author

selwyth commented Apr 8, 2015

Yes, I want labels_ to be y. It's very similar to #4143 -- that one proposes transforms to an existing y; I'm thinking of generating a y where there wasn't one previously, then sending it down the pipeline.

To be more explicit: say I want to classify the iris dataset but don't have labels to train on. I would like to cluster the data (say using KMeans), then use the labels_ as the y for a classifier (say kNN) to predict on out-of-sample data. I would like to put these in a pipeline so I can tune the hyperparameters of both steps (e.g. both n_clusters and n_neighbors). Basically, I'm wondering whether it's worthwhile to be able to pipeline the conversion of an unsupervised learning problem into a supervised one. Hope that helps.
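
For concreteness, here's the manual two-step version of what I'd like the pipeline to express (a rough sketch; the dataset, estimators and parameter values are just illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X = load_iris().data  # pretend the labels don't exist

km = KMeans(n_clusters=3).fit(X)  # step 1: generate labels_
knn = KNeighborsClassifier(n_neighbors=5).fit(X, km.labels_)  # step 2: learn decision boundaries

knn.predict(X[:5])  # in practice this would be new, out-of-sample data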

@amueller
Member

amueller commented Apr 8, 2015

I can see what you're trying to do, and it will probably be possible soonish.
I don't know if it makes sense, though: if you evaluate on the final classification, your result will be best when the clustering puts all points in the same class, since that makes the classifier's task trivial.

@jnothman
Member

jnothman commented Apr 9, 2015

Right. I get what you're doing now. It's not hard to construct a meta-estimator that does this, and I think it's a fairly common way of making clusterers inductive.

I've not tested it, but:

from sklearn.base import clone, BaseEstimator
from sklearn.utils.metaestimators import if_delegate_has_method

class InductiveClusterer(BaseEstimator):
    def __init__(self, clusterer, classifier):
        self.clusterer = clusterer
        self.classifier = classifier

    def fit(self, X, y=None):
        self.clusterer_ = clone(self.clusterer)
        self.classifier_ = clone(self.classifier)
        # use the cluster assignments as pseudo-labels and train the
        # classifier on them, making the clustering inductive
        y = self.clusterer_.fit_predict(X)
        self.classifier_.fit(X, y)
        return self

    # delegate prediction methods only when the wrapped classifier
    # actually provides them
    @if_delegate_has_method(delegate='classifier_')
    def predict(self, X):
        return self.classifier_.predict(X)

    @if_delegate_has_method(delegate='classifier_')
    def decision_function(self, X):
        return self.classifier_.decision_function(X)

    # etc...
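
Usage would then be along the lines of (also untested; the estimators, and X / X_new standing for training and new data, are just examples):

from sklearn.cluster import KMeans
from sklearn.svm import SVC

ind = InductiveClusterer(KMeans(n_clusters=8), SVC()).fit(X)
ind.predict(X_new)            # cluster labels for unseen points
ind.decision_function(X_new)  # delegated because SVC provides it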

@amueller

> If you evaluate on the final classification, your result will be best if you clustered all points in the same class, as that makes the task for the classifier trivial.

Not if it's a clustering evaluation...?

@jnothman
Member

jnothman commented Apr 9, 2015

PS: with clusterer=DBSCAN() and classifier=KNeighborsClassifier() you can set parameters such as classifier__n_neighbors=3 and clusterer__eps=.5.
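
For instance (values purely illustrative):

from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

ind = InductiveClusterer(DBSCAN(), KNeighborsClassifier())
ind.set_params(clusterer__eps=.5, classifier__n_neighbors=3)

These are the same names a grid search would use in its param_grid.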

@jnothman
Member

jnothman commented Apr 9, 2015

Also, @amueller, this could be used to learn a transformed feature space, an L1 feature selector, etc., informed by the clustering, rather than merely for the purpose of clustering.

@jnothman
Member

jnothman commented Apr 9, 2015

@selwyth, if you find the above code snippet useful, I think it would make a helpful example for sklearn's documentation. Feel free to flesh out an illustrative example of the technique using a shared dataset and submit a PR.

@selwyth
Author

selwyth commented Apr 9, 2015

@amueller That's a good point. Perhaps we'd have to warn the user to limit searches to higher n_clusters, or use inertia_ (at least in the case of KMeans) as a penalty term to discourage low numbers of clusters that trivialize the classifier's task.

@jnothman Thank you for the snippet. I understand it; I'll study it some more, think through the other methods in the 'etc', and test it before submitting a PR documenting this example. I forgot to mention that the feature space differs between the two steps in my example, but I can see how to work that into your meta-estimator.

@amueller
Member

amueller commented Apr 9, 2015

Ah, I misunderstood the setting a bit. If you evaluate with an unsupervised clustering metric in the end, it makes sense.
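
For example, a sketch of such a scorer, using silhouette as the unsupervised metric (the function name is made up; the signature is the callable form GridSearchCV accepts for scoring):

from sklearn.metrics import silhouette_score

def unsupervised_score(estimator, X, y=None):
    # score the induced labelling on held-out data; no ground truth needed
    labels = estimator.predict(X)
    return silhouette_score(X, labels)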

@amueller
Member

amueller commented Apr 9, 2015

I guess the point is then to add a predict to a clustering algorithm that otherwise wouldn't have one. That makes sense.

@jnothman
Member

> Forgot to mention the feature space is different for both steps in my example, but I see how to easily work that into your meta-estimator.

Well, for the feature space to be different, you'll almost certainly need custom code. You could, however, achieve it with both the clusterer and the classifier being a Pipeline incorporating feature extraction / transformation. Currently, though, Pipeline doesn't support fit_predict; new issue at #4572.
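
Once that's in, I imagine something like this sketch (the particular transformers are only examples):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.feature_selection import SelectKBest
from sklearn.neighbors import KNeighborsClassifier

ind = InductiveClusterer(
    clusterer=Pipeline([('scale', StandardScaler()), ('cluster', DBSCAN())]),
    classifier=Pipeline([('select', SelectKBest(k=2)),
                         ('knn', KNeighborsClassifier())]),
)

That way each component learns in its own feature space.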

@jnothman
Member

This issue helped identify #4572, which has since been fixed! Otherwise, I don't think we need heavy internal changes to support this case; rather, we should add the likes of InductiveClusterer either to the project or as an example. If you agree, @amueller, we should open specific issues for whichever of those we deem appropriate, and close this one.

@jnothman
Member

Ah, rereading, I see that @selwyth intends to submit a PR. I look forward to it. Regarding "think through the other methods in the 'etc'", see the set of methods delegated by Pipeline or BaseSearchCV.

@amueller
Member

Feel free to close.
