I took a look at the ICML'07 paper (Raina et al.) introducing this term. I assume you are interested in implementing the specific technique they introduce (or some variant on it), rather than the broader class of solutions to the problem they pose.

Although it is not a constraint of their general problem formulation, their technique more-or-less involves fitting a transformer on a lot of unlabelled data, then applying the transformation before classification. So it merely comes down to something like:

class SelfTaughtLearner(BaseEstimator):
    def __init__(transformer, estimator):
        ...

    def fit(self, X, y):
        mask = y == -1
        self.transformer.fit(X[safe_mask(X, mask)])
        Xt = self.transformer.transform(X[safe_mask(X, ~mask)])
        self.estimator.fit(Xt, y[safe_mask(y, ~mask)])
        return self

    def predict(self, X):
        Xt = self.transformer.transform(X)
        return self.estimator.predict(Xt)

I note that this would be a nice framework for many scikit-learn dimensionality reduction (including feature agglomeration) techniques.

(Presumably, this should include support for out-of-core learning of the transformer, as there can be lots of unlabelled data. One annoyance of the current semi-supervised API is that selecting portions where y == -1 necessarily involves a copy, hence invalidating use of memmaps to avoid the out-of-core problem. If we required unlabelled portions to be at the beginning/end of the data, we could slice without copy.)

Uh oh!

Meta-estimator for semi-supervised learning #1243

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions