
WIP: Semisupervised Naive Bayes using Expectation Maximization #430


Closed
wants to merge 20 commits

Conversation

larsmans
Member

@larsmans larsmans commented Nov 7, 2011

Here's the EM algorithm for semisupervised Naive Bayes. The implementation checks for convergence based on the coefficients, following the advice of Bishop (2006), so it could be used more generally for linear-classifier self-training; however, I only implemented the necessary machinery (fitting on a 1-of-K vector) on the discrete NB estimators, and we might want to switch to log-likelihood-based convergence checking later.

Narrative documentation follows if there's interest in this pull request. I've adapted the document classification example script into a new, semisupervised example. We might also merge these scripts.
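For the curious, the loop looks roughly like the sketch below. It is a simplification: it assigns hard labels weighted by their predicted probability (self-training style), because the public fit API only accepts hard labels, whereas the actual EMNB fits on soft 1-of-K membership vectors through _fit1ofK. The helper name self_train_nb is made up for illustration.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train_nb(X, y, n_iter=10, tol=1e-3, alpha=0.01):
    # y == -1 marks unlabeled samples; assumes every class occurs among
    # the labeled samples and labels are integers.
    y = np.asarray(y)
    labeled = y != -1
    clf = MultinomialNB(alpha=alpha)
    clf.fit(X[labeled], y[labeled])              # initial supervised model
    for _ in range(n_iter):
        old = clf.feature_log_prob_.copy()
        proba = clf.predict_proba(X)             # E-step: class memberships
        y_hat = clf.classes_[proba.argmax(axis=1)]
        weight = proba.max(axis=1)
        y_hat[labeled] = y[labeled]              # clamp the known labels
        weight[labeled] = 1.0
        clf = MultinomialNB(alpha=alpha)
        clf.fit(X, y_hat, sample_weight=weight)  # M-step: refit on own predictions
        if np.abs(clf.feature_log_prob_ - old).max() < tol:
            break                                # coefficient-based convergence check
    return clf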

@ogrisel
Member

ogrisel commented Nov 7, 2011

Looks very interesting. For the lazy, here is the outcome of the document classification example with 4 of the 20 newsgroups categories and only 10% of the samples in the training set having labels (around 200 labeled samples and 1800 unlabeled):

================================================================================
Baseline: fully supervised Naive Bayes
________________________________________________________________________________
Training: 
MultinomialNB(alpha=0.01, fit_prior=True)
train time: 0.009s
test time:  0.003s
f1-score:   0.809
dimensionality: 32101


________________________________________________________________________________
Training: 
BernoulliNB(alpha=0.01, binarize=0.0, fit_prior=True)
train time: 0.010s
test time:  0.022s
f1-score:   0.818
dimensionality: 32101


================================================================================
Naive Bayes trained with Expectation Maximization
________________________________________________________________________________
Training: 
EMNB(estimator=MultinomialNB(alpha=0.01, fit_prior=True),
   estimator__alpha=0.01, estimator__fit_prior=True, n_iter=10, tol=0.001,
   verbose=False)
train time: 0.197s
test time:  0.003s
f1-score:   0.883
dimensionality: 32101


________________________________________________________________________________
Training: 
EMNB(estimator=BernoulliNB(alpha=0.01, binarize=0.0, fit_prior=True),
   estimator__alpha=0.01, estimator__binarize=0.0,
   estimator__fit_prior=True, n_iter=10, tol=0.001, verbose=False)
train time: 0.416s
test time:  0.021s
f1-score:   0.864
dimensionality: 32101

Semi-supervised learning is practically important in my opinion because of the cost of annotating data with a supervised signal. It can help annotators bootstrap the process by annotating a small proportion of a dataset. Then we can do some sort of active learning by asking the fitted model for the samples with the least confidence (according to predict_proba / the decision function being close to the threshold) and asking the human annotator to label those examples first.
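For instance, a least-confidence query could look roughly like this (just a sketch, not part of this PR; it works with any estimator that exposes predict_proba):

import numpy as np

def least_confident(clf, X_unlabeled, n_queries=10):
    # Indices of the unlabeled samples the model is least sure about,
    # i.e. those with the smallest maximum class probability.
    proba = clf.predict_proba(X_unlabeled)
    return np.argsort(proba.max(axis=1))[:n_queries]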

Do you have an idea whether the CPU time of semi-supervised over supervised is always 10x, or is there some asymptotic complexity that makes it not scalable to larger problems (e.g. 100k samples)?

This work is related to the LabelPropagation pull request (which needs a final review, i.e. profiling it and checking whether the eigenproblem can be sped up). I would be curious to see whether the label propagation branch could handle sparse input, so as to compare both methods on the 20newsgroups dataset.


old_coef = np.copy(clf.coef_)
old_intercept = np.copy(clf.intercept_)
Y = clf.predict_proba(X)
Member

Instead of relabeling all the examples, have you considered relabeling only the unlabeled ones? On one hand, relabeling the examples for which we know the human-assigned label seems like a waste. On the other hand, it can be more robust if many labels are actually incorrect.

Member Author

Good point. That is in fact what Nigam et al. do.

Member

I read a paper on label propagation recently in which the authors advocated relabeling everything. So we may want to create an option to let the user choose?
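The two schemes only differ in which rows of the soft-label matrix get overwritten in each iteration; roughly (a sketch with illustrative names: relabel_all is the proposed flag, labeled a boolean mask of the samples that have a human-assigned label):

def update_soft_labels(Y, Y_pred, labeled, relabel_all=True):
    # Y      : current soft-label matrix, shape (n_samples, n_classes)
    # Y_pred : clf.predict_proba(X) computed in the E-step
    if relabel_all:
        return Y_pred                  # trust the model everywhere (robust to label noise)
    Y = Y.copy()
    Y[~labeled] = Y_pred[~labeled]     # keep human-assigned labels fixed, as in Nigam et al.
    return Y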

Member Author

@mblondel: added this in 1f7c45823bb88318296c0db49b17013d6e8cadc5.

Sorry, hadn't seen your comment.

@larsmans
Member Author

larsmans commented Nov 8, 2011

@ogrisel: the time required should be simply n_iter times the time it takes to fit a single NB classifier + some overhead in the convergence checking.

if self.verbose:
    print "Naive Bayes EM, iteration %d," % i,

clf._fit1ofK(X, Y, sample_weight, class_prior)
Member

If I understand the code correctly, in the very first call to this line (first iteration of the loop), Y should include -1 as labels. Does it affect the underlying classifier or does the classifier handle the -1 differently?

Member Author

I've hacked the LabelBinarizer to produce uniform probabilities where it encounters an unlabeled sample. I just changed the EMNB code to start from a supervised model, though, so this no longer happens.

@mblondel
Member

mblondel commented Nov 8, 2011

  • I'm wondering if the view (slicing?) X[y == -1] has an impact on performance. Before we merge this branch and the label propagation one, I would really like to clear that up. Taking the convention that unlabeled examples should always be at the end of the dataset may help.
  • It seems to me that we may not want to tie this EM class to Naive Bayes: we should be able to apply the same kind of EM algorithm to any classifier, with soft labeling using predict_proba if available and hard labeling with predict if not. We need to establish a way to know what the model parameters are, though (for the convergence check).
  • A plot of accuracy vs. the percentage of labeled data would be nicer than a text-based output :) (see the sketch below)
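Something like the following could generate that plot (a sketch only: make_clf is a placeholder for constructing the EMNB estimator, X_train / y_train / X_test / y_test are assumed to be prepared elsewhere with integer labels, and -1 marks unlabeled samples as in this PR):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score

fractions = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
scores = []
rng = np.random.RandomState(42)
for frac in fractions:
    y_semi = np.array(y_train)
    hide = rng.rand(len(y_semi)) > frac       # keep roughly `frac` of the labels
    y_semi[hide] = -1                         # -1 marks "unlabeled"
    clf = make_clf().fit(X_train, y_semi)
    scores.append(f1_score(y_test, clf.predict(X_test), average='macro'))

plt.plot(fractions, scores, marker='o')
plt.xlabel("fraction of labeled training samples")
plt.ylabel("macro-averaged F1 on the test set")
plt.show()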

@larsmans
Member Author

larsmans commented Nov 8, 2011

I guess that if we check convergence based on the classifier output (training set accuracy), then this becomes a very general self-training algorithm.

Turns out the slicing is indeed incredibly expensive. I just cached it and training time decreases by a factor of five!

@ogrisel
Member

ogrisel commented Nov 8, 2011

This is array masking (which indeed causes memory allocation) rather than slicing, which is fast because it creates a cheap view without memory allocation.
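For dense arrays the difference is easy to demonstrate (a minimal sketch; sparse matrices are a different story, as noted below):

import numpy as np

a = np.arange(10)
view = a[2:5]            # basic slicing: a view, no data copied
masked = a[a % 2 == 0]   # boolean masking: allocates a new array and copies

view[0] = 99
print(a[2])              # 99 -- the view shares memory with `a`
masked[0] = -1
print(a[0])              # 0  -- the masked copy does not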

@larsmans
Member Author

larsmans commented Nov 9, 2011

@ogrisel: I might not be familiar enough with NumPy/SciPy's masking and slicing yet, but I just picked the option that works with sparse matrices, since text classification is the intended use case. We can add switching on type for smarter labeled/unlabeled selection later; currently, EMNB wants a BaseDiscreteNB anyway, since I don't have a use case for Gaussian NB. Please try the demo on the full 20newsgroups set and decide whether the performance is ok.

@mblondel: as regards extending this to other linear classifiers, I suggest we keep EMNB and maybe later introduce a new, generalized self-training meta-estimator for other classifiers. Naive Bayes EM has plenty of interesting routes for expansion, including optimizing for likelihood and the new SFE algorithm, so any duplication is not necessarily harmful.

I'm willing to write docs and fix bugs, but not to put very much more time into this PR for now.

@fannix
Contributor

fannix commented Dec 15, 2011

Strange, the following is what I got. EM made the accuracy drop.

================================================================================
Baseline: fully supervised Naive Bayes
________________________________________________________________________________
Training: 
MultinomialNB(alpha=0.01, fit_prior=True)
train time: 0.008s
test time:  0.004s
f1-score:   0.799
dimensionality: 32101


________________________________________________________________________________
Training: 
BernoulliNB(alpha=0.01, binarize=0.0, fit_prior=True)
train time: 0.007s
test time:  0.021s
f1-score:   0.799
dimensionality: 32101


================================================================================
Naive Bayes trained with Expectation Maximization
________________________________________________________________________________
Training: 
EMNB(estimator=MultinomialNB(alpha=0.01, fit_prior=True),
   estimator__alpha=0.01, estimator__fit_prior=True, n_iter=10, tol=0.001,
   verbose=False)
train time: 0.057s
test time:  0.004s
f1-score:   0.761
dimensionality: 32101


________________________________________________________________________________
Training: 
EMNB(estimator=BernoulliNB(alpha=0.01, binarize=0.0, fit_prior=True),
   estimator__alpha=0.01, estimator__binarize=0.0,
   estimator__fit_prior=True, n_iter=10, tol=0.001, verbose=False)
train time: 0.101s
test time:  0.020s
f1-score:   0.790
dimensionality: 32101

@larsmans
Member Author

@fannix: how much labeled and unlabeled data did you use?

@fannix
Contributor

fannix commented Dec 16, 2011

@larsmans: It was just the demo examples.

@fannix
Contributor

fannix commented Dec 16, 2011

Also, _label_1ofK() doesn't handle -1 well. When I run Y = clf.estimator._label_1ofK(np.array([0, 1, -1])), it returns

array([[ 1. ,  0. ],
       [ 0. ,  1. ],
       [ 0.5,  0.5]])

It seems -1 is not a good way to represent missing labels. Have you fixed this problem?


@fannix
Contributor

fannix commented Dec 16, 2011

Sorry, I misinterpreted the result. Is this the intended result?


@larsmans
Member Author

Yes, this is the intended result. When no label is given, assume uniform probability. I'll look into the performance next week.
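In other words, the 1-of-K encoding simply falls back to a uniform row for unlabeled samples; roughly (a self-contained sketch, not the actual _label_1ofK / LabelBinarizer code, and assuming the labels are 0..K-1):

import numpy as np

def label_1ofK_soft(y, n_classes, unlabeled=-1):
    # Hard labels become one-hot rows; unlabeled samples get a uniform row.
    y = np.asarray(y)
    Y = np.full((len(y), n_classes), 1.0 / n_classes)
    known = y != unlabeled
    Y[known] = 0.0
    Y[known, y[known]] = 1.0
    return Y

label_1ofK_soft([0, 1, -1], n_classes=2)
# array([[ 1. ,  0. ],
#        [ 0. ,  1. ],
#        [ 0.5,  0.5]])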

@fannix
Contributor

fannix commented Dec 16, 2011

Great, hope to hear from you soon. I checked the EMNB code. IMHO, the E-step (clf.predict_proba) should be performed first, before the M-step (clf._fit1ofK).


@larsmans
Member Author

You're right about that.

@larsmans
Member Author

@fannix: I've reproduced the bad performance you saw. This seems to be due to commit b5bc737, i.e. only re-labeling the unlabeled samples.

@fannix
Contributor

fannix commented Dec 19, 2011

That seems interesting; maybe you could add a flag to control whether to relabel only the unlabeled samples or all of them. I remember there is also a variant of the EM algorithm in the original paper that uses the relabel-all-samples scheme.


@mblondel
Member

+1 for having both schemes. Relabeling all instances is useful if you have label noise in your data.

@larsmans
Member Author

@mblondel: I'm writing docs, care to review the code?


def __init__(self, estimator, n_iter=10, relabel_all=True, tol=1e-3,
             verbose=False):
    self.estimator = estimator
Member

shall we raise an error if estimator is not a BaseDiscreteNB?

Member Author

Good point. I want to extend it to handle BaseNB first.

@mblondel
Member

The estimator and the modifications you made to LabelBinarizer look good to me.

Regarding the example, it would be nice to modify the 20 newsgroups dataset loader so as to return ready-to-use features. This way, we wouldn't have to call Vectorizer (the purpose of the example is to illustrate semi-supervised learning, not feature extraction). If you feel like implementing it, a plot with the proportion of labeled data on the x-axis and the accuracy on the y-axis would be nice, but this PR can be merged without it IMO.

@ogrisel
Member

ogrisel commented Dec 19, 2011

@mblondel I agree that this code gets duplicated in many examples.

I think we should have a load_vectorized_20newsgroups utility in this module that uses the joblib memoizer to cache the results of the vectorization in the data_home folder.
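Roughly what I have in mind (a sketch with placeholder names; it uses today's TfidfVectorizer rather than the Vectorizer mentioned above, and fits a shared vocabulary for the train and test splits):

import os
from joblib import Memory
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

memory = Memory(os.path.expanduser("~/scikit_learn_data/20news_vectorized"), verbose=0)

@memory.cache
def load_vectorized_20newsgroups(categories=None):
    # Return (X_train, y_train, X_test, y_test) with X already vectorized.
    # The result is memoized on disk, so the vectorization only runs once.
    train = fetch_20newsgroups(subset="train", categories=categories)
    test = fetch_20newsgroups(subset="test", categories=categories)
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train.data)
    X_test = vectorizer.transform(test.data)
    return X_train, train.target, X_test, test.target

Caching on disk keeps the example scripts focused on the semi-supervised part instead of feature extraction.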

@larsmans
Member Author

Gone from WIP to MRG. I think I'm done for the moment.

@amueller
Member

Doctest failure

File "scikit-learn/sklearn/preprocessing/__init__.py", line 494, in sklearn.preprocessing.LabelBinarizerFailed example:
    clf.transform([1, 6])
Expected:
    array([[ 1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  1.]])
Got:
    array([[ 0.25,  0.25,  0.25,  0.25],
           [ 0.  ,  0.  ,  0.  ,  1.  ]])

>>  raise self.failureException(self.format_failure(.getvalue()))


@amueller
Member

Please rebase onto master

Semisupervised training with EM
-------------------------------

The class ``SemisupervisedNB`` implements the expectation maximization (EM)
Member

:class:`SemisupervisedNB`

This EM algorithm fits an initial model, then iteratively

* uses the current model to predict fractional class memberships;
* fits a new model on its own predictions.
Member

I think it should somehow say that it is related to self-training and link to Wikipedia.

@larsmans
Member Author

I still don't fully agree with the idea of "flattening" SemisupervisedNB so that it is no longer a meta-estimator. One problem with this is that the class would have to duplicate the parameters of the underlying model to get a comprehensive repr:

In [2]: SemisupervisedNB("bernoulli", alpha=1, binarize=2.)
Out[2]: 
SemisupervisedNB(event_model='bernoulli', n_iter=10, relabel_all=True,
          tol=1e-05, verbose=False)

alpha and binarize aren't printed. In the meta-estimator case, we got this for free.

So, I suggest keeping the meta-estimator design after all. I know that "Flat is better than nested", but "There should be one--and preferably only one--obvious way to do it" and "that way may not be obvious at first unless you're Dutch." ;)

@ogrisel
Member

ogrisel commented Dec 22, 2011

@larsmans the Dutch argument is cheating :)

I will try to find some time to read the code / examples to make my own French / Canadian opinion on how the meta-estimator-style API feels in practice.

@fannix
Contributor

fannix commented Dec 23, 2011

I think one thing is missing here: the document length normalization, which is used in the original paper.

@ogrisel
Member

ogrisel commented Dec 23, 2011

@fannix the Vectorizer class always takes care of that by default.

@larsmans
Member Author

I gave alternative representations of "unlabeled" some more thought, but there seems to be no more "natural" value than -1 that is representable in all relevant types. Notably, None translates to nan as an np.float, but is not representable in np.int, which is one of the obvious candidates for class label types.

@amueller
Member

@larsmans As I said in Malaga, I agree with you. It wasn't obvious to me why -1 is the best choice but we shouldn't over-complicate it ;)

@larsmans
Member Author

@amueller, yes, I just wanted it on record here for @fannix and others :)

@ogrisel
Member

ogrisel commented Dec 26, 2011

======================================================================
FAIL: Doctest: sklearn.preprocessing.LabelBinarizer
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/doctest.py", line 2166, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for sklearn.preprocessing.LabelBinarizer
  File "/home/ogrisel/coding/scikit-learn/sklearn/preprocessing/__init__.py", line 458, in LabelBinarizer

----------------------------------------------------------------------
File "/home/ogrisel/coding/scikit-learn/sklearn/preprocessing/__init__.py", line 495, in sklearn.preprocessing.LabelBinarizer
Failed example:
    clf.fit([1, 2, 6, 4, 2])
Expected:
    LabelBinarizer()
Got:
    LabelBinarizer(unlabeled=-1)
----------------------------------------------------------------------
File "/home/ogrisel/coding/scikit-learn/sklearn/preprocessing/__init__.py", line 499, in sklearn.preprocessing.LabelBinarizer
Failed example:
    clf.transform([1, 6])
Expected:
    array([[ 1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  1.]])
Got:
    array([[ 0.25,  0.25,  0.25,  0.25],
           [ 0.  ,  0.  ,  0.  ,  1.  ]])

>>  raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x46b2e60>.getvalue()))

for clf in (BernoulliNB(), MultinomialNB()):
    semi_clf = SemisupervisedNB(clf, n_iter=20, tol=1e6)
    semi_clf.fit(X, y)
    assert_array_equal(semi_clf.predict([[5, 0, 0], [1, 1, 4]]), [1, 2])
Member

The coverage report shows that this test is a bit too easy, as convergence is reached at the first iteration (old_coef and old_intercept are not updated). Also, I would update this test to check that a dense array X and its CSR variant yield the same outcome (in terms of predicted probabilities, for instance).
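Something along these lines, perhaps (a sketch: SemisupervisedNB is the estimator from this branch, its import is omitted since the module layout isn't final, and the tolerance is tightened so the loop actually iterates):

import numpy as np
from numpy.testing import assert_array_almost_equal
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import MultinomialNB

def test_dense_sparse_equivalence():
    # Toy problem; -1 marks the unlabeled samples.
    X = np.array([[5, 0, 0], [0, 1, 3], [1, 0, 1], [0, 2, 4]], dtype=float)
    y = np.array([1, 2, -1, -1])

    clf_dense = SemisupervisedNB(MultinomialNB(), n_iter=20, tol=1e-6).fit(X, y)
    clf_sparse = SemisupervisedNB(MultinomialNB(), n_iter=20, tol=1e-6).fit(csr_matrix(X), y)

    X_new = np.array([[5, 0, 0], [1, 1, 4]], dtype=float)
    assert_array_almost_equal(clf_dense.predict_proba(X_new),
                              clf_sparse.predict_proba(X_new))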

@ogrisel
Member

ogrisel commented Dec 27, 2011

About the nested constructor issue: I am still not sold on the current API. Here is another proposal: what about making SemisupervisedNB a mixin class and generating two concrete classes, SemisupervisedBernoulliNB and SemisupervisedMultinomialNB? As the current implementation uses lambda expressions to simulate some kind of inheritance to build 3 read-only properties and one method, using real inheritance through mixins might make more sense here; furthermore, we would get flat constructors for free and the ability for advanced users to extend through inheritance rather than composition. WDYT?
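To make that concrete, the hierarchy could look roughly like this (a skeleton only; the EM machinery itself is omitted, and the point is that the flat constructors let get_params / repr expose all parameters without estimator__ prefixes):

from sklearn.naive_bayes import BernoulliNB, MultinomialNB

class SemisupervisedNBMixin(object):
    # Sketch of the mixin idea: the shared EM machinery would live here.

    def fit(self, X, y, **kwargs):
        # y == -1 marks unlabeled samples; delegate to the EM loop.
        return self._fit_em(X, y, n_iter=self.n_iter, tol=self.tol)

    def _fit_em(self, X, y, n_iter, tol):
        raise NotImplementedError("EM loop omitted in this sketch")

class SemisupervisedMultinomialNB(SemisupervisedNBMixin, MultinomialNB):
    def __init__(self, alpha=1.0, fit_prior=True, n_iter=10, tol=1e-3):
        MultinomialNB.__init__(self, alpha=alpha, fit_prior=fit_prior)
        self.n_iter = n_iter
        self.tol = tol

class SemisupervisedBernoulliNB(SemisupervisedNBMixin, BernoulliNB):
    def __init__(self, alpha=1.0, binarize=0.0, fit_prior=True, n_iter=10, tol=1e-3):
        BernoulliNB.__init__(self, alpha=alpha, binarize=binarize, fit_prior=fit_prior)
        self.n_iter = n_iter
        self.tol = tol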

@fannix
Contributor

fannix commented Dec 27, 2011

This sounds good. I think it might be better to have a Semisupervised mixin class which can admit unlabeled data.

@amueller
Member

amueller commented Nov 6, 2012

As estimator can only be BernoulliNB or MultinomialNB, and the only one that will probably be added in the future is GaussianNB, I think I am +1 for keywords or for what @ogrisel proposed above.

@larsmans
Member Author

If I ever try this again I'll just rewrite the code, so I'm closing this PR.

@maniteja123
Contributor

Hi @larsmans, I was looking at the previous enhancements in semi-supervised learning and came across this, but that was way back before I even knew machine learning :) I would like to try implementing this if that is okay. Please let me know your thoughts. Thanks.
