
WIP: Semisupervised Naive Bayes using Expectation Maximization #430


Closed
wants to merge 20 commits

Conversation

larsmans
Member

@larsmans larsmans commented Nov 7, 2011

Here's the EM algorithm for semisupervised Naive Bayes. The implementation checks for convergence based on the coefficients, following the advice of Bishop (2006), so it could be used more generally for linear-classifier self-training; however, I only implemented the necessary machinery (fitting on a 1-of-K vector) on the discrete NB estimators, and we might want to switch to log-likelihood-based convergence checking later.

Narrative documentation follows if there's interest in this pull request. I've adapted the document classification example script into a new, semisupervised example. We might also merge these scripts.
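For the curious, the loop looks roughly like the sketch below. It is a simplification: it assigns hard labels weighted by their predicted probability (self-training style), because the public fit API only accepts hard labels, whereas the actual EMNB fits on soft 1-of-K membership vectors through _fit1ofK. The helper name self_train_nb is made up for illustration.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train_nb(X, y, n_iter=10, tol=1e-3, alpha=0.01):
    # y == -1 marks unlabeled samples; assumes every class occurs among
    # the labeled samples and labels are integers.
    y = np.asarray(y)
    labeled = y != -1
    clf = MultinomialNB(alpha=alpha)
    clf.fit(X[labeled], y[labeled])              # initial supervised model
    for _ in range(n_iter):
        old = clf.feature_log_prob_.copy()
        proba = clf.predict_proba(X)             # E-step: class memberships
        y_hat = clf.classes_[proba.argmax(axis=1)]
        weight = proba.max(axis=1)
        y_hat[labeled] = y[labeled]              # clamp the known labels
        weight[labeled] = 1.0
        clf = MultinomialNB(alpha=alpha)
        clf.fit(X, y_hat, sample_weight=weight)  # M-step: refit on own predictions
        if np.abs(clf.feature_log_prob_ - old).max() < tol:
            break                                # coefficient-based convergence check
    return clf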

@ogrisel
Member

ogrisel commented Nov 7, 2011

Looks very interesting. For the lazy, here is the outcome of the document classification example with 4 of the 20 newsgroups categories and only 10% of the samples in the training set having labels (around 200 labeled samples and 1800 unlabeled):

================================================================================
Baseline: fully supervised Naive Bayes
________________________________________________________________________________
Training: 
MultinomialNB(alpha=0.01, fit_prior=True)
train time: 0.009s
test time:  0.003s
f1-score:   0.809
dimensionality: 32101


________________________________________________________________________________
Training: 
BernoulliNB(alpha=0.01, binarize=0.0, fit_prior=True)
train time: 0.010s
test time:  0.022s
f1-score:   0.818
dimensionality: 32101


================================================================================
Naive Bayes trained with Expectation Maximization
________________________________________________________________________________
Training: 
EMNB(estimator=MultinomialNB(alpha=0.01, fit_prior=True),
   estimator__alpha=0.01, estimator__fit_prior=True, n_iter=10, tol=0.001,
   verbose=False)
train time: 0.197s
test time:  0.003s
f1-score:   0.883
dimensionality: 32101


________________________________________________________________________________
Training: 
EMNB(estimator=BernoulliNB(alpha=0.01, binarize=0.0, fit_prior=True),
   estimator__alpha=0.01, estimator__binarize=0.0,
   estimator__fit_prior=True, n_iter=10, tol=0.001, verbose=False)
train time: 0.416s
test time:  0.021s
f1-score:   0.864
dimensionality: 32101

Semi-supervised learning is practically important in my opinion because of the cost of annotating data with a supervised signal. It can help annotators bootstrap the process by annotating a small proportion of a dataset. Then we can do some sort of active learning by asking the fitted model for the samples with the least confidence (according to predict_proba / the decision function being close to the threshold) and asking the human annotator to label those examples first.
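For instance, a least-confidence query could look roughly like this (just a sketch, not part of this PR; it works with any estimator that exposes predict_proba):

import numpy as np

def least_confident(clf, X_unlabeled, n_queries=10):
    # Indices of the unlabeled samples the model is least sure about,
    # i.e. those with the smallest maximum class probability.
    proba = clf.predict_proba(X_unlabeled)
    return np.argsort(proba.max(axis=1))[:n_queries]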

Do you have an idea whether the CPU time of semi-supervised over supervised is always 10x, or is there some asymptotic complexity that makes it not scalable to larger problems (e.g. 100k samples)?

This work is related to the LabelPropagation pull request (which needs a final review, i.e. profiling it and checking whether the eigenproblem can be sped up). I would be curious to see whether the label propagation branch could handle sparse input, so as to compare both methods on the 20newsgroups dataset.


old_coef = np.copy(clf.coef_)
old_intercept = np.copy(clf.intercept_)
Y = clf.predict_proba(X)
Member

Instead of relabeling all the examples, have you considered relabeling only the unlabeled ones? On one hand, relabeling the examples for which we know the human-assigned label seems like a waste. On the other hand, it can be more robust if many labels are actually incorrect.

Member Author

Good point. That is in fact what Nigam et al. do.

Member

I read a paper on label propagation recently in which the authors advocated relabeling everything. So we may want to create an option to let the user choose?
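The two schemes only differ in which rows of the soft-label matrix get overwritten in each iteration; roughly (a sketch with illustrative names: relabel_all is the proposed flag, labeled a boolean mask of the samples that have a human-assigned label):

def update_soft_labels(Y, Y_pred, labeled, relabel_all=True):
    # Y      : current soft-label matrix, shape (n_samples, n_classes)
    # Y_pred : clf.predict_proba(X) computed in the E-step
    if relabel_all:
        return Y_pred                  # trust the model everywhere (robust to label noise)
    Y = Y.copy()
    Y[~labeled] = Y_pred[~labeled]     # keep human-assigned labels fixed, as in Nigam et al.
    return Y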

Member Author

@mblondel: added this in 1f7c45823bb88318296c0db49b17013d6e8cadc5.

Sorry, hadn't seen your comment.

@larsmans
Member Author

larsmans commented Nov 8, 2011

@ogrisel: the time required should be simply n_iter times the time it takes to fit a single NB classifier + some overhead in the convergence checking.

if self.verbose:
    print "Naive Bayes EM, iteration %d," % i,

clf._fit1ofK(X, Y, sample_weight, class_prior)
Member

If I understand the code correctly, in the very first call to this line (first iteration of the loop), Y should include -1 as labels. Does it affect the underlying classifier or does the classifier handle the -1 differently?

Member Author

I've hacked the LabelBinarizer to produce uniform probabilities where it encounters an unlabeled sample. I just changed the EMNB code to start from a supervised model, though, so this no longer happens.

@mblondel
Member

mblondel commented Nov 8, 2011

  • I'm wondering if the view (slicing?) X[y == -1] has an impact on performance. Before we merge this branch and the label propagation one, I would really like to clear that up. Taking the convention that unlabeled examples should always be at the end of the dataset may help.
  • It seems to me that we may not want to tie this EM class to Naive Bayes: we should be able to apply the same kind of EM algorithm to any classifier, with soft labeling using predict_proba if available and hard labeling with predict if not. We need to establish a way to know what the model parameters are, though (for the convergence check).
  • A plot of accuracy vs. the percentage of labeled data would be nicer than a text-based output :) (see the sketch below)
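Something like the following could generate that plot (a sketch only: make_clf is a placeholder for constructing the EMNB estimator, X_train / y_train / X_test / y_test are assumed to be prepared elsewhere with integer labels, and -1 marks unlabeled samples as in this PR):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score

fractions = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
scores = []
rng = np.random.RandomState(42)
for frac in fractions:
    y_semi = np.array(y_train)
    hide = rng.rand(len(y_semi)) > frac       # keep roughly `frac` of the labels
    y_semi[hide] = -1                         # -1 marks "unlabeled"
    clf = make_clf().fit(X_train, y_semi)
    scores.append(f1_score(y_test, clf.predict(X_test), average='macro'))

plt.plot(fractions, scores, marker='o')
plt.xlabel("fraction of labeled training samples")
plt.ylabel("macro-averaged F1 on the test set")
plt.show()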

@larsmans
Member Author

larsmans commented Nov 8, 2011

I guess that if we check convergence based on the classifier output (training set accuracy), then this becomes a very general self-training algorithm.

Turns out the slicing is indeed incredibly expensive. I just cached it and training time decreases by a factor of five!

@ogrisel
Member

ogrisel commented Nov 8, 2011

This is array masking (which indeed causes memory allocation) rather than slicing, which is fast because it creates a cheap view without memory allocation.
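For dense arrays the difference is easy to demonstrate (a minimal sketch; sparse matrices are a different story, as noted below):

import numpy as np

a = np.arange(10)
view = a[2:5]            # basic slicing: a view, no data copied
masked = a[a % 2 == 0]   # boolean masking: allocates a new array and copies

view[0] = 99
print(a[2])              # 99 -- the view shares memory with `a`
masked[0] = -1
print(a[0])              # 0  -- the masked copy does not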

@larsmans
Member Author

larsmans commented Nov 9, 2011

@ogrisel: I might not be familiar enough with NumPy/SciPy's masking and slicing yet, but I just picked the option that works with sparse matrices, since text classification is the intended use case. We can add switching on type for smarter labeled/unlabeled selection later; currently, EMNB wants a BaseDiscreteNB anyway, since I don't have a use case for Gaussian NB. Please try the demo on the full 20newsgroups set and decide whether the performance is ok.

@mblondel: as regards extending this to other linear classifiers, I suggest we keep EMNB and maybe later introduce a new, generalized self-training meta-estimator for other classifiers. Naive Bayes EM has plenty of interesting routes for expansion, including optimizing for likelihood and the new SFE algorithm, so any duplication is not necessarily harmful.

I'm willing to write docs and fix bugs, but not to put very much more time into this PR for now.

@fannix
Contributor

fannix commented Dec 15, 2011

Strange, the following is what I got. EM made the accuracy drop.

================================================================================
Baseline: fully supervised Naive Bayes
________________________________________________________________________________
Training: 
MultinomialNB(alpha=0.01, fit_prior=True)
train time: 0.008s
test time:  0.004s
f1-score:   0.799
dimensionality: 32101


________________________________________________________________________________
Training: 
BernoulliNB(alpha=0.01, binarize=0.0, fit_prior=True)
train time: 0.007s
test time:  0.021s
f1-score:   0.799
dimensionality: 32101


================================================================================
Naive Bayes trained with Expectation Maximization
________________________________________________________________________________
Training: 
EMNB(estimator=MultinomialNB(alpha=0.01, fit_prior=True),
   estimator__alpha=0.01, estimator__fit_prior=True, n_iter=10, tol=0.001,
   verbose=False)
train time: 0.057s
test time:  0.004s
f1-score:   0.761
dimensionality: 32101


________________________________________________________________________________
Training: 
EMNB(estimator=BernoulliNB(alpha=0.01, binarize=0.0, fit_prior=True),
   estimator__alpha=0.01, estimator__binarize=0.0,
   estimator__fit_prior=True, n_iter=10, tol=0.001, verbose=False)
train time: 0.101s
test time:  0.020s
f1-score:   0.790
dimensionality: 32101

@larsmans
Member Author

@fannix: how much labeled and unlabeled data did you use?

@fannix
Contributor

fannix commented Dec 16, 2011

@larsmans: It was just the demo examples.

@fannix
Contributor

fannix commented Dec 16, 2011

Also, _label_1ofK() doesn't handle -1 well. When I run Y = clf.estimator._label_1ofK(np.array([0, 1, -1])), it returns

array([[ 1. ,  0. ],
       [ 0. ,  1. ],
       [ 0.5,  0.5]])

It seems -1 is not a good way to represent missing labels. Have you fixed this problem?


@fannix
Contributor

fannix commented Dec 16, 2011

Sorry, I misinterpreted the result. Is this the intended result?


@larsmans
Member Author

Yes, this is the intended result. When no label is given, assume uniform probability. I'll look into the performance next week.
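In other words, the 1-of-K encoding simply falls back to a uniform row for unlabeled samples; roughly (a self-contained sketch, not the actual _label_1ofK / LabelBinarizer code, and assuming the labels are 0..K-1):

import numpy as np

def label_1ofK_soft(y, n_classes, unlabeled=-1):
    # Hard labels become one-hot rows; unlabeled samples get a uniform row.
    y = np.asarray(y)
    Y = np.full((len(y), n_classes), 1.0 / n_classes)
    known = y != unlabeled
    Y[known] = 0.0
    Y[known, y[known]] = 1.0
    return Y

label_1ofK_soft([0, 1, -1], n_classes=2)
# array([[ 1. ,  0. ],
#        [ 0. ,  1. ],
#        [ 0.5,  0.5]])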

@fannix
Contributor

fannix commented Dec 16, 2011

Great, hope to hear from you soon. I checked the EMNB code. IMHO, the E-step (clf.predict_proba) should be performed first, before the M-step (clf._fit1ofK).


@larsmans
Member Author

You're right about that.

@larsmans
Member Author

@fannix: I've reproduced the bad performance you saw. This seems to be due to commit b5bc737, i.e. only re-labeling the unlabeled samples.

@fannix
Contributor

fannix commented Dec 19, 2011

That seems interesting; maybe you could add a flag to control whether to relabel only the unlabeled samples or all of them. I remember there is also a variant of the EM algorithm in the original paper that uses the relabel-all-samples scheme.


@mblondel
Member

+1 for having both schemes. Relabeling all instances is useful if you have label noise in your data.

@larsmans
Member Author

@mblondel: I'm writing docs, care to review the code?


def __init__(self, estimator, n_iter=10, relabel_all=True, tol=1e-3,
             verbose=False):
    self.estimator = estimator
Member

shall we raise an error if estimator is not a BaseDiscreteNB?

Member Author

Good point. I want to extend it to handle BaseNB first.

@mblondel
Member

The estimator and the modifications you made to LabelBinarizer look good to me.

Regarding the example, it would be nice to modify the 20 newsgroups dataset loader so as to return ready-to-use features. This way, we wouldn't have to call Vectorizer (the purpose of the example is to illustrate semi-supervised learning, not feature extraction). If you feel like implementing it, a plot with the proportion of labeled data on the x-axis and the accuracy on the y-axis would be nice, but this PR can be merged without it IMO.

@ogrisel
Member

ogrisel commented Dec 19, 2011

@mblondel I agree that this code gets duplicated in many examples.

I think we should have a load_vectorized_20newsgroups utility in this module that uses the joblib memoizer to cache the results of the vectorization in the data_home folder.
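Roughly what I have in mind (a sketch with placeholder names; it uses today's TfidfVectorizer rather than the Vectorizer mentioned above, and fits a shared vocabulary for the train and test splits):

import os
from joblib import Memory
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

memory = Memory(os.path.expanduser("~/scikit_learn_data/20news_vectorized"), verbose=0)

@memory.cache
def load_vectorized_20newsgroups(categories=None):
    # Return (X_train, y_train, X_test, y_test) with X already vectorized.
    # The result is memoized on disk, so the vectorization only runs once.
    train = fetch_20newsgroups(subset="train", categories=categories)
    test = fetch_20newsgroups(subset="test", categories=categories)
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train.data)
    X_test = vectorizer.transform(test.data)
    return X_train, train.target, X_test, test.target

Caching on disk keeps the example scripts focused on the semi-supervised part instead of feature extraction.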

@larsmans
Member Author

Gone from WIP to MRG. I think I'm done for the moment.

@amueller
Member

Doctest failure

File "scikit-learn/sklearn/preprocessing/__init__.py", line 494, in sklearn.preprocessing.LabelBinarizerFailed example:
    clf.transform([1, 6])
Expected:
    array([[ 1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  1.]])
Got:
    array([[ 0.25,  0.25,  0.25,  0.25],
           [ 0.  ,  0.  ,  0.  ,  1.  ]])

>>  raise self.failureException(self.format_failure(.getvalue()))


@amueller
Member

Please rebase onto master

Semisupervised training with EM
-------------------------------

The class ``SemisupervisedNB`` implements the expectation maximization (EM)
Member

:class:`SemisupervisedNB`

This EM algorithm fits an initial model, then iteratively

* uses the current model to predict fractional class memberships;
* fits a new model on its own predictions.
Member

I think it should somehow say that it is related to self-training and link to Wikipedia.

@larsmans
Member Author

I still don't fully agree with the idea of "flattening" SemisupervisedNB so that it is no longer a meta-estimator. One problem with this is that the class would have to duplicate the parameters of the underlying model to get a comprehensive repr:

In [2]: SemisupervisedNB("bernoulli", alpha=1, binarize=2.)
Out[2]: 
SemisupervisedNB(event_model='bernoulli', n_iter=10, relabel_all=True,
          tol=1e-05, verbose=False)

alpha and binarize aren't printed. In the meta-estimator case, we got this for free.

So, I suggest keeping the meta-estimator design after all. I know that "Flat is better than nested", but "There should be one--and preferably only one--obvious way to do it" and "that way may not be obvious at first unless you're Dutch." ;)

@ogrisel
Member

ogrisel commented Dec 22, 2011

@larsmans the Dutch argument is cheating :)

I will try to find some time to read the code / examples to make my own French / Canadian opinion on how the meta-estimator-style API feels in practice.

@fannix
Contributor

fannix commented Dec 23, 2011

I think one thing is missing here: the document length normalization, which is used in the original paper.

@ogrisel
Member

ogrisel commented Dec 23, 2011

@fannix the Vectorizer class always takes care of that by default.

@larsmans
Member Author

I gave alternative representations of "unlabeled" some more thought, but there seems to be no more "natural" value than -1 that is representable in all relevant types. Notably, None translates to nan as an np.float, but is not representable in np.int, which is one of the obvious candidates for class label types.

@amueller
Member

@larsmans As I said in Malaga, I agree with you. It wasn't obvious to me why -1 is the best choice but we shouldn't over-complicate it ;)

@larsmans
Member Author

@amueller, yes, I just wanted it on record here for @fannix and others :)

@ogrisel
Member

ogrisel commented Dec 26, 2011

======================================================================
FAIL: Doctest: sklearn.preprocessing.LabelBinarizer
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/doctest.py", line 2166, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for sklearn.preprocessing.LabelBinarizer
  File "/home/ogrisel/coding/scikit-learn/sklearn/preprocessing/__init__.py", line 458, in LabelBinarizer

----------------------------------------------------------------------
File "/home/ogrisel/coding/scikit-learn/sklearn/preprocessing/__init__.py", line 495, in sklearn.preprocessing.LabelBinarizer
Failed example:
    clf.fit([1, 2, 6, 4, 2])
Expected:
    LabelBinarizer()
Got:
    LabelBinarizer(unlabeled=-1)
----------------------------------------------------------------------
File "/home/ogrisel/coding/scikit-learn/sklearn/preprocessing/__init__.py", line 499, in sklearn.preprocessing.LabelBinarizer
Failed example:
    clf.transform([1, 6])
Expected:
    array([[ 1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  1.]])
Got:
    array([[ 0.25,  0.25,  0.25,  0.25],
           [ 0.  ,  0.  ,  0.  ,  1.  ]])

>>  raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x46b2e60>.getvalue()))

for clf in (BernoulliNB(), MultinomialNB()):
    semi_clf = SemisupervisedNB(clf, n_iter=20, tol=1e6)
    semi_clf.fit(X, y)
    assert_array_equal(semi_clf.predict([[5, 0, 0], [1, 1, 4]]), [1, 2])
Member

The coverage report shows that this test is a bit too easy, as convergence is reached at the first iteration (old_coef and old_intercept are not updated). Also, I would update this test to check that a dense array X and its CSR variant yield the same outcome (in terms of predicted probabilities, for instance).
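Something along these lines, perhaps (a sketch: SemisupervisedNB is the estimator from this branch, its import is omitted since the module layout isn't final, and the tolerance is tightened so the loop actually iterates):

import numpy as np
from numpy.testing import assert_array_almost_equal
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import MultinomialNB

def test_dense_sparse_equivalence():
    # Toy problem; -1 marks the unlabeled samples.
    X = np.array([[5, 0, 0], [0, 1, 3], [1, 0, 1], [0, 2, 4]], dtype=float)
    y = np.array([1, 2, -1, -1])

    clf_dense = SemisupervisedNB(MultinomialNB(), n_iter=20, tol=1e-6).fit(X, y)
    clf_sparse = SemisupervisedNB(MultinomialNB(), n_iter=20, tol=1e-6).fit(csr_matrix(X), y)

    X_new = np.array([[5, 0, 0], [1, 1, 4]], dtype=float)
    assert_array_almost_equal(clf_dense.predict_proba(X_new),
                              clf_sparse.predict_proba(X_new))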

@ogrisel
Member

ogrisel commented Dec 27, 2011

About the nested constructor issue: I am still not sold on the current API. Here is another proposal: what about making SemisupervisedNB a mixin class and generating two concrete classes, SemisupervisedBernoulliNB and SemisupervisedMultinomialNB? As the current implementation uses lambda expressions to simulate some kind of inheritance to build 3 read-only properties and one method, using real inheritance through mixins might make more sense here; furthermore, we would get flat constructors for free and the ability for advanced users to extend through inheritance rather than composition. WDYT?
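To make that concrete, the hierarchy could look roughly like this (a skeleton only; the EM machinery itself is omitted, and the point is that the flat constructors let get_params / repr expose all parameters without estimator__ prefixes):

from sklearn.naive_bayes import BernoulliNB, MultinomialNB

class SemisupervisedNBMixin(object):
    # Sketch of the mixin idea: the shared EM machinery would live here.

    def fit(self, X, y, **kwargs):
        # y == -1 marks unlabeled samples; delegate to the EM loop.
        return self._fit_em(X, y, n_iter=self.n_iter, tol=self.tol)

    def _fit_em(self, X, y, n_iter, tol):
        raise NotImplementedError("EM loop omitted in this sketch")

class SemisupervisedMultinomialNB(SemisupervisedNBMixin, MultinomialNB):
    def __init__(self, alpha=1.0, fit_prior=True, n_iter=10, tol=1e-3):
        MultinomialNB.__init__(self, alpha=alpha, fit_prior=fit_prior)
        self.n_iter = n_iter
        self.tol = tol

class SemisupervisedBernoulliNB(SemisupervisedNBMixin, BernoulliNB):
    def __init__(self, alpha=1.0, binarize=0.0, fit_prior=True, n_iter=10, tol=1e-3):
        BernoulliNB.__init__(self, alpha=alpha, binarize=binarize, fit_prior=fit_prior)
        self.n_iter = n_iter
        self.tol = tol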

@fannix
Contributor

fannix commented Dec 27, 2011

This sounds good. I think it might be better to have a Semisupervised mixin class which can admit unlabeled data.

@amueller
Member

amueller commented Nov 6, 2012

As estimator can only be BernoulliNB or MultinomialNB, and the only one that will probably be added in the future is GaussianNB, I think I am +1 for keywords or for what @ogrisel proposed above.

@larsmans
Member Author

If I ever try this again I'll just rewrite the code, so I'm closing this PR.

@maniteja123
Contributor

Hi @larsmans, I was looking at the previous enhancements in semi-supervised learning and came across this, but that was way back before I even knew machine learning :) I would like to try implementing this if that is okay. Please let me know your thoughts. Thanks.
