WIP: Semisupervised Naive Bayes using Expectation Maximization #430
Conversation
Looks very interesting. For the lazy, here is the outcome of the document classification example with 4 of the 20 newsgroups categories, with only 10% of the samples in the training set having labels (around 200 labeled samples and 1800 without labels):
Semi-supervised learning is practically important in my opinion because of the cost of annotating data with a supervised signal. It can help annotators bootstrap the process by annotating a small proportion of a dataset. Then we can do some sort of active learning by asking the fitted model for the samples with the least confidence (according to predict_proba / the decision function being close to the threshold) and asking the human annotator to label those examples first.

Do you have an idea whether the CPU time of semi-supervised over supervised is always around 10x, or is there some asymptotic complexity that makes it not scalable to larger problems (e.g. 100k samples)?

This work is related to the LabelPropagation pull request (which needs a final review, i.e. profiling it and checking whether the eigenproblem cannot be sped up). I would be curious to see if the label propagation branch could handle sparse input, so as to compare both methods on the 20newsgroups dataset.
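A minimal sketch of the least-confidence query step described above, assuming a fitted classifier `clf` with `predict_proba` and a pool `X_unlabeled` of unlabeled samples (both names are hypothetical):

```python
import numpy as np

def least_confident_indices(clf, X_unlabeled, n_queries=10):
    """Pick the unlabeled samples the current model is least sure about,
    to be handed to a human annotator first."""
    proba = clf.predict_proba(X_unlabeled)
    confidence = proba.max(axis=1)             # probability of the predicted class
    return np.argsort(confidence)[:n_queries]  # lowest confidence first
```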
    old_coef = np.copy(clf.coef_)
    old_intercept = np.copy(clf.intercept_)
    Y = clf.predict_proba(X)
Instead of relabeling all the examples, have you considered relabeling only the unlabeled ones? On the one hand, relabeling the examples for which we know the human-assigned label seems like a waste. On the other hand, it can be more robust if many labels are actually incorrect.
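A sketch of the alternative being suggested, assuming the -1-for-unlabeled convention used elsewhere in this PR (`Y` is the probabilistic label matrix, `clf` the current model; all names are assumptions):

```python
import numpy as np

# Indices of the samples whose label the human annotator did not provide.
unlabeled = np.where(y == -1)[0]

# Relabel everything, as the diff above does:
Y_all = clf.predict_proba(X)

# ...or keep the human-assigned rows of Y fixed and only overwrite the rest:
Y_semi = Y.copy()
Y_semi[unlabeled] = clf.predict_proba(X[unlabeled])
```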
Good point. That is in fact what Nigam et al. do.
I read a paper on label propagation recently in which the authors advocated relabeling everything. So we may want to create an option to let the user choose?
@mblondel: added this in 1f7c45823bb88318296c0db49b17013d6e8cadc5.
Sorry, hadn't seen your comment.
@ogrisel: the time required should be simply n_iter times that of an ordinary supervised fit.
    if self.verbose:
        print "Naive Bayes EM, iteration %d," % i,
    clf._fit1ofK(X, Y, sample_weight, class_prior)
If I understand the code correctly, in the very first call to this line (first iteration of the loop), Y should include -1 as labels. Does it affect the underlying classifier or does the classifier handle the -1 differently?
I've hacked the LabelBinarizer to produce uniform probabilities where it encounters an unlabeled sample. I just changed the EMNB code to start from a supervised model, though, so this no longer happens.
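For illustration, the behaviour described here (uniform class probabilities for samples marked -1) could be sketched as a standalone helper; this is not the actual patch to LabelBinarizer, just an equivalent illustration:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

def binarize_with_unlabeled(y, unlabeled_marker=-1):
    """1-of-K encode y, giving unlabeled samples a uniform probability row."""
    y = np.asarray(y)
    labeled = y != unlabeled_marker

    lb = LabelBinarizer().fit(y[labeled])
    n_classes = len(lb.classes_)

    Y_labeled = lb.transform(y[labeled])
    if Y_labeled.shape[1] == 1:               # binary case: expand to two columns
        Y_labeled = np.hstack([1 - Y_labeled, Y_labeled])

    Y = np.full((len(y), n_classes), 1.0 / n_classes)
    Y[labeled] = Y_labeled
    return Y, lb
```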
I guess that if we check convergence based on the classifier output (training set accuracy), then this becomes a very general self-training algorithm.

Turns out the slicing is indeed incredibly expensive. I just cached it and training time decreases by a factor of five!
This is array masking (which indeed causes memory allocation) rather than slicing, which is fast as it creates a cheap view without memory allocation.
@ogrisel: I might not be familiar enough with NumPy/SciPy's masking and slicing yet, but I just picked the option that works with sparse matrices, since text classification is the intended use case. We can add switching on the input type for smarter labeled/unlabeled selection later.

@mblondel: as regards extending this to other linear classifiers, I suggest we keep that for a later PR. I'm willing to write docs and fix bugs, but not to put very much more time into this PR for now.
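The caching mentioned above (doing the labeled/unlabeled row selection once, outside the EM loop) might look roughly like this; variable names are assumptions:

```python
import numpy as np

# Select rows once, before the EM loop; repeating the fancy indexing of a CSR
# matrix in every iteration is what made training about five times slower.
labeled_idx = np.where(y != -1)[0]
unlabeled_idx = np.where(y == -1)[0]

X_labeled = X[labeled_idx]        # cached copies, valid for all iterations
X_unlabeled = X[unlabeled_idx]

for i in range(n_iter):
    # E-step / M-step operate on the cached X_labeled / X_unlabeled here.
    pass
```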
Strange, the following is what I got. EM made the accuracy drop.
@fannix: how much labeled and unlabeled data did you use?
@larsmans: It was just the demo examples.
Also, label_1ofK() doesn't handle -1 well. When I run it I get Y = array([[ 1. , 0. ], ... It seems -1 is not a good way to represent missing labels. Have you fixed this?
Sorry, I wrongly interpreted the result. Is this the intended result?
Yes, this is the intended result. When no label is given, assume uniform probability. I'll look into the performance next week.
Great, hope to hear from you soon. I checked the EMNB code. IMHO, the E-step […]
You're right about that.
That seems interesting; maybe you could add a flag to control whether to relabel all instances.
+1 for having both schemes. Relabeling all instances is useful if you suspect that some of the given labels are incorrect.
@mblondel: I'm writing docs, care to review the code?
    def __init__(self, estimator, n_iter=10, relabel_all=True, tol=1e-3,
                 verbose=False):
        self.estimator = estimator
Shall we raise an error if estimator is not a BaseDiscreteNB?
Good point. I want to extend it to handle BaseNB first.
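A sketch of such a check, assuming BaseDiscreteNB can be imported from sklearn.naive_bayes as it could at the time (recent releases rename it to _BaseDiscreteNB):

```python
from sklearn.naive_bayes import BaseDiscreteNB


def _check_estimator(estimator):
    """Reject base estimators that cannot be fit on a 1-of-K label matrix."""
    if not isinstance(estimator, BaseDiscreteNB):
        raise ValueError("SemisupervisedNB requires a discrete naive Bayes "
                         "estimator, got %r" % estimator)
```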
The estimator and the modifications you made to […]

Regarding the example, it would be nice to modify the 20 newsgroups dataset loader so as to return ready-to-use features. This way, we wouldn't have to call the vectorizer in every text example.
@mblondel I agree that this code gets duplicated in many examples. I think we should have a […]
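For reference, the boilerplate that every text example had to repeat looked roughly like the following (the category subset is just an illustration); later scikit-learn versions added `fetch_20newsgroups_vectorized` to return ready-to-use feature vectors in one call:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

categories = ['alt.atheism', 'sci.space']        # arbitrary example subset
data = fetch_20newsgroups(subset='train', categories=categories)

X = TfidfVectorizer().fit_transform(data.data)   # sparse CSR feature matrix
y = data.target
```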
Gone from WIP to MRG. I think I'm done for the moment.
Doctest failure:

    File "scikit-learn/sklearn/preprocessing/__init__.py", line 494, in sklearn.preprocessing.LabelBinarizer
    Failed example:
        clf.transform([1, 6])
    Expected:
        array([[ 1.,  0.,  0.,  0.],
               [ 0.,  0.,  0.,  1.]])
    Got:
        array([[ 0.25,  0.25,  0.25,  0.25],
               [ 0.  ,  0.  ,  0.  ,  1.  ]])
Please rebase onto master.
    Semisupervised training with EM
    -------------------------------

    The class ``SemisupervisedNB`` implements the expectation maximization (EM)
:class:`SemisupervisedNB`
    This EM algorithm fits an initial model, then iteratively

    * uses the current model to predict fractional class memberships;
    * fits a new model on its own predictions
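A minimal sketch of the loop those two bullet points describe, under the conventions used in this PR (-1 marks unlabeled samples); `fit_1ofK` is a hypothetical stand-in for the actual `_fit1ofK` machinery and the convergence test is simplified:

```python
import numpy as np

def em_fit(clf, X, y, n_iter=10, tol=1e-3):
    """Rough outline of the semisupervised EM procedure."""
    labeled = np.where(y != -1)[0]

    # Initial model: plain supervised fit on the labeled subset only.
    clf.fit(X[labeled], y[labeled])

    for i in range(n_iter):
        old_coef = np.copy(clf.coef_)

        # E-step: fractional class memberships from the current model.
        Y = clf.predict_proba(X)

        # M-step: refit on the soft labels (stand-in for clf._fit1ofK).
        clf.fit_1ofK(X, Y)

        if np.abs(clf.coef_ - old_coef).max() < tol:
            break
    return clf
```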
I think it should somehow say that it is related to self-training and link to the Wikipedia article.
I still don't fully agree with the idea of "flattening" the semisupervised behaviour into the estimators themselves.

So, I suggest keeping the meta-estimator design after all. I know that "Flat is better than nested", but "There should be one -- and preferably only one -- obvious way to do it" and "that way may not be obvious at first unless you're Dutch." ;)
Conflicts: sklearn/preprocessing/__init__.py
@larsmans the Dutch argument is cheating :) I will try to find some time to read the code / examples to form my own French / Canadian opinion on how the meta-estimator-style API feels in practice.
I think one thing is missing here: the document length normalization, which is used in the original paper.
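If by document length normalization we mean rescaling each document's term counts so that long and short documents carry comparable weight, a hedged sketch using scikit-learn's `normalize` could be:

```python
from sklearn.preprocessing import normalize

# X_counts: sparse document-term count matrix (the name is an assumption).
# L1 normalization divides each row by its total term count, so every
# document sums to 1 regardless of its length.
X_normalized = normalize(X_counts, norm='l1')
```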
@fannix the […]
I gave alternative representations of "unlabeled" some more thought, but there seems to be no more "natural" value than -1.
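For concreteness, the convention under discussion, with hypothetical toy data:

```python
import numpy as np

# Labels for six documents: classes 0/1 where known, -1 where the label is missing.
y = np.array([0, 1, -1, -1, 0, -1])

labeled_mask = y != -1   # array([ True,  True, False, False,  True, False])
```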
@larsmans As I said in Malaga, I agree with you. It wasn't obvious to me why -1 is the best choice but we shouldn't over-complicate it ;)
    for clf in (BernoulliNB(), MultinomialNB()):
        semi_clf = SemisupervisedNB(clf, n_iter=20, tol=1e6)
        semi_clf.fit(X, y)
        assert_array_equal(semi_clf.predict([[5, 0, 0], [1, 1, 4]]), [1, 2])
The coverage report shows that this test is a bit too easy, as convergence is reached at the first iteration (old_coef and old_intercept are not updated). Also, I would update this test to check that a dense array X and its CSR variant yield the same outcome (in terms of predicted probabilities, for instance).
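A sketch of the suggested dense/sparse check, reusing the names from the test above (comparing predicted probabilities with assert_array_almost_equal is one way to do it):

```python
import numpy as np
from numpy.testing import assert_array_almost_equal
from scipy.sparse import csr_matrix

def check_dense_sparse_equivalence(semi_clf, X, y):
    """Dense input and its CSR variant should give the same probabilities."""
    semi_clf.fit(X, y)
    proba_dense = semi_clf.predict_proba(X)

    X_csr = csr_matrix(X)
    semi_clf.fit(X_csr, y)
    proba_sparse = semi_clf.predict_proba(X_csr)

    assert_array_almost_equal(proba_dense, proba_sparse)
```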
About the nested constructor issue: I am still not sold on the current API. Here is another proposal: what about making the […]
This sounds good. I think it might be better to have a Semisupervised mixin class which can admit unlabeled data.
If I ever try this again I'll just rewrite the code, so I'm closing this PR.
Hi @larsmans, I was looking at the previous enhancements in semi-supervised learning and came across this. But this was way back before I even knew machine learning :) I would like to try implementing this if it is okay. Please let me know your thoughts. Thanks.
Here's the EM algorithm for semisupervised Naive Bayes. Following the advice of Bishop (2006), the implementation checks for convergence based on the coefficients, so it could be used more generally for linear-classifier self-training; however, I only implemented the necessary machinery (fitting on a 1-of-K vector) on the discrete NB estimators, and we might want to switch to log-likelihood-based convergence checking later.
Narrative documentation will follow if there's interest in this pull request. I've adapted the document classification example script into a new, semisupervised example. We might also merge these scripts.
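A minimal sketch of the coefficient-based convergence test mentioned in the description (parameter names are assumptions; the real check lives inside the EM loop shown in the diff):

```python
import numpy as np

def coef_converged(clf, old_coef, old_intercept, tol=1e-3):
    """Stop EM when the linear model's parameters barely move between
    iterations, as a cheap proxy for log-likelihood convergence."""
    delta = (np.abs(clf.coef_ - old_coef).max()
             + np.abs(clf.intercept_ - old_intercept).max())
    return delta < tol
```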