[MRG+2] Implement Complement Naive Bayes. #8190
Conversation
Just out of curiosity, is CNB not equivalent to pipelining a tf-idf and an MNB?
@glemaitre - no, they are not equivalent. Compare equations (4) and (6) in the paper. For a given class, CNB estimates the parameters for the complement of the class. The authors suggest CNB produces weight estimates that are less biased and more stable (see Figure 1) than those produced by MNB. (A sketch of the two estimators follows below.)
@airalcorn2 yep, you're right; somebody is wrong on the Wikipedia page for MNB, omitting the part about the complement :)
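For reference, here is the contrast between the two estimators, reconstructed from Rennie et al. (2003) and the formula quoted later in this thread (a sketch, so double-check against the paper). With d_{ij} the count of term i in document j and \alpha_i a smoothing prior summing to \alpha, MNB estimates each class from its own documents, while CNB estimates it from everything outside the class:

.. math::

   \hat{\theta}_{ci}^{\mathrm{MNB}} = \frac{\alpha_i + \sum_{j:y_j = c} d_{ij}}{\alpha + \sum_{j:y_j = c} \sum_k d_{kj}}

.. math::

   \hat{\theta}_{ci}^{\mathrm{CNB}} = \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}}{\alpha + \sum_{j:y_j \neq c} \sum_k d_{kj}}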
This needs unit tests for _count, whether based on an example / toy data, or checking that invariants hold for random/challenging data.
sklearn/naive_bayes.py
Outdated
@@ -708,6 +708,112 @@ def _joint_log_likelihood(self, X):
                self.class_log_prior_)


class MultinomialCNB(BaseDiscreteNB):
    """
PEP257: short description belongs here
Fixed. FYI, the short description for MultinomialNB is not PEP257 compliant (I was modifying a copied version of that class).
sklearn/naive_bayes.py
Outdated
        for i in range(n_classes):
            in_class = y == i
            numerator = numerator_all - X[in_class].sum(axis=0)
FWIW, I think this sum can be performed vectorized with np.add.at (now that we only support numpy >= 1.8).
Could you elaborate? I took a look at np.add.at and wasn't able to figure out what you were suggesting.
This seems intuitive enough to me. I'm not immediately sure how one would use np.add.at either.
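For what it's worth, here is one plausible reading of the np.add.at suggestion (my own guess, since the reviewer never spelled it out): it performs an unbuffered scatter-add, so the per-class feature sums can be accumulated without a Python loop, at least for dense X and integer-encoded y.

import numpy as np

# Toy shapes just for illustration.
rng = np.random.RandomState(0)
n_samples, n_features, n_classes = 6, 4, 3
X = rng.poisson(2.0, size=(n_samples, n_features)).astype(np.float64)
y = rng.randint(n_classes, size=n_samples)
alpha = 1.0

class_counts = np.zeros((n_classes, n_features))
np.add.at(class_counts, y, X)  # class_counts[c] == X[y == c].sum(axis=0)

# The complement numerator from the loop above then falls out by broadcasting.
numerator = X.sum(axis=0) - class_counts + alpha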
@jnothman - I added a unit test using a toy data set. Let me know if that's not adequate.
This is looking pretty good to me. Thanks for the contribution! I'll check back again later when the tests are all passing.
Looks like all the tests have passed, @jmschrei.
Apologies for the delay, I've been super busy recently. Can you look into the conflicts that have arisen, and I'll get back to you soon? Again, thanks for taking the time to contribute this, we really appreciate it.
@jmschrei - the
Hi @airalcorn2, the branch is still having problems, I'm guessing due to #9131. If you can get this PR to all tests passing again, I have time to review it and hopefully we can get it merged soon!
@jmschrei - looks like everything's actually passing this time.
sklearn/naive_bayes.py
Outdated
        n_classes = len(self.classes_)
        n_features = X.shape[1]
        weights = np.zeros((n_classes, n_features), dtype=np.float64)
        numerator_all = X.sum(axis=0) + self.alpha
Can you clarify what is going on here? My understanding is that your input will be discrete but not necessarily binary.
The code is hopefully a little clearer now. I'm implementing steps four through eight of the algorithm outlined in Table 4 of the paper. Basically, you count up the features (with a smoothing factor) for the complement of each class, normalize those counts, take the logarithm of these normalized counts, and then normalize again. The input matrix just needs to be non-negative (e.g., it can be either a tf-idf matrix or a raw term count matrix).
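To make those steps concrete, here is a standalone sketch of my own (not the PR's code) of steps four through eight, assuming a dense non-negative X and labels encoded 0..n_classes-1:

import numpy as np

def complement_weights(X, y, alpha=1.0):
    # Steps 4-8 of Table 4 in Rennie et al. (2003), per class c.
    all_counts = X.sum(axis=0)                    # total count of each feature
    classes = np.unique(y)
    weights = np.empty((len(classes), X.shape[1]))
    for i, c in enumerate(classes):
        # Step 4: smoothed feature counts over the complement of class c.
        comp = all_counts - X[y == c].sum(axis=0) + alpha
        # Steps 5-6: normalize the counts, then take the logarithm.
        logged = np.log(comp / comp.sum())
        # Steps 7-8: normalize the log-weights by their L1 norm.
        weights[i] = logged / np.abs(logged).sum()
    return weights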
sklearn/naive_bayes.py
Outdated
        self.weights_ = weights

    def _update_feature_log_prob(self, alpha):
        self.feature_log_prob_ = self.weights_
Is alpha supposed to be ignored here? Here https://github.com/airalcorn2/scikit-learn/blob/47c436065840641989f50032aafc15a0335594ad/sklearn/naive_bayes.py#L712 we are only aggregating the sufficient statistics in _count, but then we update the parameters here.
I made several changes that should make the code more readable and more aligned with the _count and _update_feature_log_prob functions of the other naive Bayes classifiers.
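For readers following along, a rough sketch of what that alignment might look like (my guess at the shape of the change, not the PR's actual diff): _count only accumulates sufficient statistics, and _update_feature_log_prob turns them into smoothed complement weights.

import numpy as np
from sklearn.utils.extmath import safe_sparse_dot

# Hypothetical method bodies. Y is the binarized label-indicator matrix that
# BaseDiscreteNB.fit passes in, so Y.T @ X yields per-class feature counts.
def _count(self, X, Y):
    self.feature_count_ += safe_sparse_dot(Y.T, X)
    self.class_count_ += Y.sum(axis=0)

def _update_feature_log_prob(self, alpha):
    # Complement counts: total counts minus each class's own, plus smoothing.
    comp_count = self.feature_count_.sum(axis=0) - self.feature_count_ + alpha
    logged = np.log(comp_count / comp_count.sum(axis=1, keepdims=True))
    # Rennie et al.'s weight normalization.
    self.feature_log_prob_ = logged / np.abs(logged).sum(axis=1, keepdims=True)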
You should also add an entry to doc/whats_new.rst
@jmschrei - let me know what you think about the changes.
LGTM! Let's see if we can track down another reviewer, @jnothman @glemaitre maybe?
The
Hey, @jmschrei. Do you know the probability/timeline of this being merged? We'd like to migrate our current Mahout Complement Naive Bayes process to Python, so I'm trying to figure out if we should just go with my fork. Thanks.
It's just waiting on another reviewer; perhaps @jnothman or @raghavrv or @glemaitre have time to take a look? It's not a very complicated model. Unfortunately, due to the velocity of PRs and issues being opened, we sometimes lose track of good contributions.
I will review it tonight.
I have the impression that there is some duplicated code between the multinomial CNB and NB (in _count and _joint_log_likelihood). @jmschrei does it make sense to factor it out?
@@ -0,0 +1,123 @@
"""
================================
Comparing Complement Naive Bayes to standard Naive Bayes.
nitpicking -> can you add === until the end of the title and remove the final full stop.
Done.
================================

An example showing how MultinomialCNB outperforms MultinomialNB
on imbalanced text classification tasks.
I had a hard time distinguishing MultinomialCNB from MultinomialNB. We might want to introduce the full name instead of the class name at first.
Name changed to ComplementNB on @jnothman's suggestion.
from sklearn.metrics import precision_score, recall_score
from sklearn.naive_bayes import MultinomialCNB, MultinomialNB

# Some setup code taken from "Classifying Reuters-21578 collection with Python"
You can put this comment in the first docstring as a note
Removed since I'm no longer using Reuters.
print("Weighted Recall: {0:.3f}\n".format(recall)) | ||
|
||
""" | ||
<class 'sklearn.naive_bayes.MultinomialNB'> |
I would remove this part since it will be executed anyway
Removed.
print("Weighted Recall: {0:.3f}\n".format(recalls[model][sublinear])) | ||
|
||
""" | ||
<class 'sklearn.naive_bayes.MultinomialNB'> |
same comments
Removed.
# Some setup code taken from "Classifying Reuters-21578 collection with Python"
# on https://miguelmalvarez.com.
random.seed(2010)
I think that we usually use np.random.RandomState instead of random.
rng = np.random.RandomState(2010)
Changed.
import numpy as np
import random

from nltk.corpus import reuters
Removed NLTK. I was using the Reuters data set because it's imbalanced, but 20 Newsgroups also demonstrates the superiority of CNB.
train_docs = [reuters.raw(doc_id) for doc_id in train_docs_id]
test_docs = [reuters.raw(doc_id) for doc_id in test_docs_id]

train_labels = [random.choice(reuters.categories(doc_id))
random.choice could be replaced by rng.choice
Done.
recalls = {}

for model in [MultinomialNB, MultinomialCNB]:
    accuracies[model] = {}
It looks a bit weird to me to have a class as a dict key. @jnothman can we do that? I am intrigued.
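(A toy illustration of the answer, not from the PR: classes are ordinary hashable objects in Python, so using them as dict keys is perfectly legal, whether or not it reads well.)

from sklearn.naive_bayes import MultinomialNB

scores = {MultinomialNB: 0.91}  # the class object itself is the key
print(scores[MultinomialNB])    # 0.91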
    precisions[model] = {}
    recalls[model] = {}

    clf = model().fit(X_train_counts, train_labels)
I would probably pass an instance instead of the class. It will depend on the dictionary-key remark.
There are currently no narrative docs in doc/modules/naive_bayes.rst
@@ -0,0 +1,123 @@
"""
This filename is very non-standard. It should be plot_something, and ideally, you should plot something!
Name changed and plot added.
sklearn/naive_bayes.py
Outdated
@@ -726,6 +726,95 @@ def _joint_log_likelihood(self, X):
                self.class_log_prior_)


class MultinomialCNB(BaseDiscreteNB):
Should it just be ComplementNB?
Fine with me.
sklearn/naive_bayes.py
Outdated
    The Complement Naive Bayes classifier was designed to correct the "severe
    assumptions" made by the standard Multinomial Naive Bayes classifier. See
    Rennie et al. (2003) for further discussion.
Should say "Read more in User Guide" as most other classes do.
Added.
I think this is a well-known and useful NB variant. But I don't think the contribution yet meets our standards, in terms of documentation at least. When will it be merged? Perhaps September. When will it be released? Perhaps April 2018.
Btw, that merge estimate might be pessimistic, and the release estimate optimistic. Hard to say.
Thanks for the review, @jnothman and @glemaitre. Tried to incorporate all of your feedback in the latest push. Also added narrative documentation to doc/modules/naive_bayes.rst.
Scikit-learn does have a fetcher for Reuters btw: fetch_rcv1
Y'all mind taking another look, @jnothman and @glemaitre?
otherwise this LGTM
doc/modules/naive_bayes.rst
Outdated
-----------------------

:class:`ComplementNB` implements the complement naive Bayes (CNB) algorithm.
CNB is an adaption of the standard multinomial naive Bayes (MNB) algorithm that
"adaption" -> "adaptation"
Fixed.
@@ -0,0 +1,71 @@
"""
===========================================================
Can't you just add to examples/text/document_classification_20newsgroups.py? This code doesn't seem to illustrate anything more specific to MNB/CMNB.
Fine with me.
    def _count(self, X, Y):
        """Count feature occurrences."""
        if np.any((X.data if issparse(X) else X) < 0):
            raise ValueError("Input X must be non-negative")
not tested
I added a simple test to validate the counts.
I mean that you don't currently test that this error is raised. I think.
@jnothman - I added that test.
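A minimal sketch of what that test could look like (my illustration, assuming the final class name ComplementNB; the PR's actual test may differ):

import numpy as np
from sklearn.naive_bayes import ComplementNB
from sklearn.utils.testing import assert_raises  # scikit-learn's test helper of the era

def test_cnb_rejects_negative_input():
    # _count raises ValueError when any feature value is negative.
    X = np.array([[-1.0, 2.0], [3.0, 4.0]])
    y = np.array([0, 1])
    assert_raises(ValueError, ComplementNB().fit, X, y)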
sklearn/tests/test_naive_bayes.py
Outdated
    # Rennie et al. (2003).
    theta = np.array([
        [
            (0 + 1) / float(3 + 6),
At least in tests I'd much prefer a __future__ import over ugly casts...
I just changed some of the numbers to floats. A __future__ import could cause unexpected behavior in some of the other tests.
Let me know if there's anything else, @jnothman.
A __future__ import couldn't cause unexpected behaviour in other tests (unless they have explicit switches for py2 vs py3, which they don't) because we test on both versions.
@jnothman - ah, right. Added the __future__ import.
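For context, what the import changes (a generic illustration, not the PR's diff):

# Under Python 2, / on two ints is floor division, so (0 + 1) / (3 + 6) is
# silently 0 without a cast; the __future__ import gives Python 3 semantics.
from __future__ import division

print((0 + 1) / (3 + 6))                  # 0.111... on both Python 2 and 3
print((0 + 1) / (3 + 6) == 1 / float(9))  # True: the float() casts become redundant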
Please don't squash your commits. It makes it very hard for me to work out what code "I added a simple test to validate the counts." refers to. As it is, coveralls thinks the line still lacks coverage, and I can't see an assert_raises or similar in your code.
Needs adding to doc/modules/classes.rst
sklearn/tests/test_naive_bayes.py
Outdated
@@ -553,8 +553,10 @@ def test_cnb():
    # Classes are China (0), Japan (1).
    Y = np.array([0, 0, 0, 1])

    # Fit ComplementNB w/ alpha = 1.0.
    # Check the ability to predict the learning set.
This clearly doesn't apply to the assertion you've just added in.
@jnothman - added.
Now I'm okay with this. My only concern is that I'm not sure that this is much used in practice, and I keep seeing papers using MNB. Perhaps that's because it's not in scikit-learn?
@jnothman - that was my feeling; hence, the pull request! Is there anything else you need from me (e.g., following up)?
Merging, thanks @airalcorn2!
@@ -91,6 +91,10 @@ Classifiers and regressors
   during the first epochs of ridge and logistic regression.
   :issue:`8446` by `Arthur Mensch`_.

- Added :class:`naive_bayes.ComplementNB`, which implements the Complement
Argh! No! This is in the wrong place!
.. math::

   \hat{\theta}_{ci} = \frac{\sum{j:y_j \neq c} d_{ij} + \alpha_i}
I had meant to check, but forgot: this does not compile.
Firstly, there should be _ after all the \sums.
Secondly, we at least need blank lines between successive equations.
Thirdly, I'm not sure about the _i on alpha: it is present here, but not in the next line. I should probably double-check this with respect to the implementation!
And yet, I'm still getting TeX complaining of a runaway argument...
Are you able to check this and submit a PR to fix it?
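For reference, a corrected form along the lines the comment asks for (my reconstruction from this thread and the Rennie et al. (2003) estimator; double-check against the merged docs):

.. math::

   \hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}}
                            {\alpha + \sum_{j:y_j \neq c} \sum_{k} d_{kj}}

.. math::

   w_{ci} = \log \hat{\theta}_{ci}

.. math::

   w_{ci} = \frac{w_{ci}}{\sum_{j} |w_{cj}|}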
Reference Issue
N/A
What does this implement/fix? Explain your changes.
Implements the Complement Naive Bayes (CNB) classifier described in Rennie et al. (2003). CNB was designed to correct the "severe assumptions" made by the standard Multinomial Naive Bayes (MNB) classifier. As a result, CNB often achieves considerably better results than MNB on text classification tasks with imbalanced classes (as can be seen below); so much so that Apache Mahout includes an implementation of CNB alongside its MNB classifier. With that being the case, it would be nice to have an easily usable CNB implementation also available in scikit-learn.
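A quick usage sketch (mine, not the PR's example script), using 20 Newsgroups since the thread notes it shows the same effect as Reuters:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import ComplementNB, MultinomialNB

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# CNB typically edges out MNB here, especially on the rarer classes.
for Model in (MultinomialNB, ComplementNB):
    clf = Model(alpha=1.0).fit(X_train, train.target)
    macro_f1 = f1_score(test.target, clf.predict(X_test), average="macro")
    print(Model.__name__, round(macro_f1, 3))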
Any other comments?
Results from testing on Reuters-21578 (see example code).