[MRG+2] Implement Complement Naive Bayes. #8190


Merged: 3 commits merged into scikit-learn:master on Aug 28, 2017

Conversation

@airalcorn2 (Contributor) commented Jan 12, 2017

Reference Issue

N/A

What does this implement/fix? Explain your changes.

Implements the Complement Naive Bayes (CNB) classifier described in Rennie et al. (2003). CNB was designed to correct the "severe assumptions" made by the standard Multinomial Naive Bayes (MNB) classifier. As a result, CNB often achieves considerably better results than MNB on text classification tasks with imbalanced classes (as can be seen below); so much so that Apache Mahout includes a CNB implementation alongside its MNB classifier. Given that, it would be nice to have an easily usable CNB implementation in scikit-learn as well.

Any other comments?

Results from testing on Reuters-21578 (see example code).

<class 'sklearn.naive_bayes.MultinomialNB'>
Accuracy: 0.772
Weighted Precision: 0.735
Weighted Recall: 0.772

<class 'sklearn.naive_bayes.MultinomialCNB'>
Accuracy: 0.813
Weighted Precision: 0.805
Weighted Recall: 0.813
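
For context, here is a minimal usage sketch of the classifier this PR adds (shown under its eventual name, ComplementNB; the toy documents and labels are illustrative, not the benchmark code):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB, MultinomialNB

docs = ["koala kangaroo wombat", "koala koala wallaby",
        "stock market rally", "market crash", "bull market"]
labels = [0, 0, 1, 1, 1]  # deliberately imbalanced classes

X = CountVectorizer().fit_transform(docs)
for model in (MultinomialNB(), ComplementNB()):
    clf = model.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))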

@glemaitre (Member)

Just out of curiosity, isn't CNB equivalent to pipelining tf-idf and an MNB?

@airalcorn2 (Contributor, Author) commented Jan 13, 2017

@glemaitre - no, they are not equivalent. Compare equations (4) and (6) in the paper. For a given class, CNB estimates the parameters for the complement of the class. The authors suggest CNB produces weight estimates that are less biased and more stable (see Figure 1) than those produced by MNB.
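
In outline (paraphrasing those equations in the paper's notation, where d_{ij} is the count of term i in document j and \alpha_i is a smoothing pseudo-count with \alpha = \sum_i \alpha_i), MNB estimates each class's parameters from the documents in the class, while CNB uses the documents in its complement:

    \hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:y_j = c} d_{ij}}{\alpha + \sum_{j:y_j = c} \sum_k d_{kj}}    (MNB, eq. 4)

    \hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}}{\alpha + \sum_{j:y_j \neq c} \sum_k d_{kj}}    (CNB, eq. 6)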

@glemaitre (Member)

@airalcorn2 yep, you're right; somebody is wrong on the Wikipedia page for MNB, which omits the part about the complement :)

@jnothman (Member) left a review comment

This needs unit tests for _count, whether based on an example / toy data, or checking that invariants are held for random/challenging data.
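
For instance, one invariant-style check of the kind being asked for might look like this (a hedged sketch using the class's eventual name, not the PR's actual test):

import numpy as np
from sklearn.naive_bayes import ComplementNB

rng = np.random.RandomState(0)
X = rng.randint(5, size=(20, 10))   # random non-negative count data
y = rng.randint(3, size=20)

# Rows of predict_proba must be proper distributions for any valid input.
proba = ComplementNB().fit(X, y).predict_proba(X)
assert np.allclose(proba.sum(axis=1), 1.0)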

@@ -708,6 +708,112 @@ def _joint_log_likelihood(self, X):
self.class_log_prior_)


class MultinomialCNB(BaseDiscreteNB):
"""
Member:

PEP257: short description belongs here

Contributor Author:

Fixed. FYI, the short description for MultinomialNB is not PEP257 compliant (I was modifying a copied version of that class).


for i in range(n_classes):
in_class = y == i
numerator = numerator_all - X[in_class].sum(axis=0)
Member:

FWIW, I think this sum can be vectorized with np.add.at (now that we only support numpy >= 1.8).

Contributor Author:

Could you elaborate? I took a look at np.add.at and wasn't able to figure out what you were suggesting.

Member:

This seems intuitive enough to me. I'm not immediately sure how one would use np.add.at either.
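
One possibility, for what it's worth (a hedged sketch that only works for dense arrays; the sparse case would need something like a one-hot label matrix product, Y.T @ X, instead):

import numpy as np

X = np.array([[1., 2.], [3., 4.], [5., 6.]])
y = np.array([0, 1, 0])
n_classes = 2

# np.add.at accumulates each row of X into its class's row, unbuffered,
# so repeated class indices are summed correctly in one vectorized call.
class_sums = np.zeros((n_classes, X.shape[1]))
np.add.at(class_sums, y, X)

numerator_all = X.sum(axis=0)              # total count per feature
complements = numerator_all - class_sums   # per-class complement counts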

@airalcorn2 (Contributor, Author)

@jnothman - I added a unit test using a toy data set. Let me know if that's not adequate.

@jmschrei (Member) commented Jun 1, 2017

This is looking pretty good to me. Thanks for the contribution! I'll check back again later when the tests are all passing.

@airalcorn2 (Contributor, Author) commented Jun 6, 2017

Looks like all the tests have passed, @jmschrei.

@jmschrei (Member)

Apologies for the delay, I've been super busy recently. Can you look into the conflicts that have arisen, and I'll get back to you soon? Again, thanks for taking the time to contribute this, we really appreciate it.

@airalcorn2 (Contributor, Author) commented Jun 27, 2017

@jmschrei - the estimator_checks.py file currently ignores/modifies several tests for the different naive Bayes classifiers because the assumptions of these classifiers don't mesh well with the data being tested. I've updated those same tests to account for the new Complement Naive Bayes classifier.

@jmschrei (Member) commented Jun 30, 2017

Hi @airalcorn2, the branch still has problems, I'm guessing due to #9131. If you can get this PR to all tests passing again, I have time to review it, and hopefully we can get it merged soon!

@airalcorn2 (Contributor, Author)

@jmschrei - looks like everything's actually passing this time.

n_classes = len(self.classes_)
n_features = X.shape[1]
weights = np.zeros((n_classes, n_features), dtype=np.float64)
numerator_all = X.sum(axis=0) + self.alpha
Member:

Can you clarify what is going on here? My understanding is that your input will be discrete but not necessarily binary.

Contributor Author (@airalcorn2, Jul 4, 2017):

The code is hopefully a little clearer now. I'm implementing steps four through eight of the algorithm outlined in Table 4 of the paper. Basically, you count up the features (with a smoothing factor) for the complement of each class, normalize those counts, take the logarithm of these normalized counts, and then normalize again. The input matrix just needs to be non-negative (e.g., it can be either a tf-idf matrix or a raw term count matrix).
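
Condensed, those steps look roughly like this for dense input (illustrative names and a single scalar alpha; a sketch, not the PR's actual code):

import numpy as np

def complement_weights(X, y, alpha=1.0):
    # Steps 4-8 of Table 4: smoothed complement counts, normalize,
    # take logs, then normalize the log-weights.
    classes = np.unique(y)
    all_counts = X.sum(axis=0)                     # counts over every document
    weights = np.zeros((len(classes), X.shape[1]))
    for i, c in enumerate(classes):
        comp = all_counts - X[y == c].sum(axis=0)  # counts outside class c
        theta = (comp + alpha) / (comp.sum() + alpha * X.shape[1])
        log_w = np.log(theta)
        weights[i] = log_w / np.abs(log_w).sum()   # second normalization
    return weights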

self.weights_ = weights

def _update_feature_log_prob(self, alpha):
self.feature_log_prob_ = self.weights_
Member:

Is alpha supposed to be ignored here? Here https://github.com/airalcorn2/scikit-learn/blob/47c436065840641989f50032aafc15a0335594ad/sklearn/naive_bayes.py#L712 we are only aggregating the sufficient statistics in _count, but then update parameters here.

Contributor Author:

I made several changes that should make the code more readable and more aligned with the _count and _update_feature_log_prob functions of the other naive Bayes classifiers.

@jmschrei (Member) commented Jul 3, 2017

You should also add an entry to doc/whats_new.rst.

@airalcorn2 (Contributor, Author) commented Jul 6, 2017

@jmschrei - let me know what you think about the changes.

@jmschrei (Member) commented Jul 6, 2017

LGTM! Let's see if we can track down another reviewer, @jnothman @glemaitre maybe?

@jmschrei changed the title from "[MRG] Implement Complement Naive Bayes." to "[MRG+1] Implement Complement Naive Bayes." on Jul 6, 2017
@airalcorn2 (Contributor, Author)

The feature_all_ attribute wasn't accounting for sample weights and also wasn't mentioned in the class docstring, so the most recent push makes those corrections.

@airalcorn2 (Contributor, Author)

Hey, @jmschrei. Do you know the probability/timeline of this being merged? We'd like to migrate our current Mahout Complement Naive Bayes process to Python, so I'm trying to figure out if we should just go with my fork. Thanks.

@jmschrei (Member)

It's just waiting on another reviewer, perhaps @jnothman or @raghavrv or @glemaitre have time to take a look? It's not a very complicated model. Unfortunately, due to the velocity of PRs and issues being opened, we sometimes lose track of good contributions.

@glemaitre (Member) commented Jul 11, 2017 via email

@glemaitre (Member) left a review comment

I have the impression that there is some duplicated code between MultinomialCNB and MultinomialNB (in _count and _joint_log_likelihood). @jmschrei, does it make sense to factor it out?

@@ -0,0 +1,123 @@
"""
================================
Comparing Complement Naive Bayes to standard Naive Bayes.
Member:

Nitpick: can you extend the === to the end of the title and remove the final full stop?

Contributor Author:

Done.

================================

An example showing how MultinomialCNB outperforms MultinomialNB
on imbalanced text classification tasks.
Member:

I had a hard time distinguishing MultinomialCNB from MultinomialNB. We might want to introduce the full name first instead of the class name.

Contributor Author:

Name changed to ComplementNB on @jnothman's suggestion.

from sklearn.metrics import precision_score, recall_score
from sklearn.naive_bayes import MultinomialCNB, MultinomialNB

# Some setup code taken from "Classifying Reuters-21578 collection with Python"
Member:

You can put this comment in the first docstring as a note

Contributor Author:

Removed since I'm no longer using Reuters.

print("Weighted Recall: {0:.3f}\n".format(recall))

"""
<class 'sklearn.naive_bayes.MultinomialNB'>
Member:

I would remove this part since it will be executed anyway

Contributor Author:

Removed.

print("Weighted Recall: {0:.3f}\n".format(recalls[model][sublinear]))

"""
<class 'sklearn.naive_bayes.MultinomialNB'>
Member:

Same comments as above.

Contributor Author:

Removed.


# Some setup code taken from "Classifying Reuters-21578 collection with Python"
# on https://miguelmalvarez.com.
random.seed(2010)
Member:

I think we usually use np.random.RandomState instead of random.

Member:

rng = np.random.RandomState(2010)

Contributor Author:

Changed.

import numpy as np
import random

from nltk.corpus import reuters
Member:

Hmm, not sure that we can add nltk as a dependency, even in the examples. @jmschrei @jnothman, can you confirm my thoughts, or is this a dependency that I am not aware of?

Contributor Author:

Removed NLTK. I was using the Reuters data set because it's imbalanced, but 20 Newsgroups also demonstrates the superiority of CNB.
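
A sketch of that swap, which needs no extra dependency (hedged; the vectorizer settings are illustrative, not the final example script):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import ComplementNB, MultinomialNB

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vec = TfidfVectorizer()
X_train = vec.fit_transform(train.data)
X_test = vec.transform(test.data)

for model in (MultinomialNB(), ComplementNB()):
    model.fit(X_train, train.target)
    acc = accuracy_score(test.target, model.predict(X_test))
    print(type(model).__name__, round(acc, 3))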

train_docs = [reuters.raw(doc_id) for doc_id in train_docs_id]
test_docs = [reuters.raw(doc_id) for doc_id in test_docs_id]

train_labels = [random.choice(reuters.categories(doc_id))
Member:

random.choice could be replaced by rng.choice.

Contributor Author:

Done.

recalls = {}

for model in [MultinomialNB, MultinomialCNB]:
accuracies[model] = {}
Member:

It looks a bit weird to me to have a class as a dict key. @jnothman, can we do that? I am intrigued. (See the toy illustration after the next comment.)

precisions[model] = {}
recalls[model] = {}

clf = model().fit(X_train_counts, train_labels)
Member:

I would probably pass an instance instead of the class. It will depend on the dictionary-key remark.
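
For the record, classes are ordinary hashable Python objects, so either works (a toy illustration with stub classes, not sklearn's):

class MultinomialNB: pass
class ComplementNB: pass

# Classes hash by identity, so they work fine as dict keys,
# though instances or name strings often read more naturally.
accuracies = {MultinomialNB: {}, ComplementNB: {}}
print(accuracies[ComplementNB])  # {}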

@jnothman (Member) left a review comment

There are currently no narrative docs in doc/modules/naive_bayes.rst

@@ -0,0 +1,123 @@
"""
Member:

This filename is very non-standard. It should be plot_ something, and ideally, you should plot something!

Contributor Author:

Name changed and plot added.

@@ -726,6 +726,95 @@ def _joint_log_likelihood(self, X):
self.class_log_prior_)


class MultinomialCNB(BaseDiscreteNB):
Member:

Should it just be ComplementNB?

Contributor Author:

Fine with me.

The Complement Naive Bayes classifier was designed to correct the "severe
assumptions" made by the standard Multinomial Naive Bayes classifier. See
Rennie et al. (2003) for further discussion.

Member:

Should say "Read more in User Guide" as most other classes do.

Contributor Author:

Added.

@jnothman (Member)

@airalcorn2:

Do you know the probability/timeline of this being merged?

I think this is a well-known and useful NB variant. But I don't think the contribution yet meets our standards, in terms of documentation at least. When will it be merged? Perhaps September. When will it be released? Perhaps April 2018.

@jnothman (Member)

Btw, that merge estimate might be pessimistic, and the release estimate optimistic. Hard to say.

@airalcorn2 (Contributor, Author)

Thanks for the review, @jnothman and @glemaitre. Tried to incorporate all of your feedback in the latest push. Also added narrative documentation to doc/modules/naive_bayes.rst.

@jnothman (Member) commented Jul 15, 2017 via email

@airalcorn2 (Contributor, Author)

Y'all mind taking another look, @jnothman and @glemaitre?

@jnothman (Member) left a review comment

otherwise this LGTM

-----------------------

:class:`ComplementNB` implements the complement naive Bayes (CNB) algorithm.
CNB is an adaption of the standard multinomial naive Bayes (MNB) algorithm that
Member:

"adaption" -> "adaptation"

Contributor Author:

Fixed.

@@ -0,0 +1,71 @@
"""
===========================================================
Member:

Can't you just add to examples/text/document_classification_20newsgroups.py? This code doesn't seem to illustrate anything more specific to MNB/CMNB.

Contributor Author:

Fine with me.

def _count(self, X, Y):
"""Count feature occurrences."""
if np.any((X.data if issparse(X) else X) < 0):
raise ValueError("Input X must be non-negative")
Member:

not tested

Contributor Author:

I added a simple test to validate the counts.

Member:

I mean that you don't currently test that this error is raised. I think.

Contributor Author:

@jnothman - I added that test.
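
The requested check could look roughly like this (a hedged sketch; sklearn.utils.testing.assert_raises was the project's helper at the time):

import numpy as np
from sklearn.naive_bayes import ComplementNB
from sklearn.utils.testing import assert_raises

# Fitting on negative input should trigger the ValueError raised in _count.
X = np.array([[-1.0, 2.0], [3.0, 4.0]])
y = np.array([0, 1])
assert_raises(ValueError, ComplementNB().fit, X, y)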

# Rennie et al. (2003).
theta = np.array([
[
(0 + 1) / float(3 + 6),
Member:

At least in tests I'd much prefer a __future__ import over ugly casts...

Contributor Author:

I just changed some of the numbers to floats. A __future__ import could cause unexpected behavior in some of the other tests.

@airalcorn2 (Contributor, Author)

Let me know if there's anything else, @jnothman.

@jnothman (Member) commented Aug 8, 2017

A future import couldn't cause unexpected behaviour in other tests (unless they have explicit switches for py2 vs py3, which they don't), because we test on both versions.

@airalcorn2 (Contributor, Author)

@jnothman - ah, right. Added the __future__ import.
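
For reference, the change amounts to this (true division on both Python 2 and 3):

from __future__ import division

# With true division in effect, the float() casts become unnecessary:
print((0 + 1) / (3 + 6))  # 0.111..., on Python 2 and 3 alike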

@jnothman (Member)

Please don't squash your commits. It makes it very hard for me to work out what code "I added a simple test to validate the counts." refers to. As it is, coveralls thinks the line still lacks coverage, and I can't see an assert_raises or similar in your code.

@airalcorn2 (Contributor, Author)

@jnothman - maybe I'm not understanding what you're asking for? The only other place where I saw "count" show up in test_naive_bayes.py is here. The tests I added are here.

@jnothman (Member) left a review comment

Needs adding to doc/modules/classes.rst

@@ -553,8 +553,10 @@ def test_cnb():
# Classes are China (0), Japan (1).
Y = np.array([0, 0, 0, 1])

# Fit ComplementNB w/ alpha = 1.0.
# Check the ability to predict the learning set.
Member:

This clearly doesn't apply to the assertion you've just added in.

@jnothman changed the title from "[MRG+1] Implement Complement Naive Bayes." to "[MRG+2] Implement Complement Naive Bayes." on Aug 17, 2017
@airalcorn2 (Contributor, Author)

@jnothman - added ComplementNB to doc/modules/classes.rst and fixed the wording of the test comment.

@jnothman (Member)

Now I'm okay with this. My only concern is that I'm not sure that this is much used in practice, and I keep seeing papers using MNB. Perhaps that's because it's not in scikit-learn?

@airalcorn2 (Contributor, Author)

Perhaps that's because it's not in scikit-learn?

@jnothman - that was my feeling; hence, the pull request! Is there anything else you need from me (e.g., following up)?

@jnothman (Member)

Merging, thanks @airalcorn2!

@jnothman merged commit a571b01 into scikit-learn:master on Aug 28, 2017
@@ -91,6 +91,10 @@ Classifiers and regressors
during the first epochs of ridge and logistic regression.
:issue:`8446` by `Arthur Mensch`_.

- Added :class:`naive_bayes.ComplementNB`, which implements the Complement
Member:

Argh! No! This is in the wrong place!


.. math::

\hat{\theta}_{ci} = \frac{\sum{j:y_j \neq c} d_{ij} + \alpha_i}
Member:

I had meant to check, but forgot: this does not compile.

Firstly, there should be _ after all the \sums.

Secondly, we at least need blank lines between successive equations.

Thirdly, I'm not sure about the _i on alpha: it is present here, but not in the next line. I should probably double-check this with respect to the implementation!

And yet, I'm still getting TeX complaining of a runaway argument...

Are you able to check this and submit a PR to fix it?
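
For reference, a sketch of what the corrected block should look like, assuming the formulas of Rennie et al. (subscripted sums, blank lines between the equations, and \alpha_i in the numerator with its sum \alpha in the denominator):

.. math::

    \hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}}
                             {\alpha + \sum_{j:y_j \neq c} \sum_{k} d_{kj}}

    w_{ci} = \log \hat{\theta}_{ci}

    w_{ci} = \frac{w_{ci}}{\sum_{j} |w_{cj}|}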

@airalcorn2 deleted the cnb branch on August 29, 2017.