[MRG+2] Implement Complement Naive Bayes. #8190
Conversation
Just out of curiosity, is CNB not equivalent to pipelining a tf-idf and an MNB?
@glemaitre - no, they are not equivalent. Compare equations (4) and (6) in the paper. For a given class, CNB estimates the parameters for the complement of the class. The authors suggest CNB produces weight estimates that are less biased and more stable (see Figure 1) than those produced by MNB. (A sketch of the two estimators follows below.)
@airalcorn2 yep, you're right; somebody is wrong on the Wikipedia page for MNB, omitting the part about the complement :)
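For reference, here is the contrast between the two estimators, reconstructed from Rennie et al. (2003) and the formula quoted later in this thread (a sketch, so double-check against the paper). With d_{ij} the count of term i in document j and \alpha_i a smoothing prior summing to \alpha, MNB estimates each class from its own documents, while CNB estimates it from everything outside the class:

.. math::

   \hat{\theta}_{ci}^{\mathrm{MNB}} = \frac{\alpha_i + \sum_{j:y_j = c} d_{ij}}{\alpha + \sum_{j:y_j = c} \sum_k d_{kj}}

.. math::

   \hat{\theta}_{ci}^{\mathrm{CNB}} = \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}}{\alpha + \sum_{j:y_j \neq c} \sum_k d_{kj}}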
This needs unit tests for _count, whether based on an example / toy data, or checking that invariants hold for random/challenging data.
sklearn/naive_bayes.py
Outdated
@@ -708,6 +708,112 @@ def _joint_log_likelihood(self, X):
                self.class_log_prior_)


class MultinomialCNB(BaseDiscreteNB):
    """
PEP257: short description belongs here
Fixed. FYI, the short description for MultinomialNB is not PEP257 compliant (I was modifying a copied version of that class).
sklearn/naive_bayes.py
Outdated
        for i in range(n_classes):
            in_class = y == i
            numerator = numerator_all - X[in_class].sum(axis=0)
FWIW, I think this sum can be performed vectorized with np.add.at (now that we only support numpy >= 1.8).
Could you elaborate? I took a look at np.add.at and wasn't able to figure out what you were suggesting.
This seems intuitive enough to me. I'm not immediately sure how one would use np.add.at either.
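For what it's worth, here is one plausible reading of the np.add.at suggestion (my own guess, since the reviewer never spelled it out): it performs an unbuffered scatter-add, so the per-class feature sums can be accumulated without a Python loop, at least for dense X and integer-encoded y.

import numpy as np

# Toy shapes just for illustration.
rng = np.random.RandomState(0)
n_samples, n_features, n_classes = 6, 4, 3
X = rng.poisson(2.0, size=(n_samples, n_features)).astype(np.float64)
y = rng.randint(n_classes, size=n_samples)
alpha = 1.0

class_counts = np.zeros((n_classes, n_features))
np.add.at(class_counts, y, X)  # class_counts[c] == X[y == c].sum(axis=0)

# The complement numerator from the loop above then falls out by broadcasting.
numerator = X.sum(axis=0) - class_counts + alpha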
@jnothman - I added a unit test using a toy data set. Let me know if that's not adequate.
This is looking pretty good to me. Thanks for the contribution! I'll check back again later when the tests are all passing.
Looks like all the tests have passed, @jmschrei.
Apologies for the delay, I've been super busy recently. Can you look into the conflicts that have arisen, and I'll get back to you soon? Again, thanks for taking the time to contribute this, we really appreciate it.
@jmschrei - the
Hi @airalcorn2, the branch is still having problems, I'm guessing due to #9131. If you can get this PR to all tests passing again, I have time to review it and hopefully we can get it merged soon!
@jmschrei - looks like everything's actually passing this time.
sklearn/naive_bayes.py
Outdated
        n_classes = len(self.classes_)
        n_features = X.shape[1]
        weights = np.zeros((n_classes, n_features), dtype=np.float64)
        numerator_all = X.sum(axis=0) + self.alpha
Can you clarify what is going on here? My understanding is that your input will be discrete but not necessarily binary.
The code is hopefully a little clearer now. I'm implementing steps four through eight of the algorithm outlined in Table 4 of the paper. Basically, you count up the features (with a smoothing factor) for the complement of each class, normalize those counts, take the logarithm of these normalized counts, and then normalize again. The input matrix just needs to be non-negative (e.g., it can be either a tf-idf matrix or a raw term count matrix).
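To make those steps concrete, here is a standalone sketch of my own (not the PR's code) of steps four through eight, assuming a dense non-negative X and labels encoded 0..n_classes-1:

import numpy as np

def complement_weights(X, y, alpha=1.0):
    # Steps 4-8 of Table 4 in Rennie et al. (2003), per class c.
    all_counts = X.sum(axis=0)                    # total count of each feature
    classes = np.unique(y)
    weights = np.empty((len(classes), X.shape[1]))
    for i, c in enumerate(classes):
        # Step 4: smoothed feature counts over the complement of class c.
        comp = all_counts - X[y == c].sum(axis=0) + alpha
        # Steps 5-6: normalize the counts, then take the logarithm.
        logged = np.log(comp / comp.sum())
        # Steps 7-8: normalize the log-weights by their L1 norm.
        weights[i] = logged / np.abs(logged).sum()
    return weights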
sklearn/naive_bayes.py
Outdated
        self.weights_ = weights

    def _update_feature_log_prob(self, alpha):
        self.feature_log_prob_ = self.weights_
Is alpha supposed to be ignored here? Here https://github.com/airalcorn2/scikit-learn/blob/47c436065840641989f50032aafc15a0335594ad/sklearn/naive_bayes.py#L712 we are only aggregating the sufficient statistics in _count, but then we update the parameters here.
I made several changes that should make the code more readable and more aligned with the _count and _update_feature_log_prob functions of the other naive Bayes classifiers.
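For readers following along, a rough sketch of what that alignment might look like (my guess at the shape of the change, not the PR's actual diff): _count only accumulates sufficient statistics, and _update_feature_log_prob turns them into smoothed complement weights.

import numpy as np
from sklearn.utils.extmath import safe_sparse_dot

# Hypothetical method bodies. Y is the binarized label-indicator matrix that
# BaseDiscreteNB.fit passes in, so Y.T @ X yields per-class feature counts.
def _count(self, X, Y):
    self.feature_count_ += safe_sparse_dot(Y.T, X)
    self.class_count_ += Y.sum(axis=0)

def _update_feature_log_prob(self, alpha):
    # Complement counts: total counts minus each class's own, plus smoothing.
    comp_count = self.feature_count_.sum(axis=0) - self.feature_count_ + alpha
    logged = np.log(comp_count / comp_count.sum(axis=1, keepdims=True))
    # Rennie et al.'s weight normalization.
    self.feature_log_prob_ = logged / np.abs(logged).sum(axis=1, keepdims=True)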
You should also add an entry to doc/whats_new.rst
@jmschrei - let me know what you think about the changes.
LGTM! Let's see if we can track down another reviewer, @jnothman @glemaitre maybe?
The
Hey, @jmschrei. Do you know the probability/timeline of this being merged? We'd like to migrate our current Mahout Complement Naive Bayes process to Python, so I'm trying to figure out if we should just go with my fork. Thanks.
It's just waiting on another reviewer; perhaps @jnothman or @raghavrv or @glemaitre have time to take a look? It's not a very complicated model. Unfortunately, due to the velocity of PRs and issues being opened, we sometimes lose track of good contributions.
I will review it tonight.
I have the impression that there is some duplicated code between the multinomial CNB and NB (in _count and _joint_log_likelihood). @jmschrei does it make sense to factor it out?
@@ -0,0 +1,123 @@
"""
================================
Comparing Complement Naive Bayes to standard Naive Bayes.
nitpicking -> can you add === until the end of the title and remove the final full stop.
Done.
================================

An example showing how MultinomialCNB outperforms MultinomialNB
on imbalanced text classification tasks.
I had a hard time distinguishing MultinomialCNB from MultinomialNB. We might want to introduce the full name instead of the class name at first.
Name changed to ComplementNB on @jnothman's suggestion.
from sklearn.metrics import precision_score, recall_score
from sklearn.naive_bayes import MultinomialCNB, MultinomialNB

# Some setup code taken from "Classifying Reuters-21578 collection with Python"
You can put this comment in the first docstring as a note
Removed since I'm no longer using Reuters.
print("Weighted Recall: {0:.3f}\n".format(recall)) | ||
|
||
""" | ||
<class 'sklearn.naive_bayes.MultinomialNB'> |
I would remove this part since it will be executed anyway
Removed.
print("Weighted Recall: {0:.3f}\n".format(recalls[model][sublinear])) | ||
|
||
""" | ||
<class 'sklearn.naive_bayes.MultinomialNB'> |
same comments
Removed.
# Some setup code taken from "Classifying Reuters-21578 collection with Python"
# on https://miguelmalvarez.com.
random.seed(2010)
I think that we usually use np.random.RandomState instead of random.
rng = np.random.RandomState(2010)
Changed.
import numpy as np
import random

from nltk.corpus import reuters
Removed NLTK. I was using the Reuters data set because it's imbalanced, but 20 Newsgroups also demonstrates the superiority of CNB.
train_docs = [reuters.raw(doc_id) for doc_id in train_docs_id]
test_docs = [reuters.raw(doc_id) for doc_id in test_docs_id]

train_labels = [random.choice(reuters.categories(doc_id))
random.choice could be replaced by rng.choice
Done.
recalls = {}

for model in [MultinomialNB, MultinomialCNB]:
    accuracies[model] = {}
It looks a bit weird to me to have a class as a dict key. @jnothman can we do that? I am intrigued.
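(A toy illustration of the answer, not from the PR: classes are ordinary hashable objects in Python, so using them as dict keys is perfectly legal, whether or not it reads well.)

from sklearn.naive_bayes import MultinomialNB

scores = {MultinomialNB: 0.91}  # the class object itself is the key
print(scores[MultinomialNB])    # 0.91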
    precisions[model] = {}
    recalls[model] = {}

    clf = model().fit(X_train_counts, train_labels)
I would probably pass an instance instead of the class. It will depend on the dictionary-key remark.
There are currently no narrative docs in doc/modules/naive_bayes.rst
@@ -0,0 +1,123 @@
"""
This filename is very non-standard. It should be plot_something, and ideally, you should plot something!
Name changed and plot added.
sklearn/naive_bayes.py
Outdated
@@ -726,6 +726,95 @@ def _joint_log_likelihood(self, X):
                self.class_log_prior_)


class MultinomialCNB(BaseDiscreteNB):
Should it just be ComplementNB?
Fine with me.
sklearn/naive_bayes.py
Outdated
    The Complement Naive Bayes classifier was designed to correct the "severe
    assumptions" made by the standard Multinomial Naive Bayes classifier. See
    Rennie et al. (2003) for further discussion.
Should say "Read more in User Guide" as most other classes do.
Added.
I think this is a well-known and useful NB variant. But I don't think the contribution yet meets our standards, in terms of documentation at least. When will it be merged? Perhaps September. When will it be released? Perhaps April 2018.
Btw, that merge estimate might be pessimistic, and the release estimate optimistic. Hard to say.
Thanks for the review, @jnothman and @glemaitre. Tried to incorporate all of your feedback in the latest push. Also added narrative documentation to doc/modules/naive_bayes.rst.
Scikit-learn does have a fetcher for Reuters btw: fetch_rcv1
Y'all mind taking another look, @jnothman and @glemaitre?
otherwise this LGTM
doc/modules/naive_bayes.rst
Outdated
-----------------------

:class:`ComplementNB` implements the complement naive Bayes (CNB) algorithm.
CNB is an adaption of the standard multinomial naive Bayes (MNB) algorithm that
"adaption" -> "adaptation"
Fixed.
@@ -0,0 +1,71 @@
"""
===========================================================
Can't you just add to examples/text/document_classification_20newsgroups.py? This code doesn't seem to illustrate anything more specific to MNB/CMNB.
Fine with me.
    def _count(self, X, Y):
        """Count feature occurrences."""
        if np.any((X.data if issparse(X) else X) < 0):
            raise ValueError("Input X must be non-negative")
not tested
I added a simple test to validate the counts.
I mean that you don't currently test that this error is raised. I think.
@jnothman - I added that test.
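A minimal sketch of what that test could look like (my illustration, assuming the final class name ComplementNB; the PR's actual test may differ):

import numpy as np
from sklearn.naive_bayes import ComplementNB
from sklearn.utils.testing import assert_raises  # scikit-learn's test helper of the era

def test_cnb_rejects_negative_input():
    # _count raises ValueError when any feature value is negative.
    X = np.array([[-1.0, 2.0], [3.0, 4.0]])
    y = np.array([0, 1])
    assert_raises(ValueError, ComplementNB().fit, X, y)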
sklearn/tests/test_naive_bayes.py
Outdated
    # Rennie et al. (2003).
    theta = np.array([
        [
            (0 + 1) / float(3 + 6),
At least in tests I'd much prefer a __future__ import over ugly casts...
I just changed some of the numbers to floats. A __future__ import could cause unexpected behavior in some of the other tests.
Let me know if there's anything else, @jnothman.
A __future__ import couldn't cause unexpected behaviour in other tests (unless they have explicit switches for py2 vs py3, which they don't) because we test on both versions.
@jnothman - ah, right. Added the __future__ import.
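For context, what the import changes (a generic illustration, not the PR's diff):

# Under Python 2, / on two ints is floor division, so (0 + 1) / (3 + 6) is
# silently 0 without a cast; the __future__ import gives Python 3 semantics.
from __future__ import division

print((0 + 1) / (3 + 6))                  # 0.111... on both Python 2 and 3
print((0 + 1) / (3 + 6) == 1 / float(9))  # True: the float() casts become redundant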
Please don't squash your commits. It makes it very hard for me to work out what code "I added a simple test to validate the counts." refers to. As it is, coveralls thinks the line still lacks coverage, and I can't see an assert_raises or similar in your code.
Needs adding to doc/modules/classes.rst
sklearn/tests/test_naive_bayes.py
Outdated
@@ -553,8 +553,10 @@ def test_cnb():
    # Classes are China (0), Japan (1).
    Y = np.array([0, 0, 0, 1])

    # Fit ComplementNB w/ alpha = 1.0.
    # Check the ability to predict the learning set.
This clearly doesn't apply to the assertion you've just added in.
@jnothman - added.
Now I'm okay with this. My only concern is that I'm not sure that this is much used in practice, and I keep seeing papers using MNB. Perhaps that's because it's not in scikit-learn?
@jnothman - that was my feeling; hence, the pull request! Is there anything else you need from me (e.g., following up)?
Merging, thanks @airalcorn2!
@@ -91,6 +91,10 @@ Classifiers and regressors
   during the first epochs of ridge and logistic regression.
   :issue:`8446` by `Arthur Mensch`_.

- Added :class:`naive_bayes.ComplementNB`, which implements the Complement
Argh! No! This is in the wrong place!
.. math::

   \hat{\theta}_{ci} = \frac{\sum{j:y_j \neq c} d_{ij} + \alpha_i}
I had meant to check, but forgot: this does not compile.
Firstly, there should be _ after all the \sums.
Secondly, we at least need blank lines between successive equations.
Thirdly, I'm not sure about the _i on alpha: it is present here, but not in the next line. I should probably double-check this with respect to the implementation!
And yet, I'm still getting TeX complaining of a runaway argument...
Are you able to check this and submit a PR to fix it?
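For reference, a corrected form along the lines the comment asks for (my reconstruction from this thread and the Rennie et al. (2003) estimator; double-check against the merged docs):

.. math::

   \hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}}
                            {\alpha + \sum_{j:y_j \neq c} \sum_{k} d_{kj}}

.. math::

   w_{ci} = \log \hat{\theta}_{ci}

.. math::

   w_{ci} = \frac{w_{ci}}{\sum_{j} |w_{cj}|}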
Reference Issue
N/A
What does this implement/fix? Explain your changes.
Implements the Complement Naive Bayes (CNB) classifier described in Rennie et al. (2003). CNB was designed to correct the "severe assumptions" made by the standard Multinomial Naive Bayes (MNB) classifier. As a result, CNB often achieves considerably better results than MNB on text classification tasks with imbalanced classes (as can be seen below); so much so that Apache Mahout includes an implementation of CNB alongside its MNB classifier. With that being the case, it would be nice to have an easily usable CNB implementation also available in scikit-learn.
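A quick usage sketch (mine, not the PR's example script), using 20 Newsgroups since the thread notes it shows the same effect as Reuters:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import ComplementNB, MultinomialNB

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# CNB typically edges out MNB here, especially on the rarer classes.
for Model in (MultinomialNB, ComplementNB):
    clf = Model(alpha=1.0).fit(X_train, train.target)
    macro_f1 = f1_score(test.target, clf.predict(X_test), average="macro")
    print(Model.__name__, round(macro_f1, 3))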
Any other comments?
Results from testing on Reuters-21578 (see example code).