
[MRG+2] Implement Complement Naive Bayes. #8190

Merged 3 commits on Aug 28, 2017
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -1081,6 +1081,7 @@ Model validation
naive_bayes.BernoulliNB
naive_bayes.GaussianNB
naive_bayes.MultinomialNB
naive_bayes.ComplementNB


.. _neighbors_ref:
40 changes: 40 additions & 0 deletions doc/modules/naive_bayes.rst
@@ -133,6 +133,46 @@ in further computations.
Setting :math:`\alpha = 1` is called Laplace smoothing,
while :math:`\alpha < 1` is called Lidstone smoothing.

.. _complement_naive_bayes:

Complement Naive Bayes
-----------------------

:class:`ComplementNB` implements the complement naive Bayes (CNB) algorithm.
CNB is an adaptation of the standard multinomial naive Bayes (MNB) algorithm
that is particularly suited for imbalanced data sets. Specifically, CNB uses
statistics from the *complement* of each class to compute the model's weights.
The inventors of CNB show empirically that the parameter estimates for CNB are
more stable than those for MNB. Further, CNB regularly outperforms MNB (often
by a considerable margin) on text classification tasks. The procedure for
calculating the weights is as follows:

.. math::

    \hat{\theta}_{ci} = \frac{\sum_{j:y_j \neq c} d_{ij} + \alpha_i}
                             {\sum_{j:y_j \neq c} \sum_{k} d_{kj} + \alpha}

    w_{ci} = \log \hat{\theta}_{ci}

    w_{ci} = \frac{w_{ci}}{\sum_{j} w_{cj}}

Member: I had meant to check, but forgot: this does not compile.

Firstly, there should be _ after all the \sums.

Secondly, we at least need blank lines between successive equations.

Thirdly, I'm not sure about the _i on alpha: it is present here, but not in the next line. I should probably double-check this with respect to the implementation!

And yet, I'm still getting TeX complaining of a runaway argument...

Are you able to check this and submit a PR to fix it?

where the summations are over all documents :math:`j` not in class :math:`c`,
:math:`d_{ij}` is either the count or tf-idf value of term :math:`i` in document
:math:`j`, and :math:`\alpha` is a smoothing hyperparameter like that found in
MNB. The second normalization addresses the tendency for longer documents to
dominate parameter estimates in MNB. The classification rule is:

.. math::

    \hat{c} = \arg\min_c \sum_{i} t_i w_{ci}

i.e., a document is assigned to the class that is the *poorest* complement
match.
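
As an informal sketch (not part of the estimator's code), the weight
computation above can be reproduced with plain NumPy on the toy corpus used in
``test_cnb`` further down in this PR; the array values and class labels are
taken from that test::

    import numpy as np

    # Toy corpus from test_cnb: rows are documents, columns are counts for
    # the terms Beijing, Chinese, Japan, Macao, Shanghai, Tokyo.
    X = np.array([[1, 1, 0, 0, 0, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 1, 0, 1, 0, 0],
                  [0, 1, 1, 0, 0, 1]], dtype=float)
    y = np.array([0, 0, 0, 1])  # classes: China (0), Japan (1)
    alpha = 1.0

    n_classes, n_features = 2, X.shape[1]
    weights = np.zeros((n_classes, n_features))
    for c in range(n_classes):
        # Smoothed term counts over the *complement* of class c
        # (all documents whose label is not c).
        comp = X[y != c].sum(axis=0) + alpha
        theta_c = comp / comp.sum()
        w = np.log(theta_c)
        # Second normalization: divide each row of log weights by its sum.
        weights[c] = w / w.sum()

    # `weights` matches ComplementNB(alpha=1.0).fit(X, y).feature_log_prob_,
    # as asserted in test_cnb.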

.. topic:: References:

* Rennie, J. D., Shih, L., Teevan, J., & Karger, D. R. (2003).
`Tackling the poor assumptions of naive bayes text classifiers.
<http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf>`_
In ICML (Vol. 3, pp. 616-623).

.. _bernoulli_naive_bayes:

4 changes: 4 additions & 0 deletions doc/whats_new.rst
@@ -91,6 +91,10 @@ Classifiers and regressors
during the first epochs of ridge and logistic regression.
:issue:`8446` by `Arthur Mensch`_.

- Added :class:`naive_bayes.ComplementNB`, which implements the Complement
  Naive Bayes classifier described in Rennie et al. (2003).
  By :user:`Michael A. Alcorn <airalcorn2>`.

Member: Argh! No! This is in the wrong place!

Other estimators

- Added the :class:`neighbors.LocalOutlierFactor` class for anomaly
3 changes: 2 additions & 1 deletion examples/text/document_classification_20newsgroups.py
@@ -42,7 +42,7 @@
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
@@ -283,6 +283,7 @@ def benchmark(clf):
print("Naive Bayes")
results.append(benchmark(MultinomialNB(alpha=.01)))
results.append(benchmark(BernoulliNB(alpha=.01)))
results.append(benchmark(ComplementNB(alpha=.1)))

print('=' * 80)
print("LinearSVC with L1-based feature selection")
93 changes: 92 additions & 1 deletion sklearn/naive_bayes.py
@@ -33,7 +33,7 @@
from .utils.validation import check_is_fitted
from .externals import six

__all__ = ['BernoulliNB', 'GaussianNB', 'MultinomialNB']
__all__ = ['BernoulliNB', 'GaussianNB', 'MultinomialNB', 'ComplementNB']


class BaseNB(six.with_metaclass(ABCMeta, BaseEstimator, ClassifierMixin)):
@@ -726,6 +726,97 @@ def _joint_log_likelihood(self, X):
self.class_log_prior_)


class ComplementNB(BaseDiscreteNB):
"""The Complement Naive Bayes classifier described in Rennie et al. (2003).

The Complement Naive Bayes classifier was designed to correct the "severe
assumptions" made by the standard Multinomial Naive Bayes classifier. It is
particularly suited for imbalanced data sets.

Read more in the :ref:`User Guide <complement_naive_bayes>`.

Parameters
----------
alpha : float, optional (default=1.0)
Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).

fit_prior : boolean, optional (default=True)
Only used in edge case with a single class in the training set.

class_prior : array-like, size (n_classes,), optional (default=None)
Prior probabilities of the classes. Not used.

Attributes
----------
class_log_prior_ : array, shape (n_classes, )
Smoothed empirical log probability for each class. Only used in edge
case with a single class in the training set.

feature_log_prob_ : array, shape (n_classes, n_features)
Empirical weights for class complements.

class_count_ : array, shape (n_classes,)
Number of samples encountered for each class during fitting. This
value is weighted by the sample weight when provided.

feature_count_ : array, shape (n_classes, n_features)
Number of samples encountered for each (class, feature) during fitting.
This value is weighted by the sample weight when provided.

feature_all_ : array, shape (n_features,)
Number of samples encountered for each feature during fitting. This
value is weighted by the sample weight when provided.

Examples
--------
>>> import numpy as np
>>> X = np.random.randint(5, size=(6, 100))
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> from sklearn.naive_bayes import ComplementNB
>>> clf = ComplementNB()
>>> clf.fit(X, y)
ComplementNB(alpha=1.0, class_prior=None, fit_prior=True)
>>> print(clf.predict(X[2:3]))
[3]

References
----------
Rennie, J. D., Shih, L., Teevan, J., & Karger, D. R. (2003).
Tackling the poor assumptions of naive bayes text classifiers. In ICML
(Vol. 3, pp. 616-623).
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
"""

def __init__(self, alpha=1.0, fit_prior=True, class_prior=None):
self.alpha = alpha
self.fit_prior = fit_prior
self.class_prior = class_prior

def _count(self, X, Y):
"""Count feature occurrences."""
if np.any((X.data if issparse(X) else X) < 0):
raise ValueError("Input X must be non-negative")
Member: not tested

Contributor Author: I added a simple test to validate the counts.

Member: I mean that you don't currently test that this error is raised. I think.

Contributor Author: @jnothman - I added that test.

self.feature_count_ += safe_sparse_dot(Y.T, X)
self.class_count_ += Y.sum(axis=0)
self.feature_all_ = self.feature_count_.sum(axis=0)

def _update_feature_log_prob(self, alpha):
"""Apply smoothing to raw counts and compute the weights."""
comp_count = self.feature_all_ + alpha - self.feature_count_
logged = np.log(comp_count / comp_count.sum(axis=1, keepdims=True))
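# Normalize each class's log weights by their sum (the second
# normalization described in the user guide / Rennie et al.).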
self.feature_log_prob_ = logged / logged.sum(axis=1, keepdims=True)

def _joint_log_likelihood(self, X):
"""Calculate the class scores for the samples in X."""
check_is_fitted(self, "classes_")

X = check_array(X, accept_sparse="csr")
jll = safe_sparse_dot(X, self.feature_log_prob_.T)
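# With a single class there is no complement information, so fall back
# to the class log prior (the only case in which it is used).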
if len(self.classes_) == 1:
jll += self.class_log_prior_
return jll


class BernoulliNB(BaseDiscreteNB):
"""Naive Bayes classifier for multivariate Bernoulli models.

67 changes: 66 additions & 1 deletion sklearn/tests/test_naive_bayes.py
@@ -1,3 +1,5 @@
from __future__ import division

import pickle
from io import BytesIO
import numpy as np
@@ -18,7 +20,8 @@
from sklearn.utils.testing import assert_greater
from sklearn.utils.testing import assert_warns

from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.naive_bayes import MultinomialNB, ComplementNB

# Data is just 6 separable points in the plane
X = np.array([[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]])
@@ -530,6 +533,68 @@ def test_bnb():
assert_array_almost_equal(clf.predict_proba(X_test), predict_proba)


def test_cnb():
# Tests ComplementNB when alpha=1.0 for the toy example in Manning,
# Raghavan, and Schuetze's "Introduction to Information Retrieval" book:
# http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

# Training data points are:
# Chinese Beijing Chinese (class: China)
# Chinese Chinese Shanghai (class: China)
# Chinese Macao (class: China)
# Tokyo Japan Chinese (class: Japan)

# Features are Beijing, Chinese, Japan, Macao, Shanghai, and Tokyo.
X = np.array([[1, 1, 0, 0, 0, 0],
[0, 1, 0, 0, 1, 0],
[0, 1, 0, 1, 0, 0],
[0, 1, 1, 0, 0, 1]])

# Classes are China (0), Japan (1).
Y = np.array([0, 0, 0, 1])

# Verify inputs are nonnegative.
clf = ComplementNB(alpha=1.0)
assert_raises(ValueError, clf.fit, -X, Y)

clf.fit(X, Y)

# Check that counts are correct.
feature_count = np.array([[1, 3, 0, 1, 1, 0], [0, 1, 1, 0, 0, 1]])
assert_array_equal(clf.feature_count_, feature_count)
class_count = np.array([3, 1])
assert_array_equal(clf.class_count_, class_count)
feature_all = np.array([1, 4, 1, 1, 1, 1])
assert_array_equal(clf.feature_all_, feature_all)

# Check that weights are correct. See steps 4-6 in Table 4 of
# Rennie et al. (2003).
theta = np.array([
[
(0 + 1) / (3 + 6),
(1 + 1) / (3 + 6),
(1 + 1) / (3 + 6),
(0 + 1) / (3 + 6),
(0 + 1) / (3 + 6),
(1 + 1) / (3 + 6)
],
[
(1 + 1) / (6 + 6),
(3 + 1) / (6 + 6),
(0 + 1) / (6 + 6),
(1 + 1) / (6 + 6),
(1 + 1) / (6 + 6),
(0 + 1) / (6 + 6)
]])

weights = np.zeros(theta.shape)
for i in range(2):
weights[i] = np.log(theta[i])
weights[i] /= weights[i].sum()

assert_array_equal(clf.feature_log_prob_, weights)


def test_naive_bayes_scale_invariance():
# Scaling the data should not change the prediction results
iris = load_iris()
18 changes: 9 additions & 9 deletions sklearn/utils/estimator_checks.py
@@ -115,12 +115,12 @@ def _yield_classifier_checks(name, classifier):
# basic consistency testing
yield check_classifiers_train
yield check_classifiers_regression_target
if (name not in
["MultinomialNB", "LabelPropagation", "LabelSpreading"] and
if (name not in ["MultinomialNB", "ComplementNB", "LabelPropagation",
"LabelSpreading"] and
# TODO some complication with -1 label
name not in ["DecisionTreeClassifier", "ExtraTreeClassifier"]):
# We don't raise a warning in these classifiers, as
# the column y interface is used by the forests.
name not in ["DecisionTreeClassifier", "ExtraTreeClassifier"]):
# We don't raise a warning in these classifiers, as
# the column y interface is used by the forests.

yield check_supervised_y_2d
# test if NotFittedError is raised
@@ -1088,7 +1088,7 @@ def check_classifiers_train(name, classifier_orig):
n_classes = len(classes)
n_samples, n_features = X.shape
classifier = clone(classifier_orig)
if name in ['BernoulliNB', 'MultinomialNB']:
if name in ['BernoulliNB', 'MultinomialNB', 'ComplementNB']:
X -= X.min()
set_random_state(classifier)
# raises error on malformed input for fit
@@ -1102,7 +1102,7 @@ def check_classifiers_train(name, classifier_orig):
y_pred = classifier.predict(X)
assert_equal(y_pred.shape, (n_samples,))
# training set performance
if name not in ['BernoulliNB', 'MultinomialNB']:
if name not in ['BernoulliNB', 'MultinomialNB', 'ComplementNB']:
assert_greater(accuracy_score(y, y_pred), 0.83)

# raises error on malformed input for predict
@@ -1245,8 +1245,8 @@ def check_classifiers_classes(name, classifier_orig):

classes = np.unique(y_)
classifier = clone(classifier_orig)
if name == 'BernoulliNB':
classifier.set_params(binarize=X.mean())
if name in ['BernoulliNB', 'ComplementNB']:
X = X > X.mean()
set_random_state(classifier)
# fit
classifier.fit(X, y_)