[MRG+2] Implement Complement Naive Bayes. #8190
@@ -91,6 +91,10 @@ Classifiers and regressors

     during the first epochs of ridge and logistic regression.
     :issue:`8446` by `Arthur Mensch`_.

+   - Added :class:`naive_bayes.ComplementNB`, which implements the Complement
+     Naive Bayes classifier described in Rennie et al. (2003).
+     By :user:`Michael A. Alcorn <airalcorn2>`.
+
    Other estimators

    - Added the :class:`neighbors.LocalOutlierFactor` class for anomaly

Review comment on the added entry: "Argh! No! This is in the wrong place!"
@@ -33,7 +33,7 @@

 from .utils.validation import check_is_fitted
 from .externals import six

-__all__ = ['BernoulliNB', 'GaussianNB', 'MultinomialNB']
+__all__ = ['BernoulliNB', 'GaussianNB', 'MultinomialNB', 'ComplementNB']


 class BaseNB(six.with_metaclass(ABCMeta, BaseEstimator, ClassifierMixin)):

@@ -726,6 +726,97 @@ def _joint_log_likelihood(self, X):

         self.class_log_prior_)
class ComplementNB(BaseDiscreteNB):
    """The Complement Naive Bayes classifier described in Rennie et al. (2003).

    The Complement Naive Bayes classifier was designed to correct the "severe
    assumptions" made by the standard Multinomial Naive Bayes classifier. It is
    particularly suited for imbalanced data sets.

    Read more in the :ref:`User Guide <complement_naive_bayes>`.

    Parameters
    ----------
    alpha : float, optional (default=1.0)
        Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).

    fit_prior : boolean, optional (default=True)
        Only used in edge case with a single class in the training set.

    class_prior : array-like, size (n_classes,), optional (default=None)
        Prior probabilities of the classes. Not used.

    Attributes
    ----------
    class_log_prior_ : array, shape (n_classes,)
        Smoothed empirical log probability for each class. Only used in edge
        case with a single class in the training set.

    feature_log_prob_ : array, shape (n_classes, n_features)
        Empirical weights for class complements.

    class_count_ : array, shape (n_classes,)
        Number of samples encountered for each class during fitting. This
        value is weighted by the sample weight when provided.

    feature_count_ : array, shape (n_classes, n_features)
        Number of samples encountered for each (class, feature) during
        fitting. This value is weighted by the sample weight when provided.

    feature_all_ : array, shape (n_features,)
        Number of samples encountered for each feature during fitting. This
        value is weighted by the sample weight when provided.

    Examples
    --------
    >>> import numpy as np
    >>> X = np.random.randint(5, size=(6, 100))
    >>> y = np.array([1, 2, 3, 4, 5, 6])
    >>> from sklearn.naive_bayes import ComplementNB
    >>> clf = ComplementNB()
    >>> clf.fit(X, y)
    ComplementNB(alpha=1.0, class_prior=None, fit_prior=True)
    >>> print(clf.predict(X[2:3]))
    [3]

    References
    ----------
    Rennie, J. D., Shih, L., Teevan, J., & Karger, D. R. (2003).
    Tackling the poor assumptions of naive bayes text classifiers. In ICML
    (Vol. 3, pp. 616-623).
    http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
    """

    def __init__(self, alpha=1.0, fit_prior=True, class_prior=None):
        self.alpha = alpha
        self.fit_prior = fit_prior
        self.class_prior = class_prior

    def _count(self, X, Y):
        """Count feature occurrences."""
        if np.any((X.data if issparse(X) else X) < 0):
            raise ValueError("Input X must be non-negative")

Review thread on the line above:
> "not tested"
> "I added a simple test to validate the counts."
> "I mean that you don't currently test that this error is raised. I think."
> "@jnothman - I added that test."

        self.feature_count_ += safe_sparse_dot(Y.T, X)
        self.class_count_ += Y.sum(axis=0)
        self.feature_all_ = self.feature_count_.sum(axis=0)

    def _update_feature_log_prob(self, alpha):
        """Apply smoothing to raw counts and compute the weights."""
        comp_count = self.feature_all_ + alpha - self.feature_count_
        logged = np.log(comp_count / comp_count.sum(axis=1, keepdims=True))
        self.feature_log_prob_ = logged / logged.sum(axis=1, keepdims=True)

    def _joint_log_likelihood(self, X):
        """Calculate the class scores for the samples in X."""
        check_is_fitted(self, "classes_")

        X = check_array(X, accept_sparse="csr")
        jll = safe_sparse_dot(X, self.feature_log_prob_.T)
        if len(self.classes_) == 1:
            jll += self.class_log_prior_
        return jll
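The fit/predict math in `_count`, `_update_feature_log_prob`, and `_joint_log_likelihood` can be sketched standalone in NumPy (a hypothetical illustration of the same computations, not the scikit-learn code; variable names mirror the estimator attributes above):

```python
import numpy as np

# Hypothetical standalone sketch of the ComplementNB math above.
rng = np.random.RandomState(0)
X = rng.randint(5, size=(6, 8)).astype(float)  # 6 documents, 8 term counts
y = np.array([0, 0, 1, 1, 2, 2])
alpha = 1.0

classes = np.unique(y)
Y = (y[:, None] == classes[None, :]).astype(float)  # one-hot label matrix

# _count: per-(class, feature) totals and per-feature totals over all classes.
feature_count = Y.T @ X
feature_all = feature_count.sum(axis=0)

# _update_feature_log_prob: smoothed counts of each class's *complement*
# (every other class), turned into normalized log weights.
comp_count = feature_all + alpha - feature_count
logged = np.log(comp_count / comp_count.sum(axis=1, keepdims=True))
feature_log_prob = logged / logged.sum(axis=1, keepdims=True)

# _joint_log_likelihood: class scores are a plain dot product; argmax wins.
jll = X @ feature_log_prob.T
pred = classes[jll.argmax(axis=1)]
```

Because each row of `logged` is divided by its own sum, every row of `feature_log_prob` sums to exactly 1, which is the weight-normalization step from Rennie et al.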
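The review discussion above asks for a test that the non-negativity `ValueError` is actually raised. A minimal self-contained sketch of such a test (hypothetical; it re-implements just the dense-input guard from `_count` rather than importing scikit-learn):

```python
import numpy as np

def _check_non_negative(X):
    # Dense-only version of the guard in ComplementNB._count above.
    if np.any(X < 0):
        raise ValueError("Input X must be non-negative")

# The guard should reject any negative feature value...
X_bad = np.array([[1.0, -1.0], [2.0, 3.0]])
try:
    _check_non_negative(X_bad)
    raised = False
except ValueError:
    raised = True

# ...and accept ordinary non-negative count data.
_check_non_negative(np.abs(X_bad))
```

In the real test suite this would use `assert_raises`/`pytest.raises` against `ComplementNB().fit`, which also exercises the sparse branch of the check.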
class BernoulliNB(BaseDiscreteNB):
    """Naive Bayes classifier for multivariate Bernoulli models.
Review comment on the documentation:

> I had meant to check, but forgot: this does not compile.
>
> Firstly, there should be a `_` after all the `\sum`s. Secondly, we at
> least need blank lines between successive equations. Thirdly, I'm not
> sure about the `_i` on alpha: it is present here, but not in the next
> line. I should probably double-check this with respect to the
> implementation! And yet, I'm still getting TeX complaining of a runaway
> argument...
>
> Are you able to check this and submit a PR to fix it?
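For reference, a version of the Rennie et al. weight equations that should compile, with the subscripts the comment asks for (a reconstruction from the implementation above, not the exact text of the docs being reviewed; note `\alpha_i` appears in the numerator while the plain sum `\alpha = \sum_i \alpha_i` appears in the denominator, which may explain the reviewer's uncertainty):

```latex
\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j : y_j \neq c} d_{ij}}
                         {\alpha + \sum_{j : y_j \neq c} \sum_{k} d_{kj}}

w_{ci} = \log \hat{\theta}_{ci}

w_{ci} = \frac{w_{ci}}{\sum_{j} \left| w_{cj} \right|}
```

Here `d_{ij}` is the count (or tf-idf value) of term `i` in document `j`, and the complement sums run over all documents *not* in class `c`, matching `feature_all_ - feature_count_` in `_update_feature_log_prob`.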