
[MRG+1] FEA Add Categorical Naive Bayes #12569


Merged

87 commits merged into scikit-learn:master on Sep 23, 2019

Conversation

timbicker
Contributor

@timbicker timbicker commented Nov 12, 2018

Reference Issues/PRs

Implements Categorical NB from #10856

What does this implement/fix? Explain your changes.

This implementation adds the NB classifier for categorical features.

Any other comments?

  • implement categorical NB functionality
  • check and write docstrings
  • add full test coverage
  • write documentation

@timbicker
Contributor Author

There might be a problem when the input array X of the predict function contains unseen categories.
From a mathematical point of view, the probability of an unseen category is 0. The likelihood term of Bayes' theorem therefore becomes 0, and the probability of the sample is 0.
This could be an unwanted error, because the user might expect all possible categories to be present in the training set. It could also be something the user wants to know about while still using the prediction, or they might simply not care.
I therefore added a new attribute, on_unseen_cats, which controls whether the classifier raises an error, issues a warning, or ignores it.
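The collapse described here can be sketched in a few lines of NumPy (an illustration of the math only, not this PR's code; the probabilities are made up):

```python
import numpy as np

# Hypothetical per-category probabilities learned for one feature and
# one class; only categories 0..2 were seen during training.
category_log_prob = np.log(np.array([0.5, 0.3, 0.2]))

def feature_log_likelihood(category):
    if category < len(category_log_prob):
        return category_log_prob[category]
    # An unseen category has estimated probability 0, so its log-
    # probability is -inf and the whole posterior collapses to 0.
    return -np.inf

print(feature_log_likelihood(1))  # finite: category seen in training
print(feature_log_likelihood(5))  # -inf: unseen category
```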

@FlorianWilhelm
Contributor

@timbicker After having thought a bit about it, I think that setting the probability to 1 for an unseen category would make more sense than 0. Given a particular feature, an unseen category would then yield 1 for all classes, and the feature would effectively be ignored, since the other features could still determine the right class. If all categories of all features are unseen, the decision would be made automatically by the class probabilities.

@timbicker
Contributor Author

> @timbicker After having thought a bit about it, I think that setting the probability to 1 for an unseen category would make more sense than 0. Given a particular feature, an unseen category would then yield 1 for all classes, and the feature would effectively be ignored, since the other features could still determine the right class. If all categories of all features are unseen, the decision would be made automatically by the class probabilities.

Yes, I agree. This makes more sense in my opinion.

For the other case, that the category is unseen for a subset of the classes, we have the smoothing parameter.
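How the smoothing parameter keeps within-class probabilities positive for categories that a particular class never saw can be sketched as follows (illustrative counts only, not this PR's internals):

```python
import numpy as np

alpha = 1.0  # smoothing parameter

# Counts of categories 0..2 of one feature within one class; category 2
# was never observed together with this class.
cat_count = np.array([4.0, 6.0, 0.0])
n_categories = cat_count.shape[0]

# Smoothed estimate of P(category | class): the unseen category gets
# (0 + alpha) / (10 + 3 * alpha) instead of an exact zero.
smoothed = (cat_count + alpha) / (cat_count.sum() + alpha * n_categories)
print(smoothed)
```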

@timbicker
Contributor Author

The problem is that check_array(X, accept_sparse=False, force_all_finite=True, dtype='int') first changes the dtype of X and then checks for nan or inf. This way, nan and inf are converted to an integer and no error is raised anymore. I am not sure if this is intended.

> Hmm, I'm unable to reproduce, please provide a code snippet, thanks.

import numpy as np
from sklearn.utils import check_array
from sklearn.utils.testing import assert_raises

rnd = np.random.RandomState(0)
X_train_nan = rnd.uniform(size=(10, 3))
X_train_nan[0, 0] = np.nan
X_train_inf = rnd.uniform(size=(10, 3))
X_train_inf[0, 0] = np.inf

for X_train in [X_train_nan, X_train_inf]:
    assert_raises(ValueError, check_array, X_train, accept_sparse=False, force_all_finite=True)
    # assert_raises(ValueError, check_array, X_train, accept_sparse=False, force_all_finite=True, dtype='int')
    X = check_array(X_train, accept_sparse=False, force_all_finite=True, dtype='int')
    print(X[0, 0])

If you uncomment the commented line, you will see that no error is raised. Instead, np.nan and np.inf are converted to int.

@qinhanmin2014
Member

Thanks, I was unable to reproduce because I used a list.
Please ignore it here. I'll open an issue and we'll fix this in check_array.

@FlorianWilhelm
Contributor

@qinhanmin2014 Thanks for your review. If you have no further comments, should @timbicker then change this PR to [MRG+1] so that others like @amueller, @NicolasHug, and @jnothman can review?

@qinhanmin2014
Member

See the previous comment; please avoid calling check_array twice.

@timbicker
Contributor Author

I am waiting for the other PR #14872 to be merged; otherwise, the tests of this PR would fail. Or should I fix it already?

@qinhanmin2014
Member

> I am waiting for the other PR #14872 to be merged; otherwise, the tests of this PR would fail. Or should I fix it already?

which tests?

@timbicker
Contributor Author

check_estimators_nan_inf in sklearn/utils/estimator_checks.py

The output is as follows:

F
Estimator doesn't check for NaN and inf in fit. CategoricalNB(alpha=1.0, class_prior=None, fit_prior=True) 'list' argument must have no negative elements
Traceback (most recent call last):
  File "/Users/tbicker/PRs/scikit-learn/sklearn/utils/estimator_checks.py", line 1333, in check_estimators_nan_inf
    estimator.fit(X_train, y)
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 1109, in fit
    return super().fit(X, y, sample_weight=sample_weight)
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 635, in fit
    self._count(X, Y)
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 1200, in _count
    self.class_count_.shape[0])
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 1188, in _update_cat_count
    counts = np.bincount(X_feature[mask], weights=weights)
ValueError: 'list' argument must have no negative elements

sklearn/tests/test_common.py:93 (test_estimators[CategoricalNB()-check_estimators_nan_inf1])
estimator = CategoricalNB(alpha=1.0, class_prior=None, fit_prior=True)
check = functools.partial(<function check_estimators_nan_inf at 0x1a1792a510>, 'CategoricalNB')

    @parametrize_with_checks(_tested_estimators())
    def test_estimators(estimator, check):
        # Common tests for estimator instances
        with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            set_checking_parameters(estimator)
>           check(estimator)

test_common.py:100: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../utils/testing.py:326: in wrapper
    return fn(*args, **kwargs)
../utils/estimator_checks.py:1338: in check_estimators_nan_inf
    raise e
../utils/estimator_checks.py:1333: in check_estimators_nan_inf
    estimator.fit(X_train, y)
../naive_bayes.py:1109: in fit
    return super().fit(X, y, sample_weight=sample_weight)
../naive_bayes.py:635: in fit
    self._count(X, Y)
../naive_bayes.py:1200: in _count
    self.class_count_.shape[0])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

X_feature = array([-9223372036854775808,                    0,                    0,
                          0,                 ...              0,
                          0,                    0,                    0,
                          0])
Y = array([[1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1]])
cat_count = array([[0.],
       [0.]]), n_classes = 2

    def _update_cat_count(X_feature, Y, cat_count, n_classes):
        for j in range(n_classes):
            mask = Y[:, j].astype(bool)
            if Y.dtype.type == np.int64:
                weights = None
            else:
                weights = Y[mask, j]
>           counts = np.bincount(X_feature[mask], weights=weights)
E           ValueError: 'list' argument must have no negative elements

../naive_bayes.py:1188: ValueError

We assume that X contains only integer values >= 0. Because check_array converts np.nan and np.inf to a large negative int, X contains a negative value, and np.bincount consequently fails.
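The failure mode can be reproduced directly with np.bincount (the most negative int64 is the value the cast produced in the traceback above; the exact result of casting non-finite floats to int is platform-dependent):

```python
import numpy as np

# The value that np.nan / np.inf were cast to in the traceback above.
bad = np.iinfo(np.int64).min  # -9223372036854775808

X_feature = np.zeros(10, dtype=np.int64)
X_feature[0] = bad

# np.bincount requires non-negative integers, hence the ValueError.
try:
    np.bincount(X_feature)
except ValueError as exc:
    print(type(exc).__name__)
```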

@qinhanmin2014 qinhanmin2014 changed the title [MRG] Add Categorical Naive Bayes [MRG+1] FEA Add Categorical Naive Bayes Sep 9, 2019
Member

@jnothman jnothman left a comment


It would be good to test other invariances: invariance under sample permutation, invariance under class label permutation up to ties, and maybe a test for how tie breaking is done to avoid regressions.

Still to review main code
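A sample-permutation invariance test along the suggested lines might look like this (a sketch only, not the test that was actually added to the PR):

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.RandomState(0)
X = rng.randint(4, size=(30, 3))
y = rng.randint(2, size=30)

# Fitting on a shuffled copy of the training set must yield the same
# predictions as fitting on the original ordering.
perm = rng.permutation(len(y))
clf_a = CategoricalNB().fit(X, y)
clf_b = CategoricalNB().fit(X[perm], y[perm])
assert np.array_equal(clf_a.predict(X), clf_b.predict(X))
```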

Member

@jnothman jnothman left a comment


Otherwise LGTM. 👀

@timbicker
Contributor Author

@jnothman and @qinhanmin2014, thanks for your remarks.

@jnothman jnothman merged commit 4e9f97d into scikit-learn:master Sep 23, 2019
@jnothman
Member

Thanks and congratulations @timbicker and @FlorianWilhelm!

@FlorianWilhelm
Contributor

@timbicker, great job! Thanks to everyone involved.

@Sandy4321

Can you please clarify what you mean here, in https://github.com/scikit-learn/scikit-learn/issues/15077, and in

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB

under the feature distributions section:

> discrete features that are categorically distributed. The categories of each feature are drawn from a categorical distribution.

It seems you mean that the distributions are estimated from the data? For a given categorical feature, are the probabilities and conditional probabilities (of values given the target) calculated from the data?

As mentioned in
https://datascience.stackexchange.com/questions/58720/naive-bayes-for-categorical-features-non-binary
some people recommend using MultinomialNB, which in my opinion doesn't make sense because it considers feature values to be frequency counts.

You never use some known pdf to fit the data for each feature?

Do you have an example specific to real categorical data, with several important features and several unimportant ones, rather than this very generic example?

import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
clf = CategoricalNB()
clf.fit(X, y)
print(clf.predict(X[2:3]))

from
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB

thanks a lot in advance
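A slightly less generic illustration than the docs snippet, on synthetic data with one feature that tends to agree with the label and one pure-noise feature (an editorial sketch, not an official scikit-learn example):

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.RandomState(0)
n = 1000
y = rng.randint(2, size=n)

# Informative feature: equals the label 90% of the time.
informative = np.where(rng.uniform(size=n) < 0.9, y, 1 - y)
# Uninformative feature: a random category, independent of y.
noise = rng.randint(5, size=n)

X = np.column_stack([informative, noise])
clf = CategoricalNB().fit(X, y)
# Training accuracy is driven almost entirely by the informative feature.
print(clf.score(X, y))
```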

@@ -49,6 +51,12 @@ def _joint_log_likelihood(self, X):
predict_proba and predict_log_proba.
"""

@abstractmethod
def _check_X(self, X):
Contributor


Having this abstract method added breaks downstream code.

Which is OK, but the change being hidden in this huge PR made it a bit more difficult than necessary to dig up whether there were any relevant comments about the change, etc.

Member


Yes, we are insufficiently aware of the impact on downstream projects. See #15992.

I also think we neglect to consider people inheriting from our objects. Can you submit a PR to remove the abstractmethod for 0.22.1, perhaps?

Contributor


see #15996
