Skip to content

FIX discrete Naive Bayes model fitting for degenerate single-class case #18925

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Jan 24, 2021

Conversation

dpoznik
Copy link
Contributor

@dpoznik dpoznik commented Nov 27, 2020

closes #18974

When training multinomial naive Bayes models as part of a larger pipeline, it is possible for a degenerate case to arise, wherein there is just one class label. Prior to this change, it was possible to fit a single-class MultinomialNB model. One would expect that using such a model for prediction would deterministically yield the one class label. However, sometimes an IndexError would arise.

Ultimately, this traced to the fact that LabelBinarizer.transform returns an array of shape (n_samples, 1) when there are one or two classes, but naive_bayes._BaseDiscreteNB.fit and naive_bayes._BaseDiscreteNB.partial_fit (reasonably) assumed that if the return value had one column, that meant there were exactly two classes.

With this fix, the single-class case is handled differently from the two-class case, thereby ensuring the expected behavior and eliminating the source of the IndexError.

On this branch, all tests run via these three commands pass:

pytest sklearn/naive_bayes.py
pytest doc/modules/naive_bayes.rst
pytest sklearn/tests/test_naive_bayes.py

The last one includes a new regression test that fails on master.

@dpoznik dpoznik force-pushed the handle-degenerate-single-class-nb branch from d3793a0 to b3d62fa Compare November 27, 2020 04:28
@dpoznik dpoznik changed the title [MRG] Fix degenerate single-class Naive Bayes model fitting [WIP] Fix degenerate single-class Naive Bayes model fitting Nov 27, 2020
@@ -552,7 +552,7 @@ def partial_fit(self, X, y, classes=None, sample_weight=None):
if _check_partial_fit_first_call(self, classes):
# This is the first call to partial_fit:
# initialize various cumulative counters
n_effective_classes = len(classes) if len(classes) > 1 else 2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a small behavior change here when len(classes) == 1.
However, the docstring implies that len(classes) should be 2 for binary problems:

        classes : array-like of shape (n_classes,), default=None
            List of all the classes that can possibly appear in the y vector.

Assuming this is true, then I believe the only behavior change is for the degenerate single-class case, as intended.

@dpoznik dpoznik changed the title [WIP] Fix degenerate single-class Naive Bayes model fitting [MRG] Fix naive Bayes model fitting for degenerate single-class case Nov 27, 2020
@dpoznik dpoznik force-pushed the handle-degenerate-single-class-nb branch 2 times, most recently from 131c483 to 8f3c089 Compare November 28, 2020 14:16
@dpoznik dpoznik force-pushed the handle-degenerate-single-class-nb branch from 8f3c089 to 1d10e36 Compare November 28, 2020 14:22
@dpoznik dpoznik changed the title [MRG] Fix naive Bayes model fitting for degenerate single-class case [MRG] Fix discrete Naive Bayes model fitting for degenerate single-class case Dec 6, 2020
@glemaitre glemaitre self-requested a review December 17, 2020 22:17
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first pass. We will need to add an entry in the what's new v1.0.rst as a bug fix

@dpoznik
Copy link
Contributor Author

dpoznik commented Dec 18, 2020

We will need to add an entry in the what's new v1.0.rst as a bug fix

Cool, I've drafted this in 6e51074.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart of some style changes, LGTM.

dpoznik and others added 2 commits December 18, 2020 11:06
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@glemaitre glemaitre changed the title [MRG] Fix discrete Naive Bayes model fitting for degenerate single-class case FIX discrete Naive Bayes model fitting for degenerate single-class case Dec 18, 2020
@glemaitre
Copy link
Member

@thomasjpfan @lorentzenchr Would you like to have a second look

dpoznik and others added 2 commits January 18, 2021 07:53
…list-classes

Centralize lists of naive Bayes classes to streamline test parameterization
Base automatically changed from master to main January 22, 2021 10:53
Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dpoznik Thanks for fixing this. You need to resolve merge conflicts. Beware that we've changed the default branch from master to main. Let me/us know if you need help.

As per @lorentzenchr in code review: "Pickability is tested in
`check_estimators_pickle` (see `utils/estimator_checks.py`)
in `tests/test_common.py`. I verified it by running
`pytest sklearn/tests/test_common.py -v -k BernoulliNB` and so on."
@dpoznik
Copy link
Contributor Author

dpoznik commented Jan 23, 2021

Thanks for fixing this.

No problem; it was fun :)
Thanks for reviewing!

You need to resolve merge conflicts.

This should be all set with 9bbfb22.

Beware that we've changed the default branch from master to main.

Cool; thanks for the heads-up!

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lorentzenchr
Copy link
Member

@glemaitre Before merging, could you shortly skim over the last changes - just to be sure:smirk:

@glemaitre glemaitre merged commit 6f32544 into scikit-learn:main Jan 24, 2021
@glemaitre
Copy link
Member

Thanks @dpoznik Everything seems good. Merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Discrete Naive Bayes classifiers crash unnecessarily on degenerate one-class data
4 participants