[WIP] Implement general naive Bayes #16281

Closed
wants to merge 93 commits into from

@remykarem remykarem commented Jan 29, 2020

Reference Issues/PRs

Fixes #15077. See also #10856.

What does this implement/fix? Explain your changes.

This implements general naive Bayes (GeneralNB) in addition to the existing naive Bayes implementations like GaussianNB and BernoulliNB.

This implementation lets different groups of features follow different distributional assumptions, namely the Bernoulli, Gaussian, multinomial, and categorical distributions. Through the API, the user specifies each distribution together with the features it applies to.

I have divided this description into 4 sections:

  1. Design and usage
  2. Under the hood
  3. Runtime checks
  4. Others

1. Design and usage

The design of the API is similar to that of ColumnTransformer and Pipeline. To specify that columns 0-2 and 3-4 are to be modelled with Gaussian and categorical naive Bayes respectively, indicate these in the GeneralNB constructor and fit accordingly:

>>> clf = GeneralNB(models=[
...     ("gaussian", GaussianNB(), [0, 1, 2]),
...     ("categorical", CategoricalNB(), [3, 4])
... ])
>>> clf.fit(X, y)

It also accepts lists of column names when the data to be fitted is a pandas DataFrame:

>>> clf = GeneralNB([
...     ("gaussian", GaussianNB(), ["a", "b", "c"]),
...     ("categorical", CategoricalNB(), ["d", "e"])
... ])

Lastly, similar to ColumnTransformer, it also accepts callables like make_column_selector to specify DataFrame columns:

>>> from sklearn.compose import make_column_selector
>>> clf = GeneralNB([
...     ("gaussian", GaussianNB(), make_column_selector(pattern=r"[abc]")),
...     ("categorical", CategoricalNB(), make_column_selector(pattern=r"[de]"))
... ])
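How these three kinds of column specification get normalized is not spelled out in this description. As a rough sketch only, assuming a hypothetical helper named resolve_columns (it is not part of this PR or of scikit-learn) and using a plain list of column names in place of a DataFrame, the idea could look like this:

```python
from typing import Callable, List, Sequence, Union

ColumnSpec = Union[Sequence[int], Sequence[str], Callable[[List[str]], List[str]]]


def resolve_columns(spec: ColumnSpec, column_names: List[str]) -> List[int]:
    """Turn a column specification into a list of positional indices.

    Hypothetical helper for illustration; `column_names` stands in for
    the columns of a DataFrame.
    """
    if callable(spec):
        # A selector-style callable returns the names it selects.
        spec = spec(column_names)
    indices = []
    for col in spec:
        if isinstance(col, str):
            # list.index raises ValueError for an unknown column name.
            indices.append(column_names.index(col))
        else:
            indices.append(int(col))
    return indices


names = ["a", "b", "c", "d", "e"]
by_position = resolve_columns([0, 1, 2], names)
by_name = resolve_columns(["d", "e"], names)
# The lambda is a stand-in for a selector; make_column_selector itself
# expects a DataFrame rather than a list of names.
by_callable = resolve_columns(lambda cols: [c for c in cols if c in "abc"], names)
```

Resolving everything to integer indices up front would let the later checks and the column slicing work the same way for arrays and DataFrames.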

The fitted estimators can be accessed by name through the named_models_ attribute. For example, to access the theta_ attribute of the gaussian model:

clf.named_models_.gaussian.theta_

2. Under the hood

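In GeneralNB.predict(), we sum the output of _joint_log_likelihood() over the n naive Bayes estimators. Each estimator's joint log likelihood already includes the class log prior log P(c), so the sum counts it n times; subtracting (n-1) log P(c) leaves it counted exactly once. In NumPy-style pseudocode:

# one (n_samples, n_classes) array per estimator, stacked along axis 0
jlls = np.stack([model._joint_log_likelihood() for model in models])
jlls = jlls - log_prior              # remove the prior from every estimator's term
jll = jlls.sum(axis=0) + log_prior   # add it back exactly once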
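To check the prior correction numerically, here is a minimal sketch for a single sample using plain Python lists. The function name combine_jll and all probability values are made up for illustration, and each per-model value is assumed to equal log P(c) plus the log likelihood of that model's feature block:

```python
import math
from typing import List


def combine_jll(jlls: List[List[float]], log_prior: List[float]) -> List[float]:
    """Combine per-model joint log likelihoods for one sample.

    jlls[k][c] is assumed to equal log P(c) + log P(x_k | c), where x_k is
    the block of features handled by model k. Hypothetical helper.
    """
    n_models = len(jlls)
    return [
        sum(jll[c] for jll in jlls) - (n_models - 1) * log_prior[c]
        for c in range(len(log_prior))
    ]


# Two classes, two models; the likelihood values are invented.
prior = [0.4, 0.6]
lik_model_a = [0.2, 0.5]  # P(x_a | c)
lik_model_b = [0.7, 0.1]  # P(x_b | c)

log_prior = [math.log(p) for p in prior]
jll_a = [lp + math.log(lik) for lp, lik in zip(log_prior, lik_model_a)]
jll_b = [lp + math.log(lik) for lp, lik in zip(log_prior, lik_model_b)]

combined = combine_jll([jll_a, jll_b], log_prior)
# combined[c] matches log(prior[c] * lik_model_a[c] * lik_model_b[c]),
# i.e. the prior enters the result once rather than twice.
```

Writing the correction as a single subtraction of (n-1) log P(c) is equivalent to the subtract-then-add-back form in the pseudocode.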

3. Runtime checks

Checks on self.models:

  • A column must not be assigned to more than one model.
  • The number of columns specified must equal the number of columns in X.
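The two checks above can be sketched as follows. This is only an illustration under stated assumptions: check_column_spec is a hypothetical name, and the columns are assumed to have been resolved to integer indices already.

```python
from typing import List, Sequence, Tuple


def check_column_spec(
    models: Sequence[Tuple[str, object, Sequence[int]]], n_features: int
) -> None:
    """Raise ValueError if columns are duplicated or do not cover X.

    Hypothetical helper; `models` holds (name, estimator, columns) triples
    whose columns are integer indices.
    """
    seen: List[int] = []
    for name, _, cols in models:
        for col in cols:
            if col in seen:
                raise ValueError(
                    f"Column {col} is specified more than once (model {name!r})."
                )
            seen.append(col)
    if len(seen) != n_features:
        raise ValueError(
            f"Expected {n_features} columns to be specified, got {len(seen)}."
        )


# The estimator slot is irrelevant to these checks, so None is used here.
spec = [("gaussian", None, [0, 1, 2]), ("categorical", None, [3, 4])]
check_column_spec(spec, n_features=5)  # passes silently
```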

The prior-related parameters must be identical across all naive Bayes estimators; otherwise the joint log likelihood is computed incorrectly. The following are checked for consistency:

  • class_prior* or priors^ (constructor parameters)
  • fit_prior* (constructor parameter)
  • class_log_prior_* or class_prior_^ (fitted attributes)

*used in BernoulliNB, MultinomialNB, ComplementNB, CategoricalNB
^used in GaussianNB
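A minimal sketch of such a consistency check on the constructor parameters is shown below. The function name check_prior_consistency is hypothetical, and SimpleNamespace objects stand in for the estimators so the example needs no scikit-learn install; it compares parameters only and does not look at fitted attributes.

```python
from types import SimpleNamespace


def check_prior_consistency(estimators) -> None:
    """Raise ValueError unless all estimators share the same prior settings.

    Hypothetical helper. GaussianNB-style estimators are assumed to expose
    `priors`; the discrete ones `class_prior` and `fit_prior`.
    """
    priors = []
    fit_priors = []
    for est in estimators:
        if hasattr(est, "priors"):
            priors.append(est.priors)
        else:
            priors.append(est.class_prior)
            fit_priors.append(est.fit_prior)
    # Convert to tuples so that list-valued priors can be compared in a set.
    normalized = [None if p is None else tuple(p) for p in priors]
    if len(set(normalized)) > 1:
        raise ValueError("All estimators must be given the same class prior.")
    if len(set(fit_priors)) > 1:
        raise ValueError("All estimators must use the same fit_prior setting.")


# Stand-ins for unfitted estimators.
gaussian = SimpleNamespace(priors=[0.3, 0.7])
categorical = SimpleNamespace(class_prior=[0.3, 0.7], fit_prior=True)
check_prior_consistency([gaussian, categorical])  # passes silently
```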

Data checks:
Methods like _check_X_y() and _check_X() check that the data container used during prediction is of the same type as the one used during fitting (e.g. NumPy array or pandas DataFrame).

4. Others

Partial fitting is not supported.
Currently, every column of X must be assigned to a model; leaving columns out is not supported.

Progress:

  • API design
  • Main code
  • Docs
  • Tests

Any other comments?

PR submitted.

@remykarem remykarem changed the title from "[MRG] Implement general naive Bayes" to "[WIP] Implement general naive Bayes" on May 11, 2020
@jnothman
Member

Thanks @remykarem. You have a linter failing.

@jnothman jnothman left a comment

Please see the linter build log for a list of errors. Do you need help resolving them?


# Subtract the class log prior from all the jlls
# but add it back after the summation
jlls = jlls - log_prior
Member

log_prior is among a handful of variables you've used without definition.

Author

Hi @jnothman I'm still working on this (switched this PR back to WIP). Some refactoring needed because I'm trying to fit in the remainder API.

Member

Unfortunately this kind of invalid code causes the linter to fail too...

Author

Okay will fix this.

# convert to feature if callable
self._cols = []
dict_col2model = {}
if callable(cols):
Member

cols is not defined here.

@remykarem
Author

> Please see the linter build log for a list of errors. Do you need help resolving them?

Yes please! I have been trying to figure this out. Sorry for not reaching out to you earlier.

Base automatically changed from master to main January 22, 2021 10:51
@avm19
Contributor

avm19 commented Feb 5, 2022

@remykarem have you abandoned this project or has it stalled for some other reason? Would you like someone to take over or chip in? I drafted my own wrapper for Naive Bayes a month ago and was thinking about contributing it, but now I discovered your work, which seems almost complete.

@remykarem
Author

remykarem commented Feb 6, 2022

> @remykarem have you abandoned this project or has it stalled for some other reason? Would you like someone to take over or chip in? I drafted my own wrapper for Naive Bayes a month ago and was thinking about contributing it, but now I discovered your work, which seems almost complete.

@avm19 Sorry, I got busy after a while and didn't manage to complete this. I think it would be great for someone to take over this project :)

@avm19
Contributor

avm19 commented Feb 11, 2022

take

This pull request was closed.
Labels
module:naive_bayes
Superseded (PR has been replaced by a newer PR)
Development

Successfully merging this pull request may close these issues.

Implement GeneralNB
4 participants