ENH add Naive Bayes Metaestimator ColumnwiseNB (aka "GeneralNB") #22574
Conversation
This PR needs attention.
@jnothman Would you be able to review this PR or advise on how to proceed to get it merged in the foreseeable future? I am asking you because you thumbs-upped. Thanks!
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
The inability to handle … The bits we are trying to re-use are … I suggest merging this PR as it is now. It is a perfectly working contribution that does the job a user expects of it. Any refactoring and re-use of ColumnTransformer won't affect the public API and can be done in subsequent PRs. Metadata routing can also follow later. What do you think, @glemaitre?
It is not affecting the public API, but it will affect maintenance. I am personally uneasy merging something in which we already foresee internal refactoring.
I don't know if a refactoring will be necessary or likely. I understand that you prefer your proposal of using …
I think I agree with @glemaitre here that having column-transformer-like code in this estimator is much harder to maintain.
Do you think that maintaining a … To this calculation one should add the tricks and changes necessary for …
This PR is ready for review.
Dear reviewer, please don't be discouraged by the amount of text below! The enumerated comments are optional and simply document certain considerations that I had while implementing this PR.
Reference Issues/PRs
Fixes #12957 #15077 (issues)
Supersedes #16281 (stalled PR)
Somewhat related to #10856
Incorporates changes from PR #22565, where I add abstract methods in `_BaseDiscreteNB`.
What does this implement/fix? Explain your changes.
This implements a meta-estimator that mixes existing naive Bayes classifiers: `GaussianNB`, `BernoulliNB`, and others. The interface is inspired by `ColumnTransformer`, as was previously suggested by @timbicker, @jnothman and others.

The meta-estimator combines multiple naive Bayes estimators by expressing the overall joint probability `P(x,y)` through `P(x_i,y)`, the joint probabilities of the subestimators:

`Log P(x,y) = Log P(x_1,y) + ... + Log P(x_N,y) - (N - 1) Log P(y)`,

where `N` is the number of estimators (3 in the example below). The joint probabilities `P(x_i,y)` are exposed through the private method `_joint_log_likelihood(self, X)` in any instance of the `_BaseNB` class, which is a parent to all naive Bayes classifiers implemented so far, including the proposed meta-estimator. See further equations and a detailed explanation inside the collapsible section in this comment.
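For illustration, here is a minimal usage sketch with three subestimators. This is a sketch under assumptions, not code from this PR: it assumes `ColumnwiseNB` is importable from `sklearn.naive_bayes` on this PR's branch and uses `nb_estimators` as the parameter name, which was still under discussion (see the updates below).

```python
# Hypothetical usage of the proposed meta-estimator; assumes this PR's branch
# is installed and that the parameter is named `nb_estimators`.
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB, ColumnwiseNB

rng = np.random.RandomState(0)
X = np.hstack([
    rng.normal(size=(150, 2)),          # columns 0-1: continuous features
    rng.randint(0, 2, size=(150, 2)),   # columns 2-3: binary features
    rng.randint(0, 10, size=(150, 3)),  # columns 4-6: count features
])
y = rng.randint(0, 3, size=150)

# Each (name, estimator, columns) triple routes a column subset to one
# naive Bayes subestimator, mirroring the ColumnTransformer interface.
clf = ColumnwiseNB(nb_estimators=[
    ("gnb", GaussianNB(), [0, 1]),
    ("bnb", BernoulliNB(), [2, 3]),
    ("mnb", MultinomialNB(), [4, 5, 6]),
])
clf.fit(X, y)
print(clf.predict_proba(X[:3]))
```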
Any other comments?
1. The meta-estimator was called `GeneralNB` in the past, cf. stalled PR [WIP] Implement general naive Bayes #16281. I think terms like "general" or "generalized" are too ambiguous. For example, Bernoulli and Gaussian distributions are both members of the general exponential family of distributions -- is this the generalisation we refer to? Another generalisation is having different distribution families for each class, e.g., `P(x|y=0)~N` and `P(x|y=1)~Mult`, or working with any user-provided distribution. In addition, I don't recall seeing "the general naive Bayes" in the literature: all models were just naive Bayes with further assumptions of varying degrees of generality. In contrast, `ColumnwiseNB` is specific: it tells the user that something is going on with columns and forms a very appropriate association with `ColumnTransformer`. I think this is a much better name, and maybe there are even better options. I'll be fine with any decision, but I think `ColumnwiseNB` would be a good choice.
2. It would also help if someone wrote an example using my implementation, while testing it at the same time. I guess a paragraph or two should be added to the Guide as well. Shall we leave all this for another PR? [Updated on 2022-04-10] A section is added to the naive Bayes guide. An auto-example (titanic dataset) is created for the gallery.
3. I implemented `fit` and `partial_fit` following `_BaseDiscreteNB` and took `ColumnTransformer`'s implementation as a blueprint for parallelism. I am not very familiar with the inner workings of `joblib`, and I ask reviewers to pay extra attention here. [Related comment by jnothman: "... I suspect that fitting also will not benefit much from parallelism, but this may be more of an empirical question."]
4. `_BaseComposition` is a parent class, and get/set methods for parameters have been implemented and tested. Subestimators are cloned prior to fitting.
5. A `remainder` parameter, (un)like in `ColumnTransformer`, complicates the logic a little bit. Do you think it would be really nice to have it? Maybe leave it till the next PR then.
6. `GaussianNB` and `_BaseDiscreteNB` use different conventions. For example, `priors` vs `class_prior` as hyperparameters and `class_log_prior_` vs `class_prior_` as attributes. I do not know which convention is preferable. Just a remark.
7. The decomposition `Log P(x,y) = Log P(x_1,y) + ... + Log P(x_N,y) - (N - 1) Log P(y)` requires that the class log priors are finite and agree between the meta-estimator and the subestimators: `-inf < Log P(y) = Log P(y|1) = ... = Log P(y|N)`. The meta-estimator does not check this condition. Meaningless results, including `NaN`, may be produced by the meta-estimator if the class priors differ or contain a zero probability (a small numeric sketch of this failure mode follows this list). Usually this is not a problem, because all children of `_BaseNB` calculate class priors directly as relative frequencies; unseen classes from `partial_fit` are on the user, not on us. But in general, class priors can be estimated differently or provided by the user in the subestimators and the meta-estimator independently. We can distinguish two issues. First, what if the subestimators' class priors do not agree with the meta-estimator's prior? Then we could "marry" the subestimators not "at joint probabilities" but "at conditional probabilities", just as naive Bayes prescribes: `Log P(x|y) = Log P(x_1|y) + ... + Log P(x_N|y)`, or equivalently `Log P(x,y) = Log P(x_1,y) - Log P(y|1) + ... + Log P(x_N,y) - Log P(y|N) + Log P(y)`, where `Log P(y|i)` is the class prior from the i-th subestimator. Second, what if some prior probabilities are zero? A problem arises only when the zero is found in the meta-estimator's class prior. This is something to think about. I wonder if someone has come across a paper or a textbook that discusses and resolves this issue. [Update: see most recent comments 1 and 2]

Finally, I would appreciate any comments and suggestions, especially those addressing mistakes or aimed at efficient programming and "good code". I am glad to be learning from this community.
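A small numeric sketch of the zero-prior failure mode mentioned in the last item; the numbers are made up and only NumPy is used:

```python
# With N = 2 subestimators, the combination
#   Log P(x,y) = Log P(x_1,y) + Log P(x_2,y) - (N - 1) Log P(y)
# yields NaN when a class prior is zero: the -inf terms do not cancel cleanly.
import numpy as np

jll_1 = np.log(np.array([0.0, 0.30]))      # subestimator 1: class 0 unseen, so P(x_1, y=0) = 0
jll_2 = np.log(np.array([0.20, 0.25]))     # subestimator 2
log_prior = np.log(np.array([0.0, 0.55]))  # meta-estimator's class prior, zero for class 0

combined = jll_1 + jll_2 - (2 - 1) * log_prior
print(combined)  # first entry is nan: -inf + finite - (-inf)
```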
Updated on Wednesday, February 23, 2022 05:57:39 UTC:
`estimatorNBs` vs `estimators`. Initially, the parameter's name was "estimators", by analogy with "transformers". It turns out that many common tests failed simply because they duck-type models based on the presence of `estimators` in the parameter list. As a result, the tests assume that they're dealing with something like VotingClassifier and feed it LogisticRegression -- something that ColumnwiseNB is not supposed to accept. I took the path of least resistance and renamed `estimators` to `estimatorNBs`. [Further update: following jnothman's comment, the parameter was later renamed to `nb_estimators`. glemaitre thinks the name should be `estimators`. My comment summarises my thoughts and links to the duck-typing issue.]

The only change I made to pre-existing test files is adding `ColumnwiseNB` to `VALIDATE_ESTIMATOR_INIT`, a list that exempts `ColumnTransformer` and other estimators from a certain test. [Updated on 2022-05-29] The problematic test is passed by catching the error when the `_estimators` setter in `ColumnwiseNB` fails to unpack parameters passed by the test. See FIX Remove validation from `__init__` and `set_params` for ColumnTransformer #22537 for the similar change in `ColumnTransformer`.
.Updated on Wednesday, February 23, 2022 09:14:13 UTC:
… `pytest`. As a result, I cannot pytest such an output. Could someone please explain this unexpected behaviour or suggest a solution?

Updated on Sunday, February 27, 2022 09:00:13 UTC:
`n_features_in_`. Advice needed. First, I do not fully understand the purpose of this attribute and whether it is needed given we use `feature_names_in_`. Second, setting it up using `BaseEstimator._check_n_features` can be problematic, since this method expects a "converted array", and I currently avoid converting `X` in the meta-estimator.

[Updated on 2022-10-04]: Although `BaseEstimator._check_n_features` is typically passed converted arrays (e.g., from `BaseEstimator._validate_data` or `ColumnTransformer.fit_transform`), it seems to work fine with pandas dataframes too. All examples work and all tests pass.

Updated on Thursday, April 7, 2022 17:14:08 UTC:
An alternative design would use `VotingClassifier` as a prototype instead of `ColumnTransformer`. The difference is that we would be passing a list of tuples `(str, estimator)` instead of `(str, estimator, columns)`. In the former case, `estimator` would have to be a base naive Bayes estimator wrapped together with a column selector, e.g. a `GaussianNB` + a selector of float columns. In fact, we could have redefined all `_BaseNB` to add this new parameter for column selection, whose default value selects all columns of `X`. The advantage is that the subestimators' column subsets would then be treated as separate hyperparameters for the purpose of `GridSearch`. We could write `param_grid = {"clf__gnb1__columns": [[5], [5, 6]]}` instead of something much more verbose, as sketched below.
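For concreteness, here is a hedged sketch of the contrast; the grid keys, the `columns` parameter of the hypothetical wrapped subestimator, and the `nb_estimators` name are assumptions based on this PR's description:

```python
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# VotingClassifier-style design (hypothetical): the column subset is an
# ordinary hyperparameter of the wrapped subestimator, so the grid is compact.
param_grid_compact = {"clf__gnb1__columns": [[5], [5, 6]]}

# ColumnTransformer-style design of this PR: the whole list of
# (name, estimator, columns) tuples has to be replaced at once.
param_grid_verbose = {
    "clf__nb_estimators": [
        [("gnb1", GaussianNB(), [5]), ("bnb1", BernoulliNB(), [0, 1])],
        [("gnb1", GaussianNB(), [5, 6]), ("bnb1", BernoulliNB(), [0, 1])],
    ]
}
```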
Updated on Monday, April 11, 2022 05:30:23 UTC:
`ColumnwiseNB` displays subestimators in parallel, just like `ColumnTransformer`.

Updated on Thursday, July 7, 2022 23:24:45 UTC:
Changes related to `_validate_params` #23462 are made. A custom replacement for `test_common.py::test_check_param_validation` is implemented in `test_naive_bayes.py::test_cwnb_check_param_validation`. It is needed because `utils.estimator_checks._construct_instance` does not know how to create an instance of ColumnwiseNB, which leads to `test_common.py` skipping this test for ColumnwiseNB (without a notification by `pytest`; a message is displayed only when the test is called directly). ColumnTransformer, Pipeline, and a handful of other estimators suffer from this problem too.

Updated on Friday, October 7, 2022 16:26:00 UTC:
Why is `ColumnTransformer` not enough? This question was asked at a Discord voice meeting. First, `ColumnTransformer` is a transformer, not a classifier. Second, `ColumnTransformer` simply concatenates the transformed columns, whereas naive Bayes requires additional calculations. Importantly, these additional calculations are done not on the subestimators' standard output (`predict`, `predict_proba`, `predict_log_proba`), but on the intermediate quantities (`predict_joint_log_proba`, formerly known as `_joint_log_likelihood`), see this discussion. This is why `ColumnTransformer` cannot be used even with additional logic wrapped on top of it.
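To make this concrete, here is a hedged sketch (not code from this PR) of the kind of computation described above, performed manually with two independently fitted naive Bayes estimators and their `predict_joint_log_proba` outputs; the data and column split are made up:

```python
# Manual version of the naive Bayes combination over column subsets.
# ColumnTransformer cannot produce this, because it only concatenates
# transform outputs and never sees the joint log probabilities.
import numpy as np
from scipy.special import logsumexp
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.RandomState(0)
X_cont = rng.normal(size=(200, 2))        # continuous columns
X_bin = rng.randint(0, 2, size=(200, 3))  # binary columns
y = rng.randint(0, 2, size=200)

gnb = GaussianNB().fit(X_cont, y)
bnb = BernoulliNB().fit(X_bin, y)

# Per-subestimator joint log probabilities Log P(x_i, y).
jll = gnb.predict_joint_log_proba(X_cont) + bnb.predict_joint_log_proba(X_bin)

# Subtract (N - 1) copies of the class log prior (N = 2 subestimators);
# both subestimators fit their priors as relative frequencies, so they agree.
log_prior = np.log(np.bincount(y) / len(y))
jll -= (2 - 1) * log_prior

# Normalize to posterior probabilities, as predict_proba would.
log_posterior = jll - logsumexp(jll, axis=1, keepdims=True)
y_pred = np.argmax(log_posterior, axis=1)
```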