Description
Describe the workflow you want to enable
There is an ongoing discussion in #22574 about introducing a new estimator named ColumnwiseNB, which handles mixed feature types by applying a different Naive Bayes model to each group of columns. This approach is promising for datasets that mix categorical, binary, and continuous variables, since each type calls for a different probability model.
# ColumnwiseNB is the estimator proposed in #22574, so no import is shown for it.
from sklearn.naive_bayes import BernoulliNB, GaussianNB, CategoricalNB
clf = ColumnwiseNB(nb_estimators=[('gnb1', GaussianNB(), [0, 1]),
                                  ('bnb2', BernoulliNB(), [2]),
                                  ('cnb1', CategoricalNB(), [3, 4])])
clf.fit(X_train, y_train)
clf.predict(X_test)
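For readers who want to see what "column-wise" means in practice, here is a minimal sketch that uses only estimators already in scikit-learn. It is not the implementation discussed in #22574. The helper names `fit_columnwise` and `predict_columnwise` are hypothetical, the data is synthetic, and the sketch assumes every sub-estimator uses the default empirical class prior. Each sub-estimator's `predict_log_proba` already includes the class prior, so summing k of them counts the prior k times; the extra k-1 copies are subtracted before taking the argmax.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB


def fit_columnwise(X, y, spec):
    """Fit one NB model per column group. `spec` is a list of (estimator, columns)."""
    classes, counts = np.unique(y, return_counts=True)
    log_prior = np.log(counts / counts.sum())  # empirical class prior
    fitted = [(est.fit(X[:, cols], y), cols) for est, cols in spec]
    return classes, log_prior, fitted


def predict_columnwise(X, classes, log_prior, fitted):
    # Each term equals log P(y) + log P(x_cols | y) up to a per-row constant.
    total = sum(est.predict_log_proba(X[:, cols]) for est, cols in fitted)
    # Remove the k-1 surplus copies of the prior.
    total = total - (len(fitted) - 1) * log_prior
    return classes[np.argmax(total, axis=1)]


# Synthetic data: two continuous, one binary, two categorical columns.
rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.normal(loc=5.0 * y, scale=1.0),           # continuous
    rng.normal(loc=-5.0 * y, scale=1.0),          # continuous
    rng.random(n) < np.where(y == 1, 0.8, 0.2),   # binary
    rng.integers(0, 3, size=n),                   # categorical, 3 levels
    rng.integers(0, 3, size=n),                   # categorical, 3 levels
]).astype(float)

spec = [
    (GaussianNB(), [0, 1]),
    (BernoulliNB(), [2]),
    (CategoricalNB(min_categories=3), [3, 4]),
]
classes, log_prior, fitted = fit_columnwise(X, y, spec)
pred = predict_columnwise(X, classes, log_prior, fitted)
accuracy = float((pred == y).mean())
```

The sketch predicts on the training data only to stay short; a real estimator would also need input validation, `predict_proba` normalisation, and handling of user-supplied priors.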
Describe your proposed solution
While scikit-learn is considering ColumnwiseNB as a potential addition, I have been developing a similar estimator, GeneralNB, in the wnb Python package for some time. This class also supports a different distribution for each feature, which makes it flexible enough to handle a variety of data types within a Naive Bayes framework. I would like to introduce the community to this already-implemented solution and gather feedback, comments, and suggestions. Understanding whether GeneralNB could serve as an alternative or a complement to ColumnwiseNB could benefit both scikit-learn developers and users looking for more advanced Naive Bayes functionality.
from wnb import GeneralNB, Distribution as D
gnb = GeneralNB(
    distributions=[D.NORMAL, D.NORMAL, D.BERNOULLI, D.CATEGORICAL, D.CATEGORICAL])
gnb.fit(X_train, y_train)
gnb.predict(X_test)
This solution fully adheres to scikit-learn's API and, at the time of writing, supports the following continuous and discrete distributions:
- Normal
- Lognormal
- Exponential
- Uniform
- Pareto
- Gamma
- Beta
- Chi-squared
- T
- Rayleigh
- Bernoulli
- Categorical
- Geometric
- Poisson
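To illustrate the general idea of assigning one distribution per feature, here is a rough sketch built on `scipy.stats`. It is not wnb's implementation and does not use its API. The helper names (`fit_params`, `log_likelihood`, `fit`, `predict`) and the string labels for the distributions are hypothetical, and only three of the listed distributions are covered. For each class, every feature gets its own fitted distribution, and the class score is the log prior plus the sum of per-feature log-likelihoods.

```python
import numpy as np
from scipy import stats


def fit_params(x, kind):
    """Simple parameter estimates for one feature within one class."""
    if kind == "normal":
        return {"loc": x.mean(), "scale": x.std() + 1e-9}
    if kind == "exponential":
        return {"scale": x.mean()}
    if kind == "bernoulli":
        # Laplace smoothing keeps p strictly inside (0, 1).
        return {"p": (x.sum() + 1.0) / (len(x) + 2.0)}
    raise ValueError(f"unsupported distribution: {kind}")


def log_likelihood(x, kind, params):
    if kind == "normal":
        return stats.norm.logpdf(x, **params)
    if kind == "exponential":
        return stats.expon.logpdf(x, **params)
    return stats.bernoulli.logpmf(x, **params)


def fit(X, y, kinds):
    classes = np.unique(y)
    model = {}
    for c in classes:
        Xc = X[y == c]
        log_prior = np.log(len(Xc) / len(X))
        params = [fit_params(Xc[:, j], k) for j, k in enumerate(kinds)]
        model[c] = (log_prior, params)
    return classes, model


def predict(X, classes, model, kinds):
    # One column of scores per class: log prior + sum of feature log-likelihoods.
    scores = np.column_stack([
        model[c][0]
        + sum(log_likelihood(X[:, j], k, model[c][1][j]) for j, k in enumerate(kinds))
        for c in classes
    ])
    return classes[scores.argmax(axis=1)]


# Synthetic data with one normal, one exponential, and one Bernoulli feature.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 200)
X = np.column_stack([
    rng.normal(loc=4.0 * y, scale=1.0),
    rng.exponential(scale=np.where(y == 1, 5.0, 1.0)),
    (rng.random(len(y)) < np.where(y == 1, 0.8, 0.2)).astype(float),
])

kinds = ["normal", "exponential", "bernoulli"]
classes, model = fit(X, y, kinds)
pred = predict(X, classes, model, kinds)
accuracy = float((pred == y).mean())
```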
I encourage community feedback on this implementation and am open to collaborating to integrate similar functionality into scikit-learn if deemed beneficial.
Describe alternatives you've considered, if relevant
No response
Additional context
No response