@@ -224,6 +224,40 @@ It is advisable to evaluate both models, if time permits.
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.5542>`_
3rd Conf. on Email and Anti-Spam (CEAS).

+ .. _categorical_naive_bayes:
+
+ Categorical Naive Bayes
+ -----------------------
+
+ :class:`CategoricalNB` implements the categorical naive Bayes
+ algorithm for categorically distributed data. It assumes that each feature,
+ indexed by :math:`i`, has its own categorical distribution.
+
+ For each feature :math:`i` in the training set :math:`X`,
+ :class:`CategoricalNB` estimates a categorical distribution conditioned on
+ the class :math:`y`. The index set of the samples is defined as
+ :math:`J = \{ 1, \dots, m \}`, where :math:`m` is the number of samples.
+
+ The probability of category :math:`t` in feature :math:`i` given class
+ :math:`c` is estimated as:
+
+ .. math::
+
+     P(x_i = t \mid y = c \: ;\, \alpha) = \frac{N_{tic} + \alpha}{N_{c} + \alpha n_i},
+
+ where :math:`N_{tic} = |\{j \in J \mid x_{ij} = t, y_j = c\}|` is the number
+ of times category :math:`t` appears in feature :math:`i` among the samples
+ that belong to class :math:`c`, :math:`N_{c} = |\{ j \in J\mid y_j = c\}|` is
+ the number of samples with class :math:`c`, :math:`\alpha` is a smoothing
+ parameter and :math:`n_i` is the number of available categories of feature
+ :math:`i`.
+
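As a rough illustration of the estimate above, the following sketch computes the smoothed probability for a single feature in plain Python. The helper name ``categorical_likelihood`` and the toy data are assumptions made up for this example; this is not how :class:`CategoricalNB` is implemented internally.

```python
def categorical_likelihood(x_i, y, t, c, n_i, alpha=1.0):
    """Smoothed estimate of P(x_i = t | y = c; alpha) for one feature i.

    x_i   : category codes of feature i, one per sample (values 0..n_i - 1)
    y     : class labels, one per sample
    n_i   : number of available categories of feature i
    alpha : smoothing parameter
    """
    # N_c: number of samples with class c
    n_c = sum(1 for label in y if label == c)
    # N_tic: number of samples of class c whose feature i equals t
    n_tic = sum(1 for value, label in zip(x_i, y) if value == t and label == c)
    return (n_tic + alpha) / (n_c + alpha * n_i)


# Toy data (hypothetical): one feature with n_i = 3 categories, two classes.
x_i = [0, 1, 1, 2, 1, 0]
y = [0, 0, 0, 1, 1, 1]

# Class 0 has N_c = 3 samples, two of which have category 1, so N_tic = 2.
p = categorical_likelihood(x_i, y, t=1, c=0, n_i=3, alpha=1.0)
print(p)  # (2 + 1) / (3 + 1 * 3) = 0.5
```

Because :math:`\alpha n_i` is added to the denominator, the estimates for a fixed class sum to one over all :math:`n_i` categories, and a category never seen with that class still receives a non-zero probability when :math:`\alpha > 0`.
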
+ :class:`CategoricalNB` assumes that the sample matrix :math:`X` is encoded
+ (for instance with the help of :class:`OrdinalEncoder`) such that all
+ categories for each feature :math:`i` are represented with numbers
+ :math:`0, \dots, n_i - 1`, where :math:`n_i` is the number of available
+ categories of feature :math:`i`.
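
To make the expected encoding concrete, here is a minimal pure-Python sketch that maps the categories of each feature to the codes :math:`0, \dots, n_i - 1`. The helper ``ordinal_encode`` and the sample data are hypothetical and only stand in for an encoder such as :class:`OrdinalEncoder`; the sketch assumes the categories of each feature can be sorted.

```python
def ordinal_encode(X):
    """Map each column's categories to integer codes 0, ..., n_i - 1."""
    n_features = len(X[0])
    # One sorted category list per feature; the position in the list is the code.
    categories = [sorted({row[i] for row in X}) for i in range(n_features)]
    lookup = [{cat: code for code, cat in enumerate(cats)} for cats in categories]
    encoded = [[lookup[i][row[i]] for i in range(n_features)] for row in X]
    return encoded, categories


# Hypothetical samples with two categorical features (colour, size).
X = [
    ["red", "small"],
    ["green", "large"],
    ["red", "medium"],
    ["blue", "small"],
]

X_encoded, categories = ordinal_encode(X)
print(categories)  # [['blue', 'green', 'red'], ['large', 'medium', 'small']]
print(X_encoded)   # [[2, 2], [1, 0], [2, 1], [0, 2]]
```

Each column of ``X_encoded`` then contains only integers between 0 and :math:`n_i - 1`, which is the layout described above.
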

Out-of-core naive Bayes model fitting
-------------------------------------