
UNIT V:

Probabilistic models: The normal distribution and its geometric interpretations,


Probabilistic models for categorical data, Discriminative learning by optimizing
conditional likelihood.
Features: Kinds of feature, Feature transformations, Feature construction and selection.
Model ensembles: Bagging and random forests, Boosting,
-------------------------------------------------------------------------------------------------------------
PROBABILISTIC MODELS
Models that output the posterior probability distribution P(Y|X), where Y is the target variable and X are the features, are called discriminative probabilistic models. That is, given X they return a probability distribution over Y.
The other main class of probabilistic models consists of generative models. They model the joint distribution P(Y, X) of the target Y and the feature vector X. Once we have access to this joint distribution we can derive any conditional or marginal distribution involving the same variables. In particular, since P(X) = Σy P(Y = y, X), it follows that the posterior distribution can be obtained as P(Y|X) = P(Y, X) / Σy P(Y = y, X). Alternatively, generative models can be described by the likelihood function P(X|Y), since P(Y, X) = P(X|Y) P(Y), and the target or prior distribution P(Y) can be easily estimated or postulated.
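The following minimal Python sketch (with made-up numbers) illustrates this relationship: starting from a joint distribution P(Y, X) we recover the marginal P(X), the posterior P(Y|X), the prior P(Y) and the likelihood P(X|Y).

```python
import numpy as np

# A joint distribution P(Y, X) over a binary target Y (rows) and a
# three-valued feature X (columns); the numbers are purely illustrative.
joint = np.array([[0.10, 0.20, 0.10],   # P(Y=0, X=x)
                  [0.05, 0.15, 0.40]])  # P(Y=1, X=x)

marginal_x = joint.sum(axis=0)          # P(X=x) = sum_y P(Y=y, X=x)
posterior = joint / marginal_x          # P(Y=y | X=x) = P(Y=y, X=x) / P(X=x)

prior_y = joint.sum(axis=1)             # P(Y=y)
likelihood = joint / prior_y[:, None]   # P(X=x | Y=y) = P(Y=y, X=x) / P(Y=y)

print(posterior)   # each column sums to 1
print(likelihood)  # each row sums to 1
```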
THE NORMAL DISTRIBUTION AND ITS GEOMETRIC INTERPRETATIONS
We can draw a connection between probabilistic and geometric models by considering
probability distributions defined over Euclidean spaces. The most common such
distributions are normal distributions, also called Gaussians; we start by considering the
univariate, two-class case. Suppose the values of x ∈ R follow a mixture model: i.e., each class has its own probability distribution (a component of the mixture model). We will assume a Gaussian mixture model, which means that the components of the mixture are both Gaussians. We thus have

P(x|⊕) = (1/(√(2π) σ⊕)) exp(−(x − μ⊕)² / (2σ⊕²))          P(x|⊖) = (1/(√(2π) σ⊖)) exp(−(x − μ⊖)² / (2σ⊖²))

where μ⊕ and σ⊕ are the mean and standard deviation for the positive class, and μ⊖ and σ⊖ those for the negative class.
Figure 9.3. (left) If the features are uncorrelated and have the same variance, maximum likelihood
classification leads to the basic linear classifier, whose decision boundary is orthogonal to the line
connecting the means. (middle) As long as the per-class covariance matrices are identical, the Bayes-
optimal decision boundary is linear – if we were to decorrelate the features by rotation and scaling, we
would again obtain the basic linear classifier. (right) Unequal covariance matrices lead to hyperbolic
decision boundaries, which means that one of the decision regions is non-contiguous.
Notice the circles and ellipses in Figure 9.3, which provide a visual summary of the covariance matrix. By projecting the shape for the positive class down to the x-axis we obtain the interval [μx − σx, μx + σx] – i.e., one standard deviation around the mean – and similarly for the negative class and the y-axis. Three cases can be distinguished:
(i) both x and y standard deviations are equal and the correlation coefficient is zero, in
which case the shape is a circle; (ii) the standard deviations are different and the
correlation coefficient is zero, which means the shape is an ellipse parallel to the axis with
the largest standard deviation; (iii) the correlation coefficient is non-zero: the orientation
of the ellipse gives the sign of the correlation coefficient, and its width varies with the
magnitude of the correlation coefficient.
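As a rough illustration of maximum-likelihood classification with class-conditional Gaussians, the sketch below (on synthetic data, using scipy's multivariate_normal) fits one Gaussian per class with a shared covariance matrix and predicts the class with the higher density; with equal covariances this reduces to the basic linear classifier described above.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Hypothetical two-class data with uncorrelated features of equal variance.
X_pos = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))
X_neg = rng.normal(loc=[-2.0, -1.0], scale=1.0, size=(50, 2))

mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
# Shared covariance estimated from the pooled, centred data.
pooled = np.vstack([X_pos - mu_pos, X_neg - mu_neg])
cov = np.cov(pooled, rowvar=False)

def ml_classify(x):
    """Maximum likelihood: pick the class whose Gaussian assigns the higher density."""
    p_pos = multivariate_normal.pdf(x, mean=mu_pos, cov=cov)
    p_neg = multivariate_normal.pdf(x, mean=mu_neg, cov=cov)
    return +1 if p_pos > p_neg else -1

# With identity covariance the decision boundary is the basic linear classifier:
# w = mu_pos - mu_neg, t = (||mu_pos||^2 - ||mu_neg||^2) / 2.
print(ml_classify([0.5, 0.5]))
```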
PROBABILISTIC MODELS FOR CATEGORICAL DATA
Categorical variables or features are everywhere in machine learning. Perhaps the most common use of the Bernoulli distribution is to model whether or not a word occurs in a document. That is, for the i-th word in our vocabulary we have a random variable Xi governed by a Bernoulli distribution. The joint distribution over the bit vector X = (X1, . . . , Xk) is
called a multivariate Bernoulli distribution. Variables with more than two outcomes are
also common: for example, every word position in an e-mail corresponds to a categorical
variable with k outcomes, where k is the size of the vocabulary. The multinomial distribution then manifests itself as a count vector: a histogram of the number of occurrences of all vocabulary words in a document. This establishes an alternative way of modeling text documents that allows the number of occurrences of a word to influence the classification of a document.
Both these document models are in common use. Despite their differences, they both
assume independence between word occurrences, generally referred to as the naive Bayes
assumption. In the multinomial document model, this follows from the very use of the
multinomial distribution, which assumes that words at different word positions are drawn
independently from the same categorical distribution.
In the multivariate Bernoulli model we assume that the bits in a bit vector are statistically
independent, which allows us to compute the joint probability of a particular bit vector (x1,
. . . , xk) as the product of the probabilities of each component P(Xi = xi). In any case, the experience is that, while the naive Bayes assumption almost certainly leads to poor probability estimates, it often doesn't harm ranking performance. This means that, provided the classification threshold is chosen with some care, we can usually get good classification performance too.
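The following small sketch (with a toy vocabulary) contrasts the two document representations: a bit vector for the multivariate Bernoulli model and a count vector for the multinomial model.

```python
from collections import Counter

vocabulary = ["machine", "learning", "write", "they", "about"]  # toy vocabulary
doc = "they write about machine learning about machine".split()

# Multivariate Bernoulli representation: one bit per vocabulary word.
bernoulli = [1 if word in doc else 0 for word in vocabulary]

# Multinomial representation: a count vector (histogram of word occurrences).
counts = Counter(doc)
multinomial = [counts[word] for word in vocabulary]

print(bernoulli)    # [1, 1, 1, 1, 1]
print(multinomial)  # [2, 1, 1, 1, 2]
```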
Using a naive Bayes model for classification
Assume that we have chosen one of the possible distributions to model our data X. In a
classification context, we furthermore assume that the distribution depends on the class, so
that P(X|Y = spam) and P(X|Y = ham) are different distributions. The more different these
two distributions are, the more useful the features X are for classification. Thus, for a
specific e-mail x we calculate both P(X = x|Y = spam) and P(X = x|Y = ham), and apply
one of several possible decision rules:
maximum likelihood (ML) – predict argmaxy P(X = x|Y = y);
maximum a posteriori (MAP) – predict argmaxy P(X = x|Y = y)P(Y = y);
recalibrated likelihood – predict argmaxy wyP(X = x|Y = y).
The relation between the first two decision rules is that ML classification is equivalent to
MAP classification with a uniform class distribution. The third decision rule generalizes
the first two in that it replaces the class distribution with a set of weights learned from the
data: this makes it possible to correct for estimation errors in the likelihoods.
This may not seem such a big deal if we are only interested in classification, and not in the
probability estimates as such. However, an often overlooked consequence of having
uncalibrated probability estimates such as those produced by naive Bayes is that both the
ML and MAP decision rules become inadequate. Unless we have evidence that the model
assumptions are satisfied, the only sensible thing to do in this case is to invoke the
recalibrated likelihood decision rule, which requires one to learn a weight vector over the
classes, in order to correct for the estimation errors in the likelihoods. Specifically, we
want to find weights wi such that predicting argmaxy wyP(X = x|Y = y) results in the
smallest possible loss e.g., the number of misclassified examples over a test set. For two
classes this can be solved by the procedure for turning rankers into classifiers. To see this, notice that for two classes the recalibrated likelihood decision rule can be rewritten as: predict positive if

P(X = x|Y = ⊕) / P(X = x|Y = ⊖) > w⊖/w⊕

i.e., the likelihood ratio – which is exactly the score by which a naive Bayes ranker orders instances – is compared against a threshold determined by the learned weights.

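A minimal sketch of the three decision rules, using hypothetical likelihoods, priors and recalibration weights (all numbers are illustrative):

```python
import numpy as np

# Hypothetical class-conditional likelihoods for one e-mail x, a class prior,
# and recalibration weights learned from validation data.
classes = ["spam", "ham"]
likelihood = np.array([2e-9, 5e-9])   # P(X = x | Y = y)
prior      = np.array([0.3, 0.7])     # P(Y = y)
weights    = np.array([0.8, 0.2])     # w_y, learned to minimise loss

ml    = classes[np.argmax(likelihood)]              # maximum likelihood
map_  = classes[np.argmax(likelihood * prior)]      # maximum a posteriori
recal = classes[np.argmax(weights * likelihood)]    # recalibrated likelihood

print(ml, map_, recal)
```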
Training a naive Bayes model


Training a probabilistic model usually involves estimating the parameters of the
distributions used in the model. The parameter of a Bernoulli distribution can be estimated
by counting the number of successes d in n trials and setting ˆθ = d/n. In other words, we
count, for each class, how many e-mails contain the word in question.
Such relative frequency estimates are usually smoothed by including pseudocounts,
representing the outcome of virtual trials according to some fixed distributions. In the case
of a Bernoulli distribution the most common smoothing operation is the Laplace
correction, which involves two virtual trials, one of which results in success and the other
in failure. Consequently, the relative frequency estimate is changed to (d +1)/(n +2). From
a Bayesian perspective this amounts to adopting a uniform prior, representing our initial
belief that success and failure are equally likely. If appropriate, we can strengthen the
influence of the prior by including a larger number of virtual trials, which means that
more data is needed to move the estimate away from the prior.
For a categorical distribution smoothing adds one pseudo-count to each of the k
categories, leading to the smoothed estimate (d +1)/(n +k). The m-estimate generalizes
this further by making both the total number of pseudo-counts m and the way they are
distributed over the categories into parameters. The estimate for the ith category is defined
as (d +pim)/(n+m), where pi is a distribution over the categories. Notice that smoothed
relative frequency estimates – and hence products of such estimates – can never attain the
extreme values ˆθ =0 or ˆθ = 1.
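The smoothing formulas above translate directly into code; a small sketch:

```python
def laplace(d, n):
    """Laplace-corrected estimate for a Bernoulli parameter: (d + 1) / (n + 2)."""
    return (d + 1) / (n + 2)

def categorical_smoothed(d, n, k):
    """Add one pseudo-count to each of k categories: (d + 1) / (n + k)."""
    return (d + 1) / (n + k)

def m_estimate(d, n, m, p_i):
    """m-estimate: m pseudo-counts distributed over the categories according to p_i."""
    return (d + p_i * m) / (n + m)

print(laplace(0, 10))                  # 1/12 rather than 0
print(categorical_smoothed(3, 10, 5))  # 4/15
print(m_estimate(3, 10, 2, 0.5))       # (3 + 1) / 12
```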
DISCRIMINATIVE LEARNING BY OPTIMISING CONDITIONAL
LIKELIHOOD
Naive Bayes models are generative: after training they can be used to generate data. The most commonly used discriminative model is logistic regression. The easiest way to
understand logistic regression is as a linear classifier whose probability estimates have
been logistically calibrated, but with one crucial difference: calibration is an integral part
of the training algorithm, rather than a post-processing step. While in generative models
the decision boundary is a by-product of modeling the distributions of each class, logistic
regression models the decision boundary directly.
For example, if the classes are overlapping then logistic regression will tend to locate the
decision boundary in an area where classes are maximally overlapping, regardless of the
‘shapes’ of the samples of each class. This results in decision boundaries that are
noticeably different from those learned by generative classifiers (Figure 9.6).
The likelihood ratio is exp(γ(d(x)−d0)) with d(x) = w· x − t . Since we are learning the
parameters all at once in discriminative learning, we can absorb γ and d0 into w and t. So
the logistic regression model is simply given by

ˆp(x) = P(⊕|x) = exp(w·x − t) / (1 + exp(w·x − t))

Training the model amounts to choosing w and t so as to maximise the probability of the observed labels given the observed feature vectors,

CL(w, t) = ∏i P(yi|xi)
This is called conditional likelihood to stress that it gives us the conditional probability
P(yi |xi ) rather than P(xi ) as in a generative model. Notice that our use of the product
requires the assumption that the y-values are independent given x; but this is an entirely
reasonable assumption and not nearly as strong as the naive Bayes assumption of x being
independent within each class. As usual, the logarithm of the likelihood function is easier
to work with:

LCL(w, t) = Σi ln P(yi|xi) = Σx∈⊕ ln ˆp(x) + Σx∈⊖ ln(1 − ˆp(x))

where the first sum ranges over the positive training examples and the second over the negative ones.
We want to maximise the log-conditional likelihood with respect to these parameters, which means that all partial derivatives must be zero:

∂LCL/∂wj = 0 for every component wj of w,          ∂LCL/∂t = 0
Although these equations do not yield an analytic solution, they can be used to obtain
further insight into the nature of logistic regression. Concentrating on t, we first need to do some algebraic groundwork. For a positive example x we have

∂ ln ˆp(x)/∂t = (1/ˆp(x)) ∂ˆp(x)/∂t = (1/ˆp(x)) (−ˆp(x)(1 − ˆp(x))) = −(1 − ˆp(x))
Similarly for the negatives:

∂ ln(1 − ˆp(x))/∂t = −(1/(1 − ˆp(x))) ∂ˆp(x)/∂t = ˆp(x)
It follows that the partial derivative of LCL with respect to t has a simple form:

∂LCL/∂t = −Σx∈⊕ (1 − ˆp(x)) + Σx∈⊖ ˆp(x) = Σx ˆp(x) − Pos

where Pos is the number of positive examples.
For the optimal solution this partial derivative is zero. What this means is that, on average,
the predicted probability should be equal to the proportion of positives pos. This is a
satisfying result, as it is clearly a desirable global property of a calibrated classifier.
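A rough sketch of logistic regression trained by gradient ascent on the log-conditional likelihood, on synthetic data; it also checks the calibration property just derived (the average predicted probability approaches the proportion of positives). The learning rate and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy two-class data with labels y in {0, 1}; all numbers are illustrative.
X = np.vstack([rng.normal(1.0, 1.0, (50, 2)), rng.normal(-1.0, 1.0, (50, 2))])
y = np.concatenate([np.ones(50), np.zeros(50)])

w, t, lr = np.zeros(2), 0.0, 0.1

for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(X @ w - t)))   # predicted P(y = 1 | x)
    w += lr * X.T @ (y - p) / len(y)         # gradient ascent on LCL w.r.t. w
    t += lr * np.mean(p - y)                 # gradient ascent on LCL w.r.t. t

p = 1.0 / (1.0 + np.exp(-(X @ w - t)))
print(p.mean(), y.mean())  # approximately equal at the optimum (calibration property)
```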

FEATURES
KINDS OF FEATURE:
Consider two integer-valued features, one describing a person's age and the other their house number. Even though both are integers, the way we use those features can be quite different. Calculating the average age of a group of people is meaningful, but an average house number is probably not very useful. Whether such calculations are meaningful depends on whether the feature values are expressed on a meaningful scale.
House numbers are not really integers but ordinals: we can use them to determine that
number 10’s neighbours are number 8 and number 12, but we cannot assume that the
distance between 8 and 10 is the same as the distance between 10 and 12. Because of the
absence of a linear scale it is not meaningful to add or subtract house numbers, which
excludes operations such as averaging.
Calculations on features:
Three main categories are statistics of central tendency, statistics of dispersion and shape
statistics. The most important statistics of central tendency are
 the mean or average value; to calculate the mean we need a feature expressed on some linear scale;
 the median, which is the middle value if we order the instances from lowest to highest feature value; the median tends to lie between the mode and the mean, with few exceptions to this rule;
 the mode, which is the majority value or values, and which we can calculate whatever the domain of the feature.
Two well-known statistics of dispersion are the variance or average squared deviation
from the mean, and its square root, the standard deviation.
 Variance and standard deviation essentially measure the same thing, but the latter has
the advantage that it is expressed on the same scale as the feature itself. For example,
the variance of the body weight in kilograms of a group of people is measured in kg 2,
whereas the standard deviation is measured in kilograms.
 A simpler dispersion statistic is the difference between maximum and minimum value,
which is called the range. A midrange point is the mean of the two extreme values.
 Other statistics of dispersion include percentiles. The p-th percentile is the value such that p per cent of the instances fall below it. If p is a multiple of 25 the percentiles are also called quartiles, and if it is a multiple of 10 they are also called deciles. Note that the 50th percentile, the 5th decile and the second quartile are all the same as the median. Percentiles, deciles and quartiles are special cases of quantiles. Once we have quantiles we can measure dispersion as the distance between different quantiles. For instance, the interquartile range is the difference between the third and first quartile (i.e., the 75th and 25th percentile).
The skew and peakedness of a distribution can be measured by shape statistics such as
skewness and kurtosis.
 The main idea is to calculate the third and fourth central moment of the sample. In general, the k-th central moment of a sample {x1, . . . , xn} is defined as

mk = (1/n) Σi=1..n (xi − μ)^k

where μ is the sample mean.
 Clearly, the first central moment is the average deviation from the mean; this is always
zero, as the positive and negative deviations cancel each other out; and the second
central moment is the average squared deviation from the mean, otherwise known as
the variance. The third central moment m3 can again be positive or negative.
 Skewness is then defined as m3/σ3, where σ is the sample’s standard deviation. A
positive value of skewness means that the distribution is right-skewed, which means
that the right tail is longer than the left tail. Negative skewness indicates the opposite,
left-skewed case.
 Kurtosis is defined as m4/σ4. As it can be shown that a normal distribution has kurtosis 3, people often use excess kurtosis m4/σ4 − 3 as the statistic of interest. Briefly, positive excess kurtosis means that the distribution is more sharply peaked than the normal distribution (a small computational sketch follows this list).
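A small sketch computing the central moments, skewness and excess kurtosis of a toy right-skewed sample:

```python
import numpy as np

def central_moment(x, k):
    """k-th central moment: the average of (x_i - mean)^k."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** k)

x = np.array([1, 2, 2, 3, 3, 3, 4, 10], dtype=float)  # a right-skewed toy sample

variance = central_moment(x, 2)
sigma = np.sqrt(variance)
skewness = central_moment(x, 3) / sigma**3        # > 0: the right tail is longer
excess_kurtosis = central_moment(x, 4) / sigma**4 - 3

print(variance, skewness, excess_kurtosis)
```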

Categorical, ordinal and quantitative features


 Features without ordering or scale are called categorical features. One subspecies of
the categorical features is the Boolean feature, which maps into the truth values true
and false.
 Features with an ordering but without scale are called ordinal features. The domain of
an ordinal feature is some totally ordered set, such as the set of characters or strings.
Example are features that express a rank order: first, second, third, and so on. Ordinal
features allow the mode and median as central tendency statistics, and quantiles as
dispersion statistics.
 Features with a meaningful numerical scale, are called quantitative; they most often
involve a mapping into the reals(continuous). Even if a feature maps into a subset of
the reals, such as age expressed in years, the various statistics such as mean or standard
deviation still require the full scale of the reals.
Structured features
 An instance is a vector of feature values. In other words, the instance space is a
Cartesian product of d feature domains: X = F1 × . . . × Fd. Identifying an instance with
its vector of feature values is an abstraction, which is the result of filtering out
unnecessary information.
 Representing an e-mail as a vector of word frequencies is an example of an abstraction.
However, sometimes it is necessary to avoid such abstractions, and to keep more
information about an instance than can be captured by a finite vector of feature values.
For example, we could represent an e-mail as a long string; or as a sequence of words
and punctuation marks; or as a tree that captures the HTML mark-up; and so on.
Features that operate on such structured instance spaces are called structured features.
 Structured features are not unlike queries in a database query language such as SQL or
a declarative programming language such as Prolog.
 Structured features can be constructed either prior to learning a model or
simultaneously with it. The first scenario is often called propositionalisation because
the features can be seen as a translation from first-order logic to propositional logic
without local variables.

FEATURE TRANSFORMATIONS
A feature transformation aims to improve the utility of a feature by removing, changing or adding information. Feature types can be ordered by the amount of detail they convey: quantitative features are more detailed than ordinal ones, followed by categorical features, and finally Boolean features.
The best-known feature transformations are those that turn a feature of one type into
another of the next type down this list. But there are also transformations that change the
scale of quantitative features, or add a scale to ordinal, categorical and Boolean features.
Table 10.2 introduces the terminology we will be using.

 Binarisation transforms a categorical feature into a set of Boolean features, one for each value of the categorical feature. This loses the information that the values of a single categorical feature are mutually exclusive, but is sometimes needed if a model cannot handle more than two feature values (a minimal sketch follows after this list).
 Unordering trivially turns an ordinal feature into a categorical one by discarding the
ordering of the feature values. This is often required since most learning models cannot
handle ordinal features directly. An interesting alternative that we will explore below is
to add a scale to the feature by means of calibration.
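A minimal sketch of binarisation for a hypothetical Colour feature, producing one Boolean feature per value:

```python
# Binarisation: one Boolean feature per value of the categorical feature Colour.
values = ["Red", "Green", "Blue"]                 # domain of the feature
instances = ["Green", "Red", "Green", "Blue"]

binarised = [[1 if v == value else 0 for value in values] for v in instances]
print(binarised)  # [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```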

Thresholding and discretisation


Thresholding transforms a quantitative or an ordinal feature into a Boolean feature by finding a feature value to split on. Unsupervised thresholding typically involves calculating some statistic over the data, whereas supervised thresholding requires sorting the data on the feature value and traversing this ordering to optimize a particular objective function such as information gain.
Discretisation transforms a quantitative feature into an ordinal feature. Each ordinal value is
referred to as a bin and corresponds to an interval of the original quantitative feature.
Again, we can distinguish between supervised and unsupervised approaches.
Unsupervised discretisation methods typically require one to decide the number of bins
beforehand. A simple method that often works reasonably well is to choose the bins so
that each bin has approximately the same number of instances: this is referred to as equal-
frequency discretisation.
If we choose two bins then this method coincides with thresholding on the median. More generally, the bin boundaries of equal-frequency discretisation are quantiles: for instance, with 10 bins the bin boundaries are deciles. Another unsupervised discretisation method is
equal-width discretisation, which chooses the bin boundaries so that each interval has the
same width. The interval width can be established by dividing the feature range by the
number of bins if the feature has upper and lower limits; alternatively, we can take the bin
boundaries at an integer number of standard deviations above and below the mean.
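A small sketch contrasting equal-frequency and equal-width discretisation on a synthetic feature, using numpy quantiles and digitize:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=2.0, size=200)   # a quantitative feature
n_bins = 4

# Equal-frequency discretisation: bin boundaries are quantiles
# (with 4 bins the boundaries are the quartiles).
freq_edges = np.quantile(x, [0.25, 0.50, 0.75])
freq_bins = np.digitize(x, freq_edges)

# Equal-width discretisation: divide the feature range into intervals of equal width.
width_edges = np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]
width_bins = np.digitize(x, width_edges)

print(np.bincount(freq_bins))   # roughly 50 instances per bin
print(np.bincount(width_bins))  # equal-width bins, unequal counts
```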
Switching now to supervised discretisation methods, we can distinguish between top–
down or divisive discretisation methods on the one hand, and bottom–up or agglomerative
discretisation methods on the other. Divisive methods work by progressively splitting
bins, whereas agglomerative methods proceed by initially assigning each instance to its
own bin and successively merging bins. In either case an important role is played by the
stopping criterion, which decides whether a further split or merge is worthwhile.
Normalization and calibration
Feature normalization is often required to neutralise the effect of different quantitative
features being measured on different scales. If the features are approximately normally
distributed, we can convert them into z-scores by centering on the mean and dividing by
the standard deviation. In certain cases it is mathematically more convenient to divide by
the variance instead. If we don’t want to assume normality we can centre on the median
and divide by the inter-quartile range.
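A minimal sketch of both normalisation options on a toy feature (z-scores versus the more robust median/interquartile-range variant):

```python
import numpy as np

x = np.array([50., 55., 60., 62., 70., 120.])  # e.g., body weight in kg

# z-scores: centre on the mean and divide by the standard deviation.
z = (x - x.mean()) / x.std()

# Robust alternative: centre on the median and divide by the interquartile range.
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print(z)
print(robust)
```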
Feature calibration is understood as a supervised feature transformation adding a
meaningful scale carrying class information to arbitrary features. This has a number of
important advantages. For instance, it allows models that require scale, such as linear
classifiers, to handle categorical and ordinal features. It also allows the learning algorithm
to choose whether to treat a feature as categorical, ordinal or quantitative.

Incomplete features
Consider what to do if we don’t know a feature’s value for some of the instances.
An example is how to classify an e-mail if we don't know whether or not it contains a particular vocabulary word. Probabilistic models handle this rather gracefully by taking a weighted average over all possible values of the feature:

P(Y|X) = Σz P(Y|X, Z = z) P(Z = z)
Here, Y is the target variable as usual, X stands for the features that are observed for the
instance to be classified, while Z are the features that are unobserved at classification
time. The distribution P(Z) can be obtained from the trained model, at least for a generative model; if our model is discriminative we need to estimate P(Z) separately.
Missing feature values at training time are trickier to handle. First of all, the very fact that
a feature value is missing may be correlated with the target variable. For example, the
range of medical tests carried out on a patient is likely to depend on their medical history.
For such features it may be best to have a designated ‘missing’ value so that, for instance,
a tree model can split on it. However, this would not work for, say, a linear model. In such
cases we can complete the feature by ‘filling in’ the missing values, a process known as
imputation.
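A small sketch of the two simplest options for a toy feature column with missing values: mean imputation, and a designated missing-value indicator:

```python
import numpy as np

# A toy feature column with missing values encoded as NaN.
x = np.array([3.0, np.nan, 5.0, 4.0, np.nan, 6.0])

# Mean imputation: fill in missing values with the mean of the observed ones.
filled_mean = np.where(np.isnan(x), np.nanmean(x), x)

# For a tree model, a designated 'missing' indicator can be kept instead.
missing_indicator = np.isnan(x).astype(int)

print(filled_mean)
print(missing_indicator)
```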

FEATURE CONSTRUCTION AND SELECTION


Feature construction means constructing new features from several original features. A simple example of this can be used to improve the naive Bayes classifier. Remember that in text classification applications we have a feature for every word in the vocabulary; this representation ignores not only the order of words but also their adjacency. This means that sentences
such as ‘they write about machine learning’ and ‘they are learning to write about a
machine’ will be virtually indistinguishable, even though the former is about machine
learning and the latter is not. In the information retrieval literature, a sequence of n consecutive words is referred to as an n-gram (unigram, bigram, trigram and so on).
Taking this idea one step further, we can construct a new feature from two Boolean or
categorical features by forming their Cartesian product. For example, if we have one
feature Shape with values Circle, Triangle and Square, and another feature Colour with
values Red, Green and Blue, then their Cartesian product would be the feature (Shape,
Colour) with values (Circle, Red), (Circle, Green), (Circle, Blue), (Triangle, Red), and so
on. The effect that this would have depends on the model being trained. Constructing Cartesian product features for a naive Bayes classifier means that the two original features are no longer treated as independent, and so this reduces the strong bias that naive Bayes models have. This is not the case for tree models, which can already distinguish
between all possible pairs of feature values. On the other hand, a newly introduced
Cartesian product feature may incur a high information gain, so it can possibly affect the
model learned.
Once we have constructed new features it is often a good idea to select a suitable subset of
them prior to learning. There are two main approaches to feature selection. The filter
approach scores features on a particular metric and the top-scoring features are selected.
An interesting variation is provided by the Relief feature selection method, which repeatedly samples a random instance x and finds its nearest hit h (an instance of the same class) as well as its nearest miss m (an instance of the opposite class). The i-th feature's score is then decreased by Dis(xi, hi)² and increased by Dis(xi, mi)², where Dis is some distance measure (e.g., Euclidean distance for quantitative features, Hamming distance for categorical features). The intuition is that we want to move closer to the nearest hit while differentiating from the nearest miss.
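A rough sketch of Relief-style scoring on synthetic data, where only the first feature carries class information; the implementation details (number of samples, Euclidean distances, per-feature squared differences) follow the description above but are otherwise illustrative.

```python
import numpy as np

def relief_scores(X, y, n_samples=100, seed=0):
    """A sketch of the Relief filter: per-feature scores from nearest hits and misses."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                                  # exclude the instance itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dists, np.inf))     # nearest instance of same class
        miss = np.argmin(np.where(~same, dists, np.inf))   # nearest instance of other class
        scores -= (X[i] - X[hit]) ** 2                     # move closer to the hit
        scores += (X[i] - X[miss]) ** 2                    # move away from the miss
    return scores

rng = np.random.default_rng(3)
X = rng.normal(0, 1, (100, 2))
y = (X[:, 0] > 0).astype(int)   # only the first feature carries class information
print(relief_scores(X, y))      # the first feature should receive the higher score
```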
To detect features that are useful in the context of other features we need to evaluate sets
of features; this usually goes under the name of wrapper approaches. The idea is that
feature selection is ‘wrapped’ in a search procedure that usually involves training and
evaluating a model with a candidate set of features. Forward selection methods start with
an empty set of features and add features to the set one at a time, as long as they improve
the performance of the model. Backward elimination starts with the full set of features and
aims at improving performance by removing features one at a time.
Since there are an exponential number of subsets of features it is usually not feasible to
search all possible subsets, and most approaches apply a ‘greedy’ search algorithm that
never reconsiders the choices it makes.
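A minimal sketch of a wrapper-style forward selection loop, using cross-validated accuracy of a logistic regression model (via scikit-learn) as the evaluation; the data and stopping rule are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # only features 0 and 2 are relevant

def score(feature_idx):
    """Wrapper evaluation: cross-validated accuracy of a model on the candidate subset."""
    return cross_val_score(LogisticRegression(), X[:, feature_idx], y, cv=5).mean()

selected, remaining, best = [], list(range(X.shape[1])), 0.0
while remaining:
    gains = [(score(selected + [f]), f) for f in remaining]   # greedy: try each addition
    new_best, f = max(gains)
    if new_best <= best:          # stop when no feature improves performance
        break
    best = new_best
    selected.append(f)
    remaining.remove(f)

print(selected, best)
```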
MODEL ENSEMBLES:
Combinations of models are generally known as model ensembles. They are among the
most powerful techniques in machine learning, often outperforming other methods. This
comes at the cost of increased algorithmic and model complexity.
In essence, ensemble methods in machine learning have the following two things in
common:
 they construct multiple, diverse predictive models from adapted versions of the
training data (most often reweighted or re-sampled);
 they combine the predictions of these models in some way, often by simple averaging
or voting (possibly weighted).
Bagging and random forests
Bagging, short for ‘bootstrap aggregating’, is a simple but highly effective ensemble
method that creates diverse models on different random samples of the original data set.
These samples are taken uniformly with replacement and are known as bootstrap samples.
Because samples are taken with replacement the bootstrap sample will in general contain
duplicates, and hence some of the original data points will be missing even if the bootstrap
sample is of the same size as the original data set. To get an idea of how different
bootstrap samples might be, we can calculate the probability that a particular data point is not selected for a bootstrap sample of size n as (1 − 1/n)^n, which for n = 5 is about one-third and has limit 1/e ≈ 0.368 for n → ∞.
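A quick numerical check of this probability, plus a small simulation (the simulation details are illustrative):

```python
import numpy as np

# Probability that a given data point is *not* included in a bootstrap sample of size n.
for n in [5, 20, 100, 1000]:
    print(n, (1 - 1 / n) ** n)   # approaches 1/e ~ 0.368 as n grows

# Empirical check: fraction of points missing from simulated bootstrap samples.
rng = np.random.default_rng(5)
n = 100
missing = [len(set(range(n)) - set(rng.integers(0, n, n))) / n for _ in range(1000)]
print(np.mean(missing))          # about 0.37
```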
Algorithm 11.1 gives the basic bagging algorithm, which returns the ensemble as a set of models. We can choose to combine the predictions from the different models by voting – the class predicted by the majority of models wins – or by averaging, which is more appropriate if the base classifiers output scores or probabilities. An illustration is given in Figure 11.1, where five basic linear classifiers were trained on bootstrap samples from 20 positive
and 20 negative examples. We can clearly see the diversity of the five linear classifiers,
which is helped by the fact that the data set is quite small. The figure demonstrates the
difference between combining predictions through voting (Figure 11.1 (left)) and creating
a probabilistic classifier by averaging (Figure 11.1 (right)).

Figure 11.1. (left) An ensemble of five basic linear classifiers built from bootstrap samples with bagging.
The decision rule is majority vote, leading to a piecewise linear decision boundary. (right) If we turn the
votes into probabilities, we see the ensemble is effectively a grouping model: each instance space
segment obtains a slightly different probability.
With voting we see that bagging creates a piecewise linear decision boundary, something
that is impossible with a single linear classifier. If we transform the votes from each
model into probability estimates, we see that the different decision boundaries partition
the instance space into segments that can potentially each receive a different score.
Bagging is particularly useful in combination with tree models, which are quite sensitive
to variations in the training data. When applied to tree models, bagging is often combined
with another idea: to build each tree from a different random subset of the features, a
process also referred to as subspace sampling. This encourages the diversity in the
ensemble even more, and has the additional advantage that the training time of each tree is
reduced. The resulting ensemble method is called random forests, and the algorithm is
given in Algorithm 11.2.
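A rough sketch of bagging with per-tree subspace sampling, in the spirit of Algorithms 11.1 and 11.2, using scikit-learn decision trees as base models; function names and parameters are illustrative, and note that many random forest implementations sample features per split rather than per tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagged_forest(X, y, n_models=10, n_subspace=None, rng=None):
    """Bagging with optional subspace sampling (random-forest style), as a sketch."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    n_subspace = d if n_subspace is None else n_subspace
    ensemble = []
    for _ in range(n_models):
        idx = rng.integers(0, n, n)                            # bootstrap sample (with replacement)
        feats = rng.choice(d, size=n_subspace, replace=False)  # random feature subset
        tree = DecisionTreeClassifier().fit(X[idx][:, feats], y[idx])
        ensemble.append((tree, feats))
    return ensemble

def predict(ensemble, X):
    votes = np.mean([tree.predict(X[:, feats]) for tree, feats in ensemble], axis=0)
    return (votes >= 0.5).astype(int)   # majority vote; `votes` itself is the averaged score

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
ensemble = train_bagged_forest(X, y, n_models=25, n_subspace=3, rng=rng)
print((predict(ensemble, X) == y).mean())
```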
Since a decision tree is a grouping model whose leaves partition the instance space, so is a
random forest: its corresponding instance space partition is essentially the intersection of
the partitions of the individual trees in the ensemble. While the random forest partition is
therefore finer than most tree partitions, it can in principle be mapped back to a single tree
model (because intersection corresponds to combining the branches of two different
trees). This is different from bagging linear classifiers, where the ensemble has a decision
boundary that can’t be learned by a single base classifier. One could say, therefore, that
the random forest algorithm implements an alternative training algorithm for tree models.

BOOSTING
Boosting is an ensemble technique that is superficially similar to bagging, but uses a more
sophisticated technique than bootstrap sampling to create diverse training sets. The basic
idea is simple and appealing. Suppose we train a linear classifier on a data set and find that its training error rate is ε. We want to add another classifier to the ensemble that does better on the misclassifications of the first classifier. One way to do that is to duplicate the
misclassified instances: if our base model is the basic linear classifier, this will shift the
class means towards the duplicated instances. A better way to achieve the same thing is to
give the misclassified instances a higher weight and to modify the classifier to take these
weights into account (e.g., the basic linear classifier can calculate the class means as a
weighted average).
But how much should the weights change? The idea is that half of the total weight is assigned to the misclassified examples, and the other half to the rest. Since we started with uniform weights that sum to 1, the current weight assigned to the misclassified examples is exactly the error rate ε, so we multiply their weights by 1/(2ε). Assuming ε < 0.5 this is an increase in weight, as desired. The weights of the correctly classified examples get multiplied by 1/(2(1 − ε)), so the adjusted weights again sum to 1. In the next round we do exactly the same, except we take the non-uniform weights into account when evaluating the error rate.
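A minimal sketch of one round of this reweighting scheme, together with the confidence factor αt defined just below (Equation 11.1); the numbers are illustrative.

```python
import numpy as np

def reweight(weights, misclassified):
    """One boosting round: half of the total weight goes to the misclassified examples."""
    eps = weights[misclassified].sum()          # weighted error rate of the current model
    assert 0 < eps < 0.5
    new_w = weights.copy()
    new_w[misclassified] *= 1 / (2 * eps)       # boost the misclassified examples
    new_w[~misclassified] *= 1 / (2 * (1 - eps))
    alpha = 0.5 * np.log((1 - eps) / eps)       # confidence factor for this model
    return new_w, alpha

w = np.full(10, 0.1)                            # uniform weights summing to 1
mis = np.zeros(10, dtype=bool); mis[:2] = True  # suppose 2 of 10 examples are misclassified
w, alpha = reweight(w, mis)
print(w.sum(), w[:2], alpha)                    # the new weights still sum to 1
```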

We need one more ingredient in our boosting algorithm and that is a confidence factor αt for each model in the ensemble, which we will use to form an ensemble prediction that is a weighted average of the individual models' predictions. Clearly we want αt to increase with decreasing εt; a common choice is

αt = ½ ln((1 − εt)/εt)          (11.1)
which we will justify in a moment. The basic boosting algorithm is given in Algorithm
11.3. Figure 11.2 (left) illustrates how a boosted ensemble of five basic linear classifiers
can achieve zero training error. It is clear that the resulting decision boundary is much
more complex than could be achieved by a single basic linear classifier. In contrast, a
bagged ensemble of basic linear classifiers has learned five very similar decision
boundaries, the reason being that on this data set the bootstrap samples are all very
similar.
I will now justify the particular choice for αt in Equation 11.1. First, I will show that the
two weight updates for the misclassified instances and the correctly classified instances
can be written as reciprocal terms δt and 1/δt normalised by some term Zt:

1/(2εt) = δt/Zt          1/(2(1 − εt)) = (1/δt)/Zt          with δt = √((1 − εt)/εt) and Zt = 2√(εt(1 − εt))

so that αt = ln δt = ½ ln((1 − εt)/εt).
Figure 11.2. (left) An ensemble of five boosted basic linear classifiers with majority vote. The linear
classifiers were learned from blue to red; none of them achieves zero training error, but the ensemble
does. (right) Applying bagging results in a much more homogeneous ensemble, indicating that there is
little diversity in the bootstrap samples.
Boosted rule learning
An interesting variant of boosting arises when our base models are partial classifiers that
sometimes abstain from making a prediction. For example, suppose that our base
classifiers are conjunctive rules whose head is fixed to predicting the positive class. An
individual rule therefore either predicts the positive class for those instances it covers, or
otherwise abstains from making a prediction. We can use boosting to learn an ensemble of
such rules that takes a weighted vote among its members.
We need to make some small adjustments to the boosting equations, as follows. Notice that εt is the weighted error of the t-th base classifier. Since our rules always predict positive for covered instances, these errors only concern covered negatives, whose weighted sum we will indicate by εt⊖. Similarly, we indicate the weighted sum of covered positives as εt⊕, which will play the role of 1 − εt. However, with abstaining rules there is a third component, indicated as εt0, which is the weighted sum of instances that the rule doesn't cover. We then have εt⊖ + εt⊕ + εt0 = 1, and the confidence factor becomes αt = ½ ln(εt⊕/εt⊖).