DISCRIMINATIVE VS GENERATIVE MODELS
This is called the conditional likelihood to stress that it gives us the conditional probability P(yi|xi) rather than P(xi) as in a generative model. Notice that our use of the product requires the assumption that the y-values are independent given x; but this is an entirely reasonable assumption, and not nearly as strong as the naive Bayes assumption that x is independent within each class. As usual, the logarithm of the likelihood function is easier to work with:

$$\mathrm{LCL} = \sum_i \ln P(y_i \mid \mathbf{x}_i) = \sum_{i:\,y_i=1} \ln \hat{p}_i \;+\; \sum_{i:\,y_i=0} \ln\left(1-\hat{p}_i\right)$$

where $\hat{p}_i$ abbreviates the predicted probability P(yi = 1|xi).
Although these equations do not yield an analytic solution, they can be used to obtain further insight into the nature of logistic regression. Concentrating on t, some algebraic groundwork (differentiating LCL with respect to t, with $\hat{p}_i = \sigma(\mathbf{w}\cdot\mathbf{x}_i - t)$) yields

$$\frac{\partial\,\mathrm{LCL}}{\partial t} = \sum_i \left(\hat{p}_i - y_i\right)$$

For the optimal solution this partial derivative is zero. What this means is that, on average, the predicted probability should equal the proportion of positives pos. This is a satisfying result, as it is clearly a desirable global property of a calibrated classifier.
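In symbols, with $n$ instances of which a proportion pos are positive, the optimality condition reads

$$\frac{1}{n}\sum_{i=1}^{n}\hat{p}_i = \frac{1}{n}\sum_{i=1}^{n} y_i = pos$$

so the average predicted probability matches the empirical proportion of positives.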
FEATURES
KINDS OF FEATURE:
Consider two integer-valued features, one describing a person's age and the other their house number. Although both are numbers, the way we can use them is quite different: calculating the average age of a group of people is meaningful, but an average house number is probably not very useful. Which operations are meaningful depends on whether the feature values are expressed on a meaningful scale.
House numbers are not really integers but ordinals: we can use them to determine that
number 10’s neighbours are number 8 and number 12, but we cannot assume that the
distance between 8 and 10 is the same as the distance between 10 and 12. Because of the
absence of a linear scale it is not meaningful to add or subtract house numbers, which
excludes operations such as averaging.
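A quick illustration with a hypothetical ordinal feature (shirt sizes): the median only needs the ordering, whereas a mean would require arithmetic that ordinals do not support, just as with house numbers.

```python
sizes = ["S", "M", "L", "M", "S", "M", "L"]    # ordinal shirt sizes
rank = {"S": 0, "M": 1, "L": 2}
median = sorted(sizes, key=rank.get)[len(sizes) // 2]
print(median)   # 'M' -- the ordering is all the median needs
# A mean of sizes is undefined: adding ordinal values has no meaning.
```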
Calculations on features:
Three main categories are statistics of central tendency, statistics of dispersion and shape statistics. The most important statistics of central tendency are:
the mean or average value, to calculate which we need a feature expressed on a linear scale;
the median, which is the middle value if we order the instances from lowest to highest feature value; with few exceptions, the median tends to lie between the mode and the mean;
the mode, which is the majority value or values, and which we can calculate whatever the domain of the feature.
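To make this concrete, here is a minimal sketch using Python's standard statistics module (the ages are invented for illustration):

```python
import statistics

ages = [23, 35, 35, 41, 58]      # a quantitative feature on a linear scale
print(statistics.mean(ages))     # 38.4 -- requires a linear scale
print(statistics.median(ages))   # 35   -- requires only an ordering
print(statistics.mode(ages))     # 35   -- works for any feature domain
```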
Two well-known statistics of dispersion are the variance or average squared deviation
from the mean, and its square root, the standard deviation.
Variance and standard deviation essentially measure the same thing, but the latter has
the advantage that it is expressed on the same scale as the feature itself. For example,
the variance of the body weight in kilograms of a group of people is measured in kg², whereas the standard deviation is measured in kilograms.
A simpler dispersion statistic is the difference between the maximum and minimum value, which is called the range. The midrange point is the mean of the two extreme values.
Other statistics of dispersion include percentiles. The p-th percentile is the value such that p per cent of the instances fall below it. If p is a multiple of 25 the percentiles are also called quartiles, and if it is a multiple of 10 they are also called deciles. Note that the 50th percentile, the fifth decile and the second quartile are all equal to the median. Percentiles, deciles and quartiles are special cases of quantiles. Once we have quantiles we can measure dispersion as the distance between different quantiles. For instance, the interquartile range is the difference between the third and first quartile (i.e., the 75th and 25th percentiles).
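The dispersion statistics just listed are equally easy to compute; a small sketch with NumPy (the weights are made up):

```python
import numpy as np

weights = np.array([55.0, 62.0, 70.0, 71.0, 90.0])   # body weights in kg
print(weights.var())                        # variance, in kg^2
print(weights.std())                        # standard deviation, back in kg
print(weights.max() - weights.min())        # range
print((weights.max() + weights.min()) / 2)  # midrange point
q1, q2, q3 = np.percentile(weights, [25, 50, 75])
print(q2)                                   # second quartile = the median
print(q3 - q1)                              # interquartile range
```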
The skew and peakedness of a distribution can be measured by shape statistics such as
skewness and kurtosis.
The main idea is to calculate the third and fourth central moment of the sample. In general, the k-th central moment of a sample $\{x_1, \ldots, x_n\}$ is defined as

$$m_k = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^k$$

where μ is the sample mean.
Clearly, the first central moment is the average deviation from the mean; this is always
zero, as the positive and negative deviations cancel each other out; and the second
central moment is the average squared deviation from the mean, otherwise known as
the variance. The third central moment m3 can again be positive or negative.
Skewness is then defined as m3/σ³, where σ is the sample’s standard deviation. A
positive value of skewness means that the distribution is right-skewed, which means
that the right tail is longer than the left tail. Negative skewness indicates the opposite,
left-skewed case.
Kurtosis is defined as m4/σ⁴. As it can be shown that a normal distribution has kurtosis 3, people often use excess kurtosis m4/σ⁴ − 3 as the statistic of interest. Briefly, positive excess kurtosis means that the distribution is more sharply peaked than the normal distribution.
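These shape statistics can be computed directly from the definitions above (a sketch; note the population convention for σ, with 1/n, matching the moment definition):

```python
import numpy as np

def central_moment(x, k):
    """k-th central moment: mean k-th power of deviations from the mean."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** k)

def skewness(x):
    return central_moment(x, 3) / np.std(x) ** 3

def excess_kurtosis(x):
    return central_moment(x, 4) / np.std(x) ** 4 - 3

x = [1.0, 2.0, 2.0, 3.0, 10.0]   # long right tail
print(skewness(x))               # positive, i.e., right-skewed
print(excess_kurtosis(x))
```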
FEATURE TRANSFORMATIONS
A feature transformation improves the utility of a feature by removing, changing or adding information. Feature types can be ordered by how detailed they are: quantitative features are more detailed than ordinal ones, followed by categorical features, and finally Boolean features.
The best-known feature transformations are those that turn a feature of one type into
another of the next type down this list. But there are also transformations that change the
scale of quantitative features, or add a scale to ordinal, categorical and Boolean features.
Table 10.2 introduces the terminology we will be using.
Binarisation transforms a categorical feature into a set of Boolean features, one for
each value of the categorical feature. This loses information since the values of a single
categorical feature are mutually exclusive, but is sometimes needed if a model cannot
handle more than two feature values.
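As a minimal sketch with a hypothetical colour feature, binarisation replaces the single categorical feature by one Boolean feature per value:

```python
def binarise(value, domain):
    """One Boolean feature per value of the categorical feature."""
    return [value == v for v in domain]

print(binarise("green", ["red", "green", "blue"]))   # [False, True, False]
```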
Unordering trivially turns an ordinal feature into a categorical one by discarding the
ordering of the feature values. This is often required since most learning models cannot
handle ordinal features directly. An interesting alternative that we will explore below is
to add a scale to the feature by means of calibration.
Incomplete features
Consider what to do if we don’t know a feature’s value for some of the instances. An example would be how to classify an e-mail if we don’t know whether or not it contains one of the vocabulary words. Probabilistic models handle this rather gracefully by taking a weighted average over all possible values of the feature:

$$P(Y \mid X) = \sum_{z} P(Y \mid X, Z\!=\!z)\, P(Z\!=\!z)$$
Here, Y is the target variable as usual, X stands for the features that are observed for the instance to be classified, and Z stands for the features that are unobserved at classification time. The distribution P(Z) can be obtained from the trained model, at least for a generative model; if our model is discriminative we need to estimate P(Z) separately.
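A tiny numeric sketch of this weighted average, with a single unobserved Boolean feature Z and made-up probabilities:

```python
# p_y_given_xz[z] is P(Y=pos | X=x, Z=z) as reported by the trained model;
# p_z[z] is our estimate of P(Z=z).
p_y_given_xz = {True: 0.9, False: 0.4}
p_z = {True: 0.3, False: 0.7}

p_y_given_x = sum(p_y_given_xz[z] * p_z[z] for z in (True, False))
print(p_y_given_x)   # 0.9*0.3 + 0.4*0.7 = 0.55
```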
Missing feature values at training time are trickier to handle. First of all, the very fact that
a feature value is missing may be correlated with the target variable. For example, the
range of medical tests carried out on a patient is likely to depend on their medical history.
For such features it may be best to have a designated ‘missing’ value so that, for instance,
a tree model can split on it. However, this would not work for, say, a linear model. In such
cases we can complete the feature by ‘filling in’ the missing values, a process known as
imputation.
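The simplest imputation strategy fills in the mean of the observed values; a minimal sketch (NumPy assumed, NaN marking a missing value; more sophisticated methods predict the missing value from the other features):

```python
import numpy as np

def impute_mean(column):
    """Fill in missing values (NaN) with the mean of the observed ones."""
    column = np.asarray(column, dtype=float)
    observed_mean = np.nanmean(column)
    return np.where(np.isnan(column), observed_mean, column)

print(impute_mean([1.0, np.nan, 3.0]))   # [1. 2. 3.]
```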
Figure 11.1. (left) An ensemble of five basic linear classifiers built from bootstrap samples with bagging.
The decision rule is majority vote, leading to a piecewise linear decision boundary. (right) If we turn the
votes into probabilities, we see the ensemble is effectively a grouping model: each instance space
segment obtains a slightly different probability.
With voting we see that bagging creates a piecewise linear decision boundary, something
that is impossible with a single linear classifier. If we transform the votes from each
model into probability estimates, we see that the different decision boundaries partition
the instance space into segments that can potentially each receive a different score.
Bagging is particularly useful in combination with tree models, which are quite sensitive
to variations in the training data. When applied to tree models, bagging is often combined
with another idea: to build each tree from a different random subset of the features, a
process also referred to as subspace sampling. This encourages diversity in the ensemble even more, and has the additional advantage that the training time of each tree is
reduced. The resulting ensemble method is called random forests, and the algorithm is
given in Algorithm 11.2.
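A minimal sketch of the two ideas combined, bootstrap sampling plus per-tree subspace sampling (the function names are mine, and sklearn's DecisionTreeClassifier stands in for an arbitrary tree learner; this follows the scheme described here, not Algorithm 11.2 verbatim):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=10, n_sub_features=2, seed=0):
    """Bagging with subspace sampling: each tree sees a bootstrap sample
    of the instances and a random subset of the features."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(X), size=len(X))        # bootstrap sample
        cols = rng.choice(X.shape[1], n_sub_features, replace=False)
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def random_forest_predict(forest, X):
    """Majority vote over the trees (binary labels 0/1 assumed)."""
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```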
Since a decision tree is a grouping model whose leaves partition the instance space, so is a
random forest: its corresponding instance space partition is essentially the intersection of
the partitions of the individual trees in the ensemble. While the random forest partition is
therefore finer than most tree partitions, it can in principle be mapped back to a single tree
model (because intersection corresponds to combining the branches of two different
trees). This is different from bagging linear classifiers, where the ensemble has a decision
boundary that can’t be learned by a single base classifier. One could say, therefore, that
the random forest algorithm implements an alternative training algorithm for tree models.
BOOSTING
Boosting is an ensemble technique that is superficially similar to bagging, but uses a more
sophisticated technique than bootstrap sampling to create diverse training sets. The basic
idea is simple and appealing. Suppose we train a linear classifier on a data set and find
that its training error rate is ε. We want to add another classifier to the ensemble that does
better on the misclassifications of the first classifier. One way to do that is to duplicate the
misclassified instances: if our base model is the basic linear classifier, this will shift the
class means towards the duplicated instances. A better way to achieve the same thing is to
give the misclassified instances a higher weight and to modify the classifier to take these
weights into account (e.g., the basic linear classifier can calculate the class means as a
weighted average).
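For instance, a weighted class mean might be computed as follows (a sketch; X holds one class's instances row-wise, w their weights):

```python
import numpy as np

def weighted_class_mean(X, w):
    """Mean of the rows of X under instance weights w: upweighted
    (previously misclassified) instances pull the mean towards themselves."""
    return (w[:, None] * X).sum(axis=0) / w.sum()
```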
But how much should the weights change? The idea is that half of the total weight is assigned to the misclassified examples, and the other half to the rest. Since we started with uniform weights that sum to 1, the current weight assigned to the misclassified examples is exactly the error rate ε, so we multiply their weights by 1/(2ε). Assuming ε < 0.5, this is an increase in weight, as desired. The weights of the correctly classified examples get multiplied by 1/(2(1−ε)), so the adjusted weights again sum to 1. In the next round we do exactly the same, except that we take the non-uniform weights into account when evaluating the error rate.
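The reweighting step just described, as a small sketch:

```python
import numpy as np

def reweight(w, misclassified):
    """One boosting reweighting step: give half the total weight to the
    misclassified examples and half to the rest. `misclassified` is a
    Boolean array; w is assumed to sum to 1."""
    eps = w[misclassified].sum()             # weighted error rate
    w = w.copy()
    w[misclassified] *= 1 / (2 * eps)
    w[~misclassified] *= 1 / (2 * (1 - eps))
    return w                                 # sums to 1 again

w = np.full(4, 0.25)
print(reweight(w, np.array([True, False, False, False])))
# [0.5        0.16666667 0.16666667 0.16666667]
```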
We need one more ingredient in our boosting algorithm, and that is a confidence factor αt for each model in the ensemble, which we will use to form an ensemble prediction that is a weighted average of the individual models. Clearly we want αt to increase with decreasing εt: a common choice is

$$\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} \qquad (11.1)$$
which we will justify in a moment. The basic boosting algorithm is given in Algorithm
11.3. Figure 11.2 (left) illustrates how a boosted ensemble of five basic linear classifiers
can achieve zero training error. It is clear that the resulting decision boundary is much
more complex than could be achieved by a single basic linear classifier. In contrast, a
bagged ensemble of basic linear classifiers has learned five very similar decision
boundaries, the reason being that on this data set the bootstrap samples are all very
similar.
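For concreteness, here is a compact sketch of the boosting loop just described, with decision stumps standing in for the weighted base classifier (a sketch under those assumptions, not Algorithm 11.3 verbatim; labels are assumed to be in {−1, +1}):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, rounds=5):
    """Boosting with decision stumps as weighted base classifiers."""
    w = np.full(len(X), 1 / len(X))              # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        eps = w[miss].sum()                      # weighted error rate
        if eps <= 0 or eps >= 0.5:               # perfect, or no better than chance
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # Equation 11.1
        w[miss] *= 1 / (2 * eps)                 # half the weight to mistakes
        w[~miss] *= 1 / (2 * (1 - eps))          # half to the rest
        ensemble.append((alpha, stump))
    return ensemble

def boost_predict(ensemble, X):
    """Confidence-weighted vote of the ensemble members."""
    return np.sign(sum(alpha * m.predict(X) for alpha, m in ensemble))
```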
I will now justify the particular choice of αt in Equation 11.1. First, I will show that the two weight updates, for the misclassified and the correctly classified instances, can be written as reciprocal terms δt and 1/δt normalised by some term Zt:

$$\frac{1}{2\epsilon_t} = \frac{\delta_t}{Z_t} \qquad \frac{1}{2(1-\epsilon_t)} = \frac{1/\delta_t}{Z_t} \qquad \text{with } \delta_t = \sqrt{\frac{1-\epsilon_t}{\epsilon_t}},\quad Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$$

Notice that $\alpha_t = \ln \delta_t$, which is exactly the choice in Equation 11.1.
Figure 11.2. (left) An ensemble of five boosted basic linear classifiers with majority vote. The linear
classifiers were learned from blue to red; none of them achieves zero training error, but the ensemble
does. (right) Applying bagging results in a much more homogeneous ensemble, indicating that there is
little diversity in the bootstrap samples.
Boosted rule learning
An interesting variant of boosting arises when our base models are partial classifiers that
sometimes abstain from making a prediction. For example, suppose that our base
classifiers are conjunctive rules whose head is fixed to predicting the positive class. An
individual rule therefore either predicts the positive class for those instances it covers, or
otherwise abstains from making a prediction. We can use boosting to learn an ensemble of
such rules that takes a weighted vote among its members.
We need to make some small adjustments to the boosting equations, as follows. Notice that $\epsilon_t$ is the weighted error of the t-th base classifier. Since our rules always predict positive for covered instances, these errors only concern covered negatives, whose weighted sum we will indicate by $\epsilon_t^{\ominus}$. Similarly, we indicate the weighted sum of the covered positives as $\epsilon_t^{\oplus}$, which will play the role of $1-\epsilon_t$. However, with abstaining rules there is a third component, indicated as $\epsilon_t^{0}$: the weighted sum of the instances that the rule doesn't cover. We then have $\epsilon_t^{\ominus} + \epsilon_t^{\oplus} + \epsilon_t^{0} = 1$.