Machine Learning-2


UNIT II

Unit II: Supervised Learning: Rationale and Basics: Learning from Observations, Bias and Why Learning Works: Computational Learning Theory, Occam's Razor Principle and Overfitting Avoidance, Heuristic Search in Inductive Learning, Estimating Generalization Errors, Metrics for Assessing Regression, Metrics for Assessing Classification.

2.1 What is supervised learning?


Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher. Basically, supervised learning is learning in which we teach or train the machine using data that is well labelled, meaning some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data), so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labelled data.
For instance, suppose you are given a basket filled with different kinds of fruit. The first step is to train the machine on all the different fruits one by one, like this:

(i) If the shape of the object is rounded with a depression at the top and its colour is red, then it is labelled as Apple.
(ii) If the shape of the object is a long curving cylinder and its colour is green-yellow, then it is labelled as Banana.

Now suppose that, after training on the data, you are given a new, separate fruit (say a banana) from the basket and asked to identify it. Since the machine has already learned from the previous data, it must now use that knowledge wisely. It will first classify the fruit by its shape and colour, confirm the fruit name as BANANA, and put it in the banana category. Thus the machine learns from the training data (the basket of fruit) and then applies that knowledge to the test data (the new fruit).
Supervised learning is classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a category, such as "red" or "blue", or "disease" and "no disease".
Regression: A regression problem is when the output variable is a real value, such as "dollars" or "weight".
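As a minimal sketch of this workflow (the feature encoding, the toy dataset, and the choice of a decision-tree classifier are illustrative assumptions, not part of the original example), a model can be trained on labelled fruit data and then asked to label a new fruit:

```python
# Minimal supervised-learning sketch using scikit-learn (illustrative only).
# Hypothetical feature encoding: shape (0 = rounded, 1 = long cylinder)
# and colour (0 = red, 1 = green-yellow).
from sklearn.tree import DecisionTreeClassifier

X_train = [[0, 0], [0, 0], [1, 1], [1, 1]]          # labelled training fruits
y_train = ["Apple", "Apple", "Banana", "Banana"]    # correct answers (labels)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)                         # learn from labelled data

new_fruit = [[1, 1]]                                # a long, green-yellow fruit
print(model.predict(new_fruit))                     # -> ['Banana']
```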

2.2 Rationale and Basics:


2.2.1 Learning from observations:
There is some set S of possible patterns or observations over which various output functions may be defined. The training experience is available in the form of N patterns s(i) ∈ S; i = 1, 2, ..., N. We specify a pattern by a fixed number n of attributes/features xj; j = 1, 2, ..., n; where each feature has a real numerical value for the pattern. The domain of xj is given by a set Vxj ⊂ R of its values. A data pattern s(i) has the feature-value set {x1(i), ..., xn(i)}, where xj(i) ∈ Vxj. We can visualize each pattern with n numerical features as a point in the n-dimensional state space Rn:
x = [x1, x2, ..., xn]T ∈ Rn
The set X is the finite set of feature vectors x(i) for all the N patterns. We can visualize X as a region of the state space Rn to which the patterns belong, i.e., X ⊂ Rn. Note that x(i) is the representation of s(i), and X is the representation space.
In general, learning is most reliable when the training examples follow a distribution similar to that of the future patterns which the machine will receive; the machine will be required to give estimated output values ŷ for these patterns. If the training experience consists of data that lies in a region of the state space, then that region must be fully representative of situations over which the trained machine will later be used. Most current theory of machine learning rests on the crucial assumption that the distribution of training examples is identical to the distribution of unseen examples.
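As a small concrete illustration of this representation (the numbers are made up), N = 3 patterns with n = 2 features each can be stored as rows of a matrix, i.e., as points in R2:

```python
import numpy as np

# Three patterns (N = 3), each described by n = 2 real-valued features,
# stored as an N x n matrix: row i is the feature vector x(i) in R^2.
X = np.array([[1.2, 0.7],
              [0.3, 2.1],
              [1.8, 1.5]])
print(X.shape)   # (3, 2): N patterns as points in n-dimensional space
```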

Empirical – Risk – Minimization:


The learning problem defined before is, in fact, intractable, since we do not know p(x, y) explicitly. All that is available is a training dataset of examples drawn by independently sampling the (X × Y) space according to some unknown probability distribution. Therefore, a learning machine can at best guarantee that the estimated output values ŷ fit the true values y over the training data.
The key assumption for reliable learning is that the training examples and future examples are drawn randomly and independently from some population of examples with the same probability distribution. If this assumption is met, we can claim that a learning machine giving 100% accuracy on training data will give high accuracy on unseen data. However, this assumption is not guaranteed to hold. In fact, machine learning tasks based on real-world data are unlikely to find the iid (independently and identically distributed) data assumption tenable. Real-world data tend to be incomplete, noisy, and inconsistent.

Inductive learning:
The task of inductive learning is this:
Given a collection of examples (x(i), f(x(i))); i = 1, 2, ..., N, of a function f(x), return a function h(x) that approximates f(x).
In the statistical literature, the approximating function h(x) is called the hypothesis function. The true function f(x) correctly maps the input space X to the output space Y.
This function is not known in a real-world decision-making problem. Satisfaction with a hypothesis function h(x) that has high accuracy on the given collection of examples of f(x) could be premature, because the central aim of designing a machine is to suggest decisions when presented with novel patterns.
From a conceptual point of view, it is not easy to tell whether any particular h(.) is a good approximation of f(.). A good approximation will generalize well, that is, it will predict novel patterns correctly. Generalization performance is, thus, the fundamental problem in inductive learning.
How do we judge the generalization performance of an algorithm? We can estimate such performance by means of a test dataset, sampled independently just as the training set is. The off-training-set error, the error on points not in the training set, is used as a measure of generalization performance.
The assumption in inductive learning is that the ideal hypothesis for unseen patterns is the one induced by the observed training data. Inductive learning techniques obtain a general hypothesis by seeking empirical regularities over the training examples. These regularities approximate the mapping function well over the observed examples; the learner generalizes from the specific training examples, hypothesizing a general function covering these examples.
Inductive learning, also known as discovery learning, is a process where the learner discovers rules by observing examples. This is different from deductive learning, where students are given rules that they then need to apply.
We can often work out rules for ourselves by observing examples to see if there is a pattern, that is, to see if things regularly happen in the same way. We then test the rules by applying them in different situations to see if they work. With inductive learning, tasks can be designed specifically to guide the learner and assist them in discovering a rule.
Example: when children are first shown how to ride a bicycle, it is not possible for them to cycle unaided immediately. Parents guide and assist their children until they have gained the confidence and skills that enable them to ride on their own.

2.3 Bias and Variance:


Bias and variance are most easily understood in the context of regression. It is convenient to consider the particular case of a hypothesis function trained using the squared-error risk function, although our conclusions will be much more general. Let us suppose that there is a true (yet unknown) function f(x) possessing continuous-valued output y with noise, and we try to estimate this function on the basis of N samples in a set Dj (a random sample) generated from f(x).
The estimated regression function is denoted h(x; Dj), and we are interested in the dependence of this approximation on the training set Dj. Due to random variations in the selection of the datasets Dj; j = 1, ..., K, for some datasets the approximation will be excellent, while for other datasets it will be poor. The effectiveness of the estimator can be expressed as its mean-square deviation from the desired optimum, [h(x; Dj) – f(x)]2.
A measure of how close the mapping function h(x; Dj) is to the desired one is, therefore, given by the error function,
ErrorDj[h] = [h(x; Dj) – f(x)]2
The value of this quantity will depend on the particular dataset Dj on which it is trained. We write the average over the complete ensemble of datasets as,
ErrorD[h] = ED{[h(x; Dj) – f(x)]2}
A non-zero error can arise for essentially two reasons:
1. It may be that the hypothesis function h(.) is, on average, different from the regression function f(x). This is called bias.
2. It may be that the hypothesis function is very sensitive to the particular dataset Dj, so that for a given x, it is larger than the required value for some datasets, and smaller for other datasets. This is called variance.

The bias measures the level to which the average of the hypothesis function is different from the desired function f(x). The variance is a measure of the level to which the hypothesis function h(x; Dj) is sensitive to the specific selection of the dataset.
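These two contributions can be made explicit with the standard bias-variance decomposition (a well-known identity, stated here for completeness), obtained by adding and subtracting ED[h(x; D)] inside the square:

```latex
E_D\{[h(x;D) - f(x)]^2\}
  = \underbrace{\big(E_D[h(x;D)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{E_D\big\{\big(h(x;D) - E_D[h(x;D)]\big)^2\big\}}_{\text{variance}}
```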

2.4 Why learning works: Computational Learning Theory:


Computational Learning Theory (CoLT) is a field of AI research studying the design of machine
learning algorithms to determine what sorts of problems are “learnable.” The ultimate goals are to understand
the theoretical underpinnings of deep learning programs, what makes them work or not, while improving
accuracy and efficiency.
(OR)
Computational Learning Theory (CLT) is a branch of statistics/machine learning/artificial intelligence
(in general) which deals with fundamental bounds and theorems about analyzing our (man and machine)
ability to learn rules and patterns from data.
What are some of the fundamental questions?
When studying machine learning it is natural to wonder what general laws may govern machine (and non-
machine) learners. For example,
 Is it possible to identify classes of learning problems that are inherently difficult or easy, independent
of the learning algorithm?
 Can one characterize the number of training examples necessary or sufficient to assure successful
learning?
 How is this number affected if the learner is allowed to pose queries to the trainer, versus observing a
random sample of training examples?
 Can one characterize the number of mistakes that a learner will make before learning the target
function?
 Can one characterize the inherent computational complexity of classes of learning problems?
But we can focus on a particular setting which comes up most often in any practical machine learning task: that of inductively learning an unknown target function, given only training examples of this target function and a set of candidate hypotheses.
Within this setting, we are chiefly concerned with questions such as,
 How many training examples are sufficient to successfully learn the target function, and
 How many mistakes will the learner make before succeeding?
As we shall see, it is possible to set quantitative bounds on these measures, depending on attributes of the
learning problem such as,
 The size or complexity of the hypothesis space considered by the learner
 The accuracy to which the target concept must be approximated
 The probability that the learner will output a successful hypothesis
 The manner in which training examples are presented to the learner.
“Computational Learning Theory (CLT)” tries to answer fundamental questions about the limits and
possibilities of the whole enterprise of machine learning.

Illustrations of the Keywords:


For the sake of a comprehensive discussion, here is my attempt to define the basic keywords generally used in
CLT.

 Data: A given set of examples from which we are trying to learn a target concept.

 Target concept (aka rule): This is the hidden pattern we are trying to learn from the data for example
— “We go out for lunch IF it is sunny AND we got friends OR IF favourite cuisine restaurant is
nearby AND we got friends AND even IF it is NOT sunny”. This kind of rule is written in terms of
statements belonging to propositional logic i.e. Truth values expressed with AND, OR, NOT.
 Hypothesis space: This is a set of hypotheses from which we hope to discover the target concept. In
this example, the ‘intelligent’ choice of hypotheses should look like the logically connected phrase we
wrote above as the target concept. But they could be simpler or different looking such as,
 IF it is sunny
 IF we got friends OR it is sunny
These two hypotheses are not rich enough to capture the true target concept and they will have
errors if applied to the given data.
Why?
Because they are not of the form "disjunction of conjunctive statements", i.e. (X1 & X2) | (X3 & X4) | (X5 & !X6). They are too simplistic and general, and unable to capture all the nuances of the data presented. The success of the search for a target concept in a machine learning task depends largely on the richness and complexity of the hypothesis space you choose to work with.
 Predictor and response variables: These are self-explanatory. 'Sunny?', 'Favourite restaurant distance' and 'Got friends?' are predictor variables, and 'Go out for lunch' is the response variable. If you read the conditions satisfied by the edges of a decision tree from the root down to a leaf, you automatically get propositional logic statements.

For the most part, CLT focuses not on individual learning algorithms, but rather on broad classes of learning
algorithms characterized by the hypothesis spaces they consider, the presentation of training examples, etc.
The key quantities it tries to reason about are:
Sample complexity: How many training examples are needed for a learner to converge (with high probability) to a successful hypothesis?
Computational complexity: How much computational effort is needed for a learner to converge (with high probability) to a successful hypothesis?
Mistake bound: How many training examples will the learner misclassify before converging to a successful hypothesis?

 Probably Approximately Correct (PAC):


Let H be the hypothesis space, D the unknown true distribution, and c the target concept.
Let us say that after observing some data d, the learning algorithm L outputs a hypothesis h which it thinks is the best approximation of c.
What is the measure of error/success here? The error is the probability that the predictions of c and h disagree on an instance drawn from D.
Our aim is to characterize classes of target concepts that can be reliably learned from a reasonable number of randomly drawn training examples and a reasonable amount of computation. We don't have infinite time, compute power, storage, or sensor bandwidth. Therefore, we don't try to make the error zero; for that, we would need to see every possible instance generated by D. For the same reason, we restrict ourselves to examining limited training samples and using a polynomial-time algorithm. Moreover, given that the training examples are drawn randomly, there will always be some nonzero probability that the training examples encountered by the learner will be misleading.

Therefore, we will require only that the error be bounded by some constant ε that can be made arbitrarily small. Moreover, we will not require that the learner succeed for every sequence of randomly drawn training examples; instead, our demand will only be that its probability of failure be bounded by some constant δ that can be made arbitrarily small.
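In the standard PAC notation (using the conventional symbols ε for the error bound and δ for the failure probability), these two requirements can be written as:

```latex
\text{error}_D(h) = \Pr_{x \sim D}\left[c(x) \neq h(x)\right], \qquad
\Pr\left[\text{error}_D(h) \le \varepsilon\right] \ge 1 - \delta
```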
 Sample complexity bound:
For a consistent learner searching a finite hypothesis space H, the bound is given by,
N ≥ (1/ε)(ln |H| + ln(1/δ))
This is called the Haussler bound. This inequality readily demonstrates the following facts, which we experience in practice in any machine learning task:
 We need a larger number of training examples to reduce the test/generalization error.
 The algorithm needs to 'see' more training examples to increase its confidence in learning, i.e., to reduce its probability of mistake.
 A richer and bigger hypothesis space means a larger space to search through in the quest for the correct hypothesis, and that needs a larger number of training examples.
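As a quick worked illustration of the bound (the numbers are arbitrary), the sufficient number of training examples can be computed directly:

```python
import math

def haussler_bound(hypothesis_space_size, epsilon, delta):
    """Number of examples N sufficient for a consistent learner so that,
    with probability at least 1 - delta, the true error is at most epsilon."""
    return math.ceil((math.log(hypothesis_space_size) + math.log(1 / delta)) / epsilon)

# |H| = 2^10 hypotheses, 5% error tolerance, 95% confidence.
print(haussler_bound(2**10, epsilon=0.05, delta=0.05))  # -> 199
```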

2.5 Occam's Razor Principle and Over fitting Avoidance:


The Franciscan monk William of Occam was born around 1280, long before the invention of modern statistics. His name is linked with machine learning through the basic idea that "simpler explanations are more reasonable, and any unnecessary complexity should be shaved off". This line of reasoning, Occam's razor principle, is his contribution to the domain of machine learning. If there are two algorithms and both of them perform equally well on the training set, then according to Occam's razor principle the simpler algorithm can be expected to do better on a test set; 'simpler' may imply needing fewer parameters, less training time, fewer attributes for data representation, and lower computational complexity.
The main idea is that the simplest and most direct solution should be preferred, or, among competing hypotheses, the simplest one or the one with the fewest assumptions should be applied. Occam's razor also has modern applications to state-of-the-art technologies; one example is the application of the principle to machine learning. With machine learning, engineers work to train computers on sets of training data to enable them to learn and go beyond the limits of their original codebase programming. Machine learning involves implementing algorithms, data structures, and training systems that allow machines to learn on their own and produce evolving results.
Simplification procedures such as "feature selection" and "dimensionality reduction" are also examples of applying the Occam's razor principle of simplifying models to get better results. There are two parts that are considered the basics of Occam's razor, originally written in Latin:
 The principle of plurality: plurality should not be posited without necessity.
 The principle of parsimony: it is pointless to do with more what can be done with less.

Overfitting:
A statistical model is said to be overfitted when it fits the training data too closely. When a model gets trained on so much detail, it starts learning from the noise and inaccurate entries in our dataset. The model then does not categorize new data correctly, because of too many details and too much noise. Common causes of overfitting are non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model based on the dataset and can therefore build unrealistic models. A solution to avoid overfitting is to use a linear algorithm if we have linear data, or to use parameters such as the maximal depth if we are using decision trees.

A typical plot of accuracy against classifier complexity depicts the effect of overfitting in machine learning. The horizontal axis of the plot shows how complex the classifier is. The vertical axis is an indicator of how accurate the forecasts made by the classifier are. The solid line depicts how accurate the classifier is over the training examples, while the broken line shows the accuracy measured over an independent set of test examples. Predictably, the accuracy of the classifier over the training examples increases monotonically as the classifier becomes more complex, but the accuracy over the independent test examples first grows and then falls.
How to avoid overfitting:
The commonly used methodologies are (a short code sketch follows this list):
 Cross-validation: A standard way to find the out-of-sample prediction error is to use 5-fold cross-validation.
 Early stopping: Its rules provide guidance as to how many iterations can be run before the learner begins to overfit.
 Pruning: Pruning is extensively used while building tree-based models. It simply removes the nodes which add little predictive power for the problem in hand.
 Regularization: It introduces a cost term for bringing in more features into the objective function. Hence it tries to push the coefficients of many variables towards zero and thus reduce the cost term.
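Here is a minimal sketch combining the cross-validation and regularization items above (the synthetic data and the penalty strength are illustrative assumptions):

```python
# Illustrative sketch: 5-fold cross-validation of a regularized (ridge) model.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # made-up data
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

model = Ridge(alpha=1.0)                            # penalty discourages complexity
scores = cross_val_score(model, X, y, cv=5)         # 5-fold out-of-sample R^2
print(scores.mean())
```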
Underfitting:
A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data. Underfitting destroys the accuracy of our machine learning model. Its occurrence simply means that our model or algorithm does not fit the data well enough. It usually happens when we have too little data to build an accurate model, or when we try to build a linear model with non-linear data. In such cases the rules of the machine learning model are too simple and inflexible to be applied to such minimal data, and the model will probably make a lot of wrong predictions. Underfitting can be avoided by using more data and also by reducing the number of features through feature selection.

2.6 Heuristic search in inductive learning:


2.6.1 Search through hypothesis space:
The word 'heuristic' indicates that strategies for the search problem cannot be precisely pre-defined. Trial and error is the approach used to search for a 'good enough' solution. This approach, however, becomes more organized if we exploit prior knowledge about the learning domain and experience with a broad range of algorithms. Applied machine learning organizes the search as per the following two-step procedure:
 The search is first focused on a class of possible hypotheses, chosen for the learning task at hand. Prior knowledge and experience are helpful in this selection. Different classes are appropriate for different kinds of learning tasks.
 For each of the classes, the corresponding learning algorithm organizes the search through all possible structures of the learning machine.

The principal techniques used in heuristic search to optimize hypothesis complexity for a given training dataset are the following.

Regularization: The regularization model promotes smoother functions by creating a new criterion function that relies not only on the training error, but also on algorithmic complexity. In particular, the new criterion function penalizes highly complex hypotheses; looking for the minimum of this criterion is a way to balance error on the training set against complexity. Formally, it is possible to write the new criterion as the sum of the error on the training set plus a regularization term Ω, which captures constraints or sought-after properties of the solution:

E' = (error on data) + λ × (hypothesis complexity Ω)

where λ gives the weight of the penalty. The second term penalizes complex hypotheses with large variance. When we minimize the augmented error function instead of the error on the data only, we penalize complex hypotheses and thus decrease variance. When λ is taken too large, only very simple functions are allowed and we risk introducing bias. λ is optimized using cross-validation.

Early Stopping: The training of a learning machine corresponds to an iterative decrease in the error function defined on the training data. During a specific training session, this error generally reduces as a function of the number of iterations of the algorithm. Stopping the training before attaining the minimum training error represents a technique of restricting the effective hypothesis complexity.
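A minimal sketch of early stopping (the linear model, the synthetic data, and the patience rule are illustrative assumptions): training halts when the validation error stops improving, rather than when the training error reaches its minimum:

```python
# Illustrative early-stopping loop for gradient-descent training of a linear model.
import numpy as np

rng = np.random.default_rng(1)
X_tr, X_val = rng.normal(size=(80, 3)), rng.normal(size=(20, 3))
true_w = np.array([1.0, -2.0, 0.5])
y_tr = X_tr @ true_w + 0.3 * rng.normal(size=80)
y_val = X_val @ true_w + 0.3 * rng.normal(size=20)

w = np.zeros(3)
best_val, best_w, patience = np.inf, w.copy(), 0
for step in range(1000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # gradient of training MSE
    w -= 0.01 * grad
    val_err = np.mean((X_val @ w - y_val) ** 2)         # monitor validation error
    if val_err < best_val:
        best_val, best_w, patience = val_err, w.copy(), 0
    else:
        patience += 1
        if patience >= 10:          # stop before the training error bottoms out
            break
print(step, best_val)
```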

Pruning: An alternative that is sometimes more successful than early stopping of the growth of the hypothesis is pruning a full-grown hypothesis that is likely to be overfitting the training data. Pruning is the basis of search in many decision-tree algorithms; the weakest branches of a large tree that overfits the training data, which hardly reduce the error rate, are removed.

2.6.2 Ensemble learning:


For a given amount and quality of training data, the output of a single hypothesis function may be inadequate for the problem at hand. A way to make more reliable decisions is to create a combination of the outputs of many different hypotheses. Many machine learning algorithms do this by learning an ensemble of hypotheses and employing them in a combined form. Bagging and boosting are the most frequently used among these schemes. These general methods may be applied to classification and regression problems.
In the bagging technique, the individual models are constructed separately, whereas in boosting, each new model is influenced by the performance of those built earlier. In boosting, we first build a model with accuracy on the training set greater than average, and then add new component classifiers to form an ensemble whose joint decision rule possesses a high level of accuracy on the training set. AdaBoost (Adaptive Boosting) is a widely used algorithm for boosting.
Yet another ensemble technique is known as the random forest. Suppose each classifier in the ensemble is a decision tree classifier; the set of such classifiers then gives rise to a forest. The individual decision trees are created with the help of a random choice of attributes at each node to identify the split. More formally, each tree is dependent on the values of a random vector sampled independently and with the same distribution for all trees in the forest. During classification, each tree votes, and the most popular class is returned.
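A brief sketch of a random forest using scikit-learn (the toy dataset and hyperparameters are illustrative assumptions):

```python
# Illustrative random-forest ensemble: many randomized trees vote on the class.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
forest.fit(X, y)              # each tree sees a bootstrap sample + random features
print(forest.predict(X[:3]))  # majority vote across the 50 trees
```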
Class-imbalanced problems:
Ensemble methods have been used to solve class-imbalanced problems. Two-class data are class-imbalanced if the primary class of interest is represented by just a few samples in the dataset, whereas the majority of the samples represent the other class.
For multiclass-imbalanced data, the data distribution of each class is substantially different, and again the primary class or classes of interest are represented by merely a few samples. The class-imbalanced problem is very similar to cost-sensitive learning, wherein the costs of error per class are unequal.
The general approaches for improving the classification performance of class-imbalanced data are:
 Oversampling
 Undersampling
 Threshold moving
 Ensemble methods
2.6.3 Evaluation of a learning system:
 Accuracy: Inductive learning is based on empirical data. The learning system extracts knowledge
from the training data. The learned knowledge should be general enough to deal with unknown data.
The generalization capability of a learning system is an index of accuracy of the learning machine.
 Robustness: Robustness means that the machine can perform adequately under all circumstances,
including the cases when information is corrupted by noise, is incomplete, and is interfered with
irrelevant data.
 Computational complexity and speed: The computational complexity of a learning algorithm and its learning speed determine the efficiency of a learning system.
 Online learning: This system can continue to assimilate new data.
 Interpretability: It is subjective and hence tougher to evaluate. Interpretation is easy in fuzzy logic
systems and decision trees.

2.7 Estimating Generalization Errors:


The success of learning depends on hypothesis space complexity and sample complexity, and the two are interdependent. The goal is to find a function that is simplest in terms of complexity and best in terms of empirical error on the data. Such a choice is expected to give good generalization performance. Basically, finding a hypothesis function of complexity consistent with the given training data is the problem at hand.
2.7.1 Holdout Method and Random Subsampling:
In the holdout technique, some amount of data is earmarked for the purpose of testing, while the remainder is employed for training. In order to divide dataset D into training and test sets, we can randomly sample a set of training examples from D and keep aside the remainder for testing. If the data is collected over time, then we can use the earlier part for training and the later part of the data for testing.
In a single holdout process, we may consider swapping the roles of the training and test data, that is, training the machine on the test data and estimating the success rate using the training data, and then averaging the two results, thereby decreasing the effect of uneven distribution in the training and test sets. This is, however, only possible with a 50:50 split of the full data between training and test sets, which is usually not ideal; it is better to employ more than 50% of the data for training, even at the cost of test data. A simple variant of this idea, however, becomes the basis of a powerful statistical method known as cross-validation.
2.7.2 Cross validation:
A commonly used technique for estimating the success rate of a learning method, given a fixed data sample, is k-fold cross-validation.
 k-fold cross-validation: In k-fold cross-validation, the given data D is randomly divided into k mutually exclusive subsets or folds, Dk; k = 1, 2, ..., k, each of approximately equal size. Training and testing are done k times. In iteration k, partition Dk is set aside for testing and the remaining partitions are collectively employed to train the model. That is, in the first iteration, the set D2 ∪ D3 ∪ ... ∪ Dk serves as the training set to obtain the first model, which is tested on D1; the second iteration is trained on D1 ∪ D3 ∪ ... ∪ Dk and tested on D2; and so on. If stratification is also used, the procedure is known as stratified k-fold cross-validation for classification.
 Leave-one-out cross-validation: This is a special case of k-fold cross-validation in which k is set to the number N of initial tuples. In other words, only a single sample is 'left out' for the test set in each iteration. The learning machine is trained on the remainder of the samples and is judged by its accuracy on the left-out sample. The average of the outcomes of all N judgements in N iterations is taken, and this average is representative of the error estimate.
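Both procedures can be sketched with scikit-learn's splitters (the classifier and dataset are illustrative assumptions):

```python
# Illustrative k-fold and leave-one-out cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold_scores = cross_val_score(model, X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=0))
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())  # N train/test iterations
print(kfold_scores.mean(), loo_scores.mean())
```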
2.7.3 Bootstrapping:
This technique is based on the process of sampling with replacement. In the earlier techniques, whenever a sample was drawn from the dataset to form a training or test set, it was never replaced; in other words, the same instance, once chosen, could not be chosen again. However, most learning techniques can employ an instance several times, and it affects the learning outcome if the instance is available in the training set more than once. The concept of bootstrapping is to sample the dataset with replacement so as to form a training set and a test set. There are many bootstrap techniques. The most popular one is the 0.632 bootstrap.
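A minimal sketch of the bootstrap sampling step (the dataset size is arbitrary): in expectation, roughly 63.2% of the distinct instances appear in the training sample, which is where the 0.632 bootstrap gets its name:

```python
# Illustrative bootstrap sample: draw N indices with replacement.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
train_idx = rng.integers(0, N, size=N)             # sampling with replacement
test_idx = np.setdiff1d(np.arange(N), train_idx)   # instances never drawn

print(len(np.unique(train_idx)) / N)   # ~0.632: fraction of distinct training instances
print(len(test_idx) / N)               # ~0.368: fraction held out for testing
```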

2.8 Metrics for assessing regression:


A function,
f : X → Y; f(x) = y
maps x ∈ X to y ∈ Y, where X is the input space and Y is the output space of the function. We assume here that X ⊂ Rn and Y ⊂ R. The data is available as samples (x, y), where the distribution of the inputs x and the function f are both unknown.
The task is to find a model h(x) that explains the underlying data, i.e., h(x) ≈ y for all samples (x, y). Equivalently, the task is to approximate the function f(x), whose properties are unknown, by h(x).
The term used in statistics for the functional description of data is regression. Also, since h(x) predicts y ∈ R for a given x, the term numeric prediction is also in use. The three terms, function approximation, numeric prediction, and regression, are used interchangeably without any discrimination.
Estimating the error in prediction using the holdout and random subsampling, cross-validation, and bootstrap methods are common techniques for assessing the accuracy of a predictor. Several alternative metrics can be used to assess the accuracy of numeric prediction.
2.8.1 Mean Square Error:
Mean Square Error (MSE) is the principal and most commonly used metric. For calculating MSE, we assume that no statistical information on the data is available; the mean is obtained from the training data as an arithmetic average:

MSE = (1/N) Σi=1..N [y(i) – h(x(i))]2

This measures the average squared deviation of the predicted values from the true values.
Root Mean Square Error:
Taking the square root of MSE,

RMSE = √[(1/N) Σi=1..N (y(i) – h(x(i)))2]

RMSE relates more directly to individual errors; it has the same dimensions as the predicted value itself.
Sum-of-Error Squares:
Sometimes the total error, and not the average, is used for mathematical manipulation by some statistical / machine learning techniques:

SSE = Σi=1..N [y(i) – h(x(i))]2

(Expected) Mean Square Error:
In the general probabilistic setting, when input and output variables are random variables, the following metric is used:

MSE = E{[y – h(x)]2}

where E is the statistical expectation operator.

2.8.2 Mean Absolute Error:


Given the N samples (x(i), y(i)), an intuitive measure to assess the quality of a mode h(x, w) is the Mean
Absolute Error (MAE) computed as,

This function measures the average deviation of the predicted value from the true value. In MAE, variance of
the model error is not accounted for. Suppose model h1(w, x) depicts small errors over the full range of the
data. They may as will have the same MAE. When each data point possesses the same significance, mode
h2(w, x) may be preferred as its errors possess lower variability. The amount of variability is considered by
MSE criterion.
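A short sketch computing these regression metrics on made-up values:

```python
# Illustrative computation of MSE, RMSE, and MAE on toy values.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)    # average squared deviation
rmse = np.sqrt(mse)                      # same units as y
mae = np.mean(np.abs(y_true - y_pred))   # average absolute deviation
print(mse, rmse, mae)                    # 0.375 0.612... 0.5
```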

2.9 Metrics for Assessing Classification:


In machine learning, the focus is traditionally on assigning a label to a given input pattern. In statistics, discriminant analysis was introduced for this purpose. Referred to as classification, it aims to allocate each input pattern to one of the given classes. These areas have evolved to become progressively similar through the integration and cross-development of ideas.
The basic principle is to use an independent test dataset, instead of the training set, to evaluate performance; the holdout technique and cross-validation are equally applicable to classification. However, the basic quality measures used for error estimates in numeric prediction are no longer appropriate. Errors in numeric prediction come in various sizes, whereas in classification, an error simply either exists or is absent. Several different measures can be used to assess the accuracy of a classifier.
2.9.1 Misclassification Error:
The metric for assessing the accuracy of a classification algorithm is the fraction of samples misclassified by the model h(w, x). For example, for binary classification problems,

Error = (1/N) Σi=1..N 1[h(w, x(i)) ≠ y(i)]

where 1[.] is 1 when the prediction differs from the true label and 0 otherwise. This accuracy measure works well for situations where the class tuples are more or less evenly distributed. However, when the classes are unbalanced, decisions based on the misclassification error lead to poor performance.

2.9.2 Confusion matrix:


In the field of machine learning, and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.
It allows easy identification of confusion between classes, e.g. one class being commonly mislabelled as the other. Most performance measures are computed from the confusion matrix.
For a two-class problem, the matrix has the following general form:

                     Predicted: Positive   Predicted: Negative
Actual: Positive             TP                     FN
Actual: Negative             FP                     TN

Here,
 Class 1 : Positive
 Class 2 : Negative
Definition of the Terms:
 Positive (P) : Observation is positive (for example: is an apple).
 Negative (N) : Observation is not positive (for example: is not an apple).
 True Positive (TP) : Observation is positive, and is predicted to be positive.
 False Negative (FN) : Observation is positive, but is predicted negative.
 True Negative (TN) : Observation is negative, and is predicted to be negative.
 False Positive (FP) : Observation is negative, but is predicted positive.

Classification Rate/Accuracy:
Classification rate, or accuracy, is given by the relation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

However, there are problems with accuracy. It assumes equal costs for both kinds of errors. A 99% accuracy can be excellent, good, mediocre, poor or terrible depending upon the problem.
Recall: Recall can be defined as the ratio of the total number of correctly classified positive examples to the total number of positive examples:

Recall = TP / (TP + FN)

High recall indicates the class is correctly recognized (a small number of FN).
Precision:
To get the value of precision, we divide the total number of correctly classified positive examples by the total number of predicted positive examples:

Precision = TP / (TP + FP)

High precision indicates that an example labelled as positive is indeed positive (a small number of FP).
High recall, low precision: This means that most of the positive examples are correctly recognized (low FN), but there are a lot of false positives.
Low recall, high precision: This shows that we miss a lot of positive examples (high FN), but those we predict as positive are indeed positive (low FP).
F-measure: Since we have two measures (precision and recall), it helps to have a single measurement that represents both of them. We calculate an F-measure, which uses the harmonic mean in place of the arithmetic mean, as the harmonic mean punishes extreme values more:

F-measure = (2 × Recall × Precision) / (Recall + Precision)

The F-measure will always be nearer to the smaller of precision and recall.
Let's consider an example now, in which we have infinitely many data elements of class B and a single element of class A, and the model predicts class A for all the instances in the test data. Here (taking class A as positive),

Precision : 0.0
Recall : 1.0

Now:
Arithmetic mean: 0.5
Harmonic mean: 0.0

Taking the arithmetic mean, the model would appear to be 50% correct, despite this being the worst possible outcome! Taking the harmonic mean, the F-measure is 0.
Example to interpret a confusion matrix:
To simplify the discussion, consider a test set whose confusion matrix, with all the terms (TP, FP, etc.) and the row and column totals filled in, is:

                  Predicted: Yes   Predicted: No   Total
Actual: Yes          TP = 100          FN = 5       105
Actual: No           FP = 10          TN = 50        60
Total                   110               55         165

Now,
Classification Rate/Accuracy:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (100 + 50) / (100 + 5 + 10 + 50) ≈ 0.91
Recall: Recall gives us an idea of how often the model predicts yes when it is actually yes.
Recall = TP / (TP + FN) = 100 / (100 + 5) ≈ 0.95
Precision: Precision tells us how often the model is correct when it predicts yes.
Precision = TP / (TP + FP) = 100 / (100 + 10) ≈ 0.91
F-measure:
F-measure = (2 × Recall × Precision) / (Recall + Precision) = (2 × 0.95 × 0.91) / (0.95 + 0.91) ≈ 0.93
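These figures can be checked with scikit-learn's metric functions (a sketch; the label arrays are simply a compact encoding of the counts in the matrix above):

```python
# Verify the worked example: encode 100 TP, 5 FN, 10 FP, 50 TN as label arrays.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([1] * 105 + [0] * 60)            # 105 actual positives, 60 negatives
y_pred = np.array([1] * 100 + [0] * 5              # positives: 100 TP, 5 FN
                  + [1] * 10 + [0] * 50)           # negatives: 10 FP, 50 TN

print(round(accuracy_score(y_true, y_pred), 2))    # 0.91
print(round(recall_score(y_true, y_pred), 2))      # 0.95
print(round(precision_score(y_true, y_pred), 2))   # 0.91
print(round(f1_score(y_true, y_pred), 2))          # 0.93
```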

2.9.3 Comparing Classifiers Based on ROC Curves:


The ROC Curve was first used during World War II for the analysis of radar signals. After the attack on
Pearl Harbor, the US army began new research to improve the rate of detection of Japanese aircraft from their
radar signals. Needless to say, they did not want to miss any of them. Neither did they wish to waste their
resources on false alarms. They measured the ability of a radar receiver operator to make these predictions, calling it the Receiver Operating Characteristic; that is the origin of the name. The purpose of the curve was similar to how we use it to improve our machine learning models now: to analyse the predictive power of the predictor in detecting as many true positives as possible while minimizing false positives.
The ROC curve is a plot of the True Positive Rate (TPR) on the y-axis versus the False Positive Rate (FPR) on the x-axis:
TPR = Sensitivity = TP / (TP + FN)
FPR = 1 − Specificity = FP / (FP + TN)
It is easiest to understand ROC curves in their original form, TPR vs FPR.
For example, suppose we have a binary classification model that predicts the probability of an observation belonging to class YES or class NO. We choose a probability threshold of 0.5: observations with probability >= 0.5 are assigned to class YES, and those with probability < 0.5 to class NO, yielding a confusion matrix at that threshold.
Notice that TPR and FPR are derived from two separate rows of the confusion matrix. This ensures that the class distribution in the data (prevalence) does not affect the curve. The YESs and the NOs are looked at separately.

The True Positive Rate looks at the actual YESs and the False Positive Rate looks at the actual NOs. So, even though the ROC plot contains information about both classes (YES and NO), these groups do not interact. Each value, and therefore the curve itself, does not change even if the prevalence/distribution of classes in the test data varies.
Purpose of ROC Curve
a) Purpose 1 — Analysing the strength/predictive power of a classifier
The job of our classification model is to assign higher probabilities to observations that belong to class
YES and lower probabilities to observations that belong to class NO. Basically, if there is a substantial
distinction in the probabilities assigned to observations of either class, we will be able to define a threshold
and predict the two classes with high confidence. The ROC curve of a good classifier is closer to the top left
of the graph. WHY? Because in a good classifier, we can achieve high TPR at low FPR.

In a typical illustration of this, a green curve shows the distribution of class YES and a red curve the distribution of class NO over the probability assigned to each observation by the classifier. Using a threshold of 0.5, we are able to classify most observations correctly; there is a slight overlap of the distributions, and the overlapping cases are the ones that get misclassified at 0.5. If we move the threshold to 0.6, we will be able to classify almost 85% of the true positives correctly, with zero false positives. As we lower the threshold from 0.6 to 0.5, the TPR will increase, but so will the FPR.
A colour-coded ROC curve can be used to see each threshold along the curve, with a colour bar indicating the threshold value at each colour.

b) Purpose 2 — determining optimal threshold


We have a classification model that predicts the probability that an observation belongs to Class YES.
This probability can range from 0 to 1. The default threshold of 0.5 that is used to determine the class of this
observation is not always the best threshold. The ROC curve helps us find the threshold where the TPR is high
and FPR is low i.e. misclassifications are low. Therefore, ROC curves should be used to determine the
optimal probability threshold for a classification model.
c) Purpose 3 — Comparing two models (using Area Under the Curve)
In an ROC curve plot, the diagonal represents the baseline model/random classifier. The closer an ROC curve lies to the 45-degree diagonal of the ROC space, the less powerful the model; the larger the area under the curve (AUC), the better the model.
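A brief sketch of comparing two classifiers by AUC (the models and synthetic data are illustrative assumptions):

```python
# Illustrative model comparison via ROC AUC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(max_depth=3)):
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]                   # probability of class YES
    print(type(model).__name__, roc_auc_score(y_te, scores))   # larger AUC = better
```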
