Machine Learning-2
Unit II: Supervised Learning: Rationale and Basics: Learning from Observations, Bias and Why Learning
Works: Computational Learning Theory, Occam's Razor Principle and Overfitting Avoidance, Heuristic
Search in Inductive Learning, Estimating Generalization Errors, Metrics for Assessing Regression, Metrics
for Assessing Classification.
(i) If the shape of the object is rounded with a depression at the top and its colour is red, then it will be labelled as Apple.
(ii) If the shape of the object is a long curving cylinder and its colour is green-yellow, then it will be labelled as Banana.
Now suppose that, after training on the data, you are given a new fruit from the basket, say a banana, and asked to identify it.
Since the machine has already learned from the previous data, it now has to use that knowledge wisely. It will first classify the fruit by its shape and colour, confirm the fruit name as BANANA, and put it in the Banana category. Thus the machine learns from the training data (the basket containing fruits) and then applies the knowledge to the test data (the new fruit).
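As an illustration, the two labelling rules above can be written as a tiny rule-based classifier in Python (a minimal sketch; the function name and the string encodings of shape and colour are hypothetical):

def classify_fruit(shape, colour):
    # Rule (i): rounded shape with a depression at the top, red colour -> Apple
    if shape == "rounded with depression at top" and colour == "red":
        return "Apple"
    # Rule (ii): long curving cylinder, green-yellow colour -> Banana
    if shape == "long curving cylinder" and colour == "green-yellow":
        return "Banana"
    return "unknown"

# The test case from the text: a new fruit drawn from the basket.
print(classify_fruit("long curving cylinder", "green-yellow"))  # -> Banana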
Supervised learning is classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or
“disease” and “no disease”.
Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.
Inductive learning:
The task of inductive learning is this:
Given a collection of examples (x(i), f(x(i))), i = 1, 2, ..., N, of a function f(x), return a function h(x) that
approximates f(x).
In the statistical literature, the approximating function h(x) is called the hypothesis function. The true function f(x) correctly maps the input space X to the output space Y.
This function is not known in a real-world decision-making problem. Our satisfaction with a hypothesis function h(x) that has high accuracy on the given collection of examples of f(x) could be premature, because the central aim of designing a machine is to suggest decisions when presented with novel patterns.
From a conceptual point of view, it is not easy to tell whether any particular h(.) is a good approximation of f(.). A good approximation will generalize well; that is, it will predict novel patterns correctly. Generalization performance is, thus, the fundamental problem in inductive learning.
How do we judge the generalization performance of an algorithm? We can estimate such performance by means of a test dataset, sampled independently of the training set. The off-training-set error (the error on points not in the training set) will be used as a measure of generalization performance.
The assumption in inductive learning is that the ideal hypothesis regarding unseen patterns is the one induced from the observed training data. Inductive learning techniques obtain a general hypothesis by seeking empirical regularities over the training examples. These regularities yield an approximation of the mapping function that fits the observed examples well. The learner generalizes from the specific training examples, hypothesizing a general function that covers these examples.
Inductive learning, also known as discovery learning, is a process where the learner discovers rules by observing examples. This differs from deductive learning, where students are given rules that they then need to apply.
We can often work out rules for ourselves by observing examples to see if there is a pattern, that is, to see if things regularly happen in the same way. We then try applying the rules in different situations to see if they work. In inductive language learning, tasks are designed specifically to guide the learner and assist them in discovering a rule.
Example: when children are first shown how to ride a bicycle, it is not possible for them to cycle unaided immediately. Parents guide and assist their children until they have gained the confidence and skills that enable them to ride on their own.
The bias measures the extent to which the average of the hypothesis function differs from the desired function f(x). The variance measures the extent to which the hypothesis function h(x; D) is sensitive to the particular selection of the training dataset D.
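For a squared-error loss, these two quantities combine in the standard bias-variance decomposition, written below in LaTeX notation (D denotes a particular training dataset and E_D the expectation over datasets):

E_D\big[(h(x;D) - f(x))^2\big]
  = \big(E_D[h(x;D)] - f(x)\big)^2 + E_D\big[(h(x;D) - E_D[h(x;D)])^2\big]
  = \text{bias}^2 + \text{variance}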
Data: A given set of examples from which we are trying to learn a target concept.
Target concept (aka rule): This is the hidden pattern we are trying to learn from the data for example
— “We go out for lunch IF it is sunny AND we got friends OR IF favourite cuisine restaurant is
nearby AND we got friends AND even IF it is NOT sunny”. This kind of rule is written in terms of
statements belonging to propositional logic, i.e. truth values combined with AND, OR, NOT.
Hypothesis space: This is a set of hypotheses from which we hope to discover the target concept. In
this example, the ‘intelligent’ choice of hypotheses should look like the logically connected phrase we
wrote above as the target concept. But they could be simpler or different looking such as,
IF it is sunny
IF we got friends OR it is sunny
These two hypotheses are not rich enough to capture the true target concept and they will have
errors if applied to the given data.
Why?
Because they are not of the form “conjunction of disjunctive statements” i.e. (X1 & X2) | (X3 & X4) |
(X5 & !X6). They are too simplistic and general, and unable to capture all the nuances of the data presented. How well we can search for a target concept in a machine learning task depends largely on the richness and complexity of the hypothesis space we choose to work with.
Predictor and response variables: These are self-explanatory. The ‘Sunny?’, ‘Favourite restaurant
distance’ and ‘Got friends?’ are predictor variables and ‘Go out for lunch’ is the response variable. If
you read the conditions satisfied by the edges of the tree starting from the root and down to a leaf, you
will automatically get propositional logic statements.
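To make this concrete, here is a minimal Python sketch (all names are illustrative, not from any library) that compares the target concept with the two weaker hypotheses above over every combination of truth values:

from itertools import product

def target(sunny, friends, nearby):
    # (sunny AND friends) OR (nearby AND friends AND NOT sunny)
    return (sunny and friends) or (nearby and friends and not sunny)

def h1(sunny, friends, nearby):
    return sunny                   # "IF it is sunny"

def h2(sunny, friends, nearby):
    return friends or sunny        # "IF we got friends OR it is sunny"

cases = list(product([False, True], repeat=3))
print(sum(h1(*c) != target(*c) for c in cases), "disagreements for h1")
print(sum(h2(*c) != target(*c) for c in cases), "disagreements for h2")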
For the most part, computational learning theory (CLT) focuses not on individual learning algorithms, but rather on broad classes of learning algorithms characterized by the hypothesis spaces they consider, the presentation of training examples, etc.
The key quantities it tries to reason about are,
Sample complexity: How many training examples are needed for a learner
to converge (with high probability) to a successful hypothesis?
Computational complexity: How much computational effort is needed for a learner to converge (with
high probability) to a successful hypothesis?
Mistake bound: How many training examples will the learner misclassify before converging to a successful hypothesis?
Therefore, we will require only that the learner's error be bounded by some constant ε that can be made arbitrarily small. Moreover, we will not require that the learner succeed for every sequence of randomly drawn training examples; instead, our demand will only be that its probability of failure be bounded by some constant δ that can be made arbitrarily small.
Sample complexity bound:
The bound is given by

N ≥ (1/ε) (ln |H| + ln (1/δ))

where N is the number of training examples, |H| is the size of the hypothesis space, ε is the upper bound on the error of the learned hypothesis, and δ is the upper bound on the probability of failure. This is called the Haussler bound. This inequality readily demonstrates the following facts, which we experience in practice for any machine learning task:
We need a larger number of training examples to reduce the test/generalization error.
The algorithm needs to ‘see’ a larger number of training examples to increase its confidence in learning, i.e. to reduce the probability of mistake.
A richer and bigger hypothesis space means a larger space to search through in the quest for the correct hypothesis, and that needs a larger number of training examples.
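A minimal Python sketch of this bound (the function name haussler_bound is hypothetical; it simply evaluates the inequality above):

import math

def haussler_bound(h_size, eps, delta):
    # Smallest N satisfying N >= (1/eps) * (ln|H| + ln(1/delta)).
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# A bigger hypothesis space, or tighter eps/delta, demands more examples:
print(haussler_bound(h_size=2**10, eps=0.1, delta=0.05))  # 100
print(haussler_bound(h_size=2**20, eps=0.1, delta=0.05))  # 169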
Overfitting:
A statistical model is said to be overfitted when we train it with a lot of data. When a model gets trained with so much data, it starts learning from the noise and the inaccurate data entries in our dataset. The model then fails to categorize the data correctly, because of too much detail and noise. The causes of
overfitting are the non-parametric and non-linear methods because these types of machine learning algorithms
have more freedom in building the model based on the dataset and therefore they can really build unrealistic
models. A solution to avoid overfitting is using a linear algorithm if we have linear data or using the
parameters like the maximal depth if we are using decision trees.
The above graph depicts the effect of overfitting in a typical application of machine learning. The horizontal
axis of the plot shows how complex the classifier is. The vertical axis is an indicator of how accurate the
forecasts made by the classifier are. The solid line depicts how accurate the classifier is over the training examples, while the broken line shows the accuracy measured over an independent set of test examples. Predictably, the
accuracy of the classifier over training examples increases monotonically as the classifier becomes more
complex. But, the accuracy over the independent test examples first grows, then reduces.
How to avoid Overfitting:
The commonly used methodologies are:
Cross-Validation: A standard way to find out-of-sample prediction error is to use 5-fold cross-validation (see the sketch after this list).
Early Stopping: Its rules provide guidance as to how many iterations can be run before the learner begins to overfit.
Pruning: Pruning is extensively used while building tree-based models. It simply removes the nodes which add little predictive power for the problem in hand.
Regularization: It introduces a cost term for bringing in more features with the objective function. Hence it tries to push the coefficients for many variables to zero and thus reduce the cost term.
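As a sketch of the cross-validation idea referenced in the list above, assuming scikit-learn is available (the dataset here is synthetic, used only as a placeholder):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5-fold cross-validation: five train/validate splits, five accuracy scores.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores)
print("estimated out-of-sample accuracy:", scores.mean())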
Underfitting:
A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data. Underfitting destroys the accuracy of our machine learning model. Its occurrence simply means that our model or algorithm does not fit the data well enough. It usually happens when we have too little data to build an accurate model, and also when we try to build a linear model with non-linear data. In such cases the rules of the machine learning model are too simple and flexible to be applied to such minimal data, and therefore the model will probably make a lot of wrong predictions. Underfitting can be avoided by using more data and also by reducing the number of features through feature selection.
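A brief numpy sketch of underfitting (synthetic quadratic data; the values are arbitrary): fitting a straight line to non-linear data leaves a much larger training error than a quadratic fit.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(scale=0.5, size=x.size)   # non-linear (quadratic) data

for degree in (1, 2):
    coeffs = np.polyfit(x, y, deg=degree)        # least-squares polynomial fit
    mse = np.mean((y - np.polyval(coeffs, x))**2)
    print("degree", degree, "training MSE:", round(mse, 3))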
The following are the principal techniques used in heuristic search to optimize hypothesis complexity for a given training dataset.
Regularization: The regularization model promotes smoother functions by creating a new criterion function
that relies not only on the training error, but also on algorithmic intricacy. In particular, the new criterion function punishes extremely complex hypotheses; seeking the minimum of this criterion balances the error on the training set against the complexity. Formally, it is possible to write the new criterion as a sum of the error on the training set plus a regularization term, which depicts constraints or sought-after properties of the solution:

E' = error on data + λ * hypothesis complexity,

where λ gives the weight of the penalty term (Ω).
The second term penalizes complex hypotheses with large variance. When we minimize the augmented error function instead of the error on the data only, we penalize complex hypotheses and thus decrease variance. When λ is taken too large, only very simple functions are allowed and we risk introducing bias. λ is optimized using cross-validation.
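A minimal sketch of this scheme using ridge (L2) regression in scikit-learn, where the penalty weight (called alpha in the library, playing the role of λ) is chosen by cross-validation as described above; the data is synthetic:

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# RidgeCV evaluates each candidate penalty weight by cross-validation
# and keeps the one with the best validation score.
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]).fit(X, y)
print("chosen penalty weight:", model.alpha_)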
Early Stopping: The training of a learning machine corresponds to an iterative decrease in the error function defined over the training data. During a specific training session, this error generally decreases as a function of the number of iterations in the algorithm. Stopping the training before attaining the minimum training error represents a technique of restricting the effective hypothesis complexity.
Pruning: An alternative solution that is sometimes more successful than early stopping the growth of the hypothesis is pruning the fully grown hypothesis that is likely to be overfitting the training data. Pruning is the basis of search in many decision-tree algorithms; the weakest branches of a large tree overfitting the training data, which hardly reduce the error rate, are removed.
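A minimal sketch of pruning via scikit-learn's cost-complexity pruning for decision trees (the ccp_alpha value used here is an arbitrary illustration, not a recommended setting):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A fully grown tree versus one whose weakest branches are pruned away.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)
print("full tree test accuracy:  ", full.score(X_te, y_te))
print("pruned tree test accuracy:", pruned.score(X_te, y_te))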
2.8 Metrics for Assessing Regression:
The true function f maps x ∈ X to y ∈ Y, where X is the input space and Y is the output space of the function. We assume here that X is a subset of R^n and Y is a subset of R. The data is available as samples (x, y), where the distribution of the inputs x and the function f are both unknown.
The task is to find the model h(x) that explains the underlying data, i.e., h(x) = y for all samples (x, y). Equivalently, the task is to approximate the function f(x), whose properties are unknown, by h(x).
The term used in statistics for a functional description of data is regression. Also, since h(x) predicts y ∈ R for a given x, the term numeric prediction is also in use. The three terms function approximation, numeric prediction, and regression are used interchangeably, without any discrimination.
Holdout and random subsampling, cross-validation, and bootstrap methods are common techniques for estimating the error in prediction and assessing the accuracy of the predictor. Several alternative metrics can be used to assess the accuracy of numeric prediction.
2.8.1 Mean Square Error:
Mean Square Error (MSE) is the principal and most commonly used metric. For calculating MSE, we assume that no statistical information on the data is available; the mean is obtained from the training data as the arithmetic average:

MSE = (1/N) Σ (y(i) − h(x(i)))^2, with the sum taken over i = 1, ..., N

This measures the average squared deviation of the predicted values from the true values.
Root Mean Square Error:
Taking the square root of MSE,

RMSE = √MSE

RMSE relates more clearly to individual errors; it has the same dimensions as the predicted value itself.
Sum-of-Error Squares:
Sometimes the total error, and not the average, is taken for mathematical manipulation by some statistical / machine learning techniques:

SSE = Σ (y(i) − h(x(i)))^2, with the sum taken over i = 1, ..., N
Mean Absolute Error:

MAE = (1/N) Σ |y(i) − h(x(i))|, with the sum taken over i = 1, ..., N

This function measures the average absolute deviation of the predicted value from the true value. In MAE, the variance of the model error is not accounted for. Suppose model h1(w, x) shows small errors over most of the data but large errors on a few points, while model h2(w, x) shows moderate errors over the full range of the data; the two may well have the same MAE. When each data point possesses the same significance, model h2(w, x) may be preferred, as its errors possess lower variability. This amount of variability is captured by the MSE criterion.
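All four regression metrics above can be computed directly with numpy; a minimal sketch on made-up values:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # y(i): true values
y_pred = np.array([2.8, 5.4, 2.0, 8.0])   # h(x(i)): model predictions

err = y_true - y_pred
print("MSE: ", np.mean(err**2))
print("RMSE:", np.sqrt(np.mean(err**2)))
print("SSE: ", np.sum(err**2))
print("MAE: ", np.mean(np.abs(err)))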
Metrics for Assessing Classification:
The misclassification error, i.e. the proportion of examples assigned to the wrong class, works well for situations where the class tuples are more or less evenly distributed. However, when the classes are unbalanced, decisions made on classifications based on the misclassification error lead to poor performance.
Here,
Class 1 : Positive
Class 2 : Negative
Definition of the Terms:
Positive (P) : Observation is positive (for example: is an apple).
Negative (N) : Observation is not positive (for example: is not an apple).
True Positive (TP) : Observation is positive, and is predicted to be positive.
False Negative (FN) : Observation is positive, but is predicted negative.
True Negative (TN) : Observation is negative, and is predicted to be negative.
False Positive (FP) : Observation is negative, but is predicted positive.
Classification Rate/Accuracy:
Classification Rate or Accuracy is given by the relation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

However, there are problems with accuracy. It assumes equal costs for both kinds of errors. A 99% accuracy can be excellent, good, mediocre, poor, or terrible depending upon the problem.
Recall: Recall can be defined as the ratio of the total number of correctly classified positive examples to the total number of positive examples:

Recall = TP / (TP + FN)

High Recall indicates the class is correctly recognized (a small number of FN).
Precision:
To get the value of precision, we divide the total number of correctly classified positive examples by the total number of predicted positive examples:

Precision = TP / (TP + FP)

High Precision indicates an example labelled as positive is indeed positive (a small number of FP).
High recall, low precision: This means that most of the positive examples are correctly recognized (low FN)
but there are a lot of false positives.
Low recall, high precision: This shows that we miss a lot of positive examples (high FN), but those we predict as positive are indeed positive (low FP).
F-measure: Since we have two measures (Precision and Recall), it helps to have a measurement that represents both of them. We calculate the F-measure, which uses the Harmonic Mean in place of the Arithmetic Mean, as it punishes extreme values more:

F-measure = (2 * Recall * Precision) / (Recall + Precision)

The F-measure will always be nearer to the smaller value of Precision or Recall.
Let’s consider an example now, in which we have infinite data elements of class B and a single element of
class A and the model is predicting class A against all the instances in the test data.
Here,
Precision : 0.0
Recall : 1.0
Now:
Arithmetic mean: 0.5
Harmonic mean: 0.0
When taking the arithmetic mean, the model would appear 50% correct, despite this being the worst possible outcome! Taking the harmonic mean instead, the F-measure is 0.
Example to interpret a confusion matrix:
For simplification, the confusion matrix below is written out with all the terms (TP, FP, etc.) and the row and column totals:

                Predicted: NO    Predicted: YES    Total
Actual: NO      TN = 50          FP = 10            60
Actual: YES     FN = 5           TP = 100           105
Total           55               110                165
Now,
Classification Rate/Accuracy:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (100 + 50) / (100 + 5 + 10 + 50) = 0.91
Recall: Recall tells us, when it is actually yes, how often the model predicts yes.
Recall = TP / (TP + FN) = 100 / (100 + 5) = 0.95
Precision: Precision tells us, when the model predicts yes, how often it is correct.
Precision = TP / (TP + FP)=100/ (100+10) = 0.91
F-measure:
F-measure = (2 * Recall * Precision) / (Recall + Precision) = (2 * 0.95 * 0.91) / (0.95 + 0.91) = 0.93
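A minimal Python sketch reproducing these calculations from the confusion-matrix counts:

# Counts from the confusion matrix above.
TP, FN, FP, TN = 100, 5, 10, 50

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)
precision = TP / (TP + FP)
f_measure = 2 * recall * precision / (recall + precision)
print(round(accuracy, 2), round(recall, 2), round(precision, 2), round(f_measure, 2))
# -> 0.91 0.95 0.91 0.93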
Notice in the figure below that TPR and FPR are derived from two separate rows of the confusion matrix.
This ensures that the class distribution in the data (prevalence) does not affect the curve. The YESs and the
NOs are being looked at separately.
The True Positive Rate (TPR = TP / (TP + FN)) looks at the actual YESs, and the False Positive Rate (FPR = FP / (FP + TN)) looks at the actual NOs. So, even though the ROC plot contains information about both classes (YES and NO), these groups don't interact. Each value, and therefore the curve itself, does not change even if the prevalence/distribution of classes in the test data varies.
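A minimal sketch, assuming scikit-learn, that computes the (FPR, TPR) pairs of an ROC curve for made-up labels and scores; each pair corresponds to one candidate threshold:

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])                     # actual classes
y_score = np.array([0.1, 0.3, 0.6, 0.4, 0.55, 0.8, 0.9, 0.7])   # model probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")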
Purpose of ROC Curve
a) Purpose 1 — Analysing the strength/predictive power of a classifier
The job of our classification model is to assign higher probabilities to observations that belong to class
YES and lower probabilities to observations that belong to class NO. Basically, if there is a substantial
distinction in the probabilities assigned to observations of either class, we will be able to define a threshold
and predict the two classes with high confidence. The ROC curve of a good classifier is closer to the top left
of the graph. WHY? Because in a good classifier, we can achieve high TPR at low FPR.
The green curve shows the distribution of Class YES and the red curve shows distribution of Class NO over
the probability assigned to each observation by the classifier. Using threshold 0.5 we are able to classify most
observations correctly. There is a slight overlap of the distributions. The overlapping cases are the ones that
get misclassified at 0.5. If we move the threshold to 0.6, we will be able to classify almost 85% of the true positives correctly, with 0 false positives. As we lower the threshold from 0.6 to 0.5, the TPR will increase, but so will the FPR.
We can use a colour-coded ROC curve like the one below to see each threshold on the curve. The colour bar on the right indicates the threshold value at each colour.