Week 5


CHAPTER 4: Parametric Methods

MACHINE LEARNING
Dr. SAEED UR REHMAN
Department of Computer Science,
COMSATS University Islamabad, Wah Campus

Courtesy: ETHEM ALPAYDIN © The MIT Press

Summary
 Supervised learning is an important type of ML application. It requires labelled training and testing data.
 Making decisions under uncertainty has a long history; reasoning from meaningful evidence using probability theory is only a few hundred years old.
 Association rules are successfully used in many data mining applications, and we see such rules on many Web sites that recommend books, movies, music, and so on.
 The algorithm is very simple, and its efficient implementation on very large databases is critical.

Outline
Introduction
 What are parametric methods/algorithms
 Benefits and limitations of Parametric Algorithms
 Maximum Likelihood Estimation
 Bernoulli Density
 Multinomial Density
 Gaussian (Normal) Density
 Evaluating an Estimator:
 Bias and Variance
 Bias and Variance Examples
 Bias and Variance tradeoff

Outline (cont.)
 The Bayes’ Estimator
 Parametric Classification
 Regression
 Tuning Model Complexity: Bias/Variance Dilemma
 Model Selection Procedures

Parametric Estimation…Intro
A statistic is any value that is calculated from a given sample.
In statistical inference, we make a decision using the information provided by a
sample.
In parametric estimation, we assume that the sample is drawn from some
distribution that obeys a known model, for example, Gaussian.
The advantage of the parametric approach is that the model is defined up to a
small number of parameters—for example, mean, variance—the sufficient
statistics of the distribution.
Once those parameters are estimated from the sample, the whole distribution
is known.

Parametric Estimation…Intro
X = {x^t}_{t=1}^N where x^t ~ p(x)
Parametric estimation:
Assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²) where θ = {μ, σ²}

Parametric Estimation…Intro
Assumptions can greatly simplify the learning process, but can
also limit what can be learned.
Algorithms that simplify the function to a known form are called
parametric machine learning algorithms.
“A learning model that summarizes data with a set of parameters
of fixed size (independent of the number of training examples) is
called a parametric model. No matter how much data you throw
at a parametric model, it won’t change its mind about how many
parameters it needs.”
— Artificial Intelligence: A Modern Approach, page 737
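As a quick illustrative sketch of this definition (the data and the straight-line model below are assumptions for illustration, not from the lecture): a linear model keeps exactly two parameters no matter how much data it sees.

```python
import numpy as np

# Sketch: fit a simple linear model y = w1*x + w0 to samples of different
# sizes. The number of fitted parameters stays fixed at 2, which is the
# defining property of a parametric model.
rng = np.random.default_rng(1)
for n in (10, 100, 10_000):
    x = rng.uniform(0, 1, n)
    y = 3.0 * x + 1.0 + rng.normal(0, 0.1, n)   # synthetic data (assumed)
    w = np.polyfit(x, y, deg=1)                 # returns [w1, w0]
    print(n, len(w))                            # parameter count is always 2
```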
Parametric ML Algos (Examples)
Logistic Regression
Linear Discriminant Analysis
Perceptron
Naive Bayes
Simple Neural Networks

Parametric ML Algorithms: Benefits and Limitations

Benefits:
Simpler: easier to understand and interpret results.
Speed: very fast to learn from data.
Less Data: require less training data and can work well even if the fit to the data is not perfect.

Limitations:
Constrained: these methods are highly constrained to the specified form.
Limited Complexity: not suitable for complex problems.
Poor Fit: in practice the methods are unlikely to match the underlying mapping function.

Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is a technique for estimating the parameters of a given distribution using some observed data.
For example, if a population is known to follow a normal distribution but
the mean and variance are unknown, MLE can be used to estimate them using a
limited sample of the population, by finding particular values of the mean and
variance so that the observation is the most likely result to have occurred.
MLE is useful in
Econometrics
 MRIs
Satellite imaging
Bayesian statistics

Maximum Likelihood Estimation (cont.)
Likelihood of θ given the sample X:

l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)

Log likelihood (the log converts the product into a sum, which simplifies calculation):

L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)

Maximum likelihood estimator (MLE):

θ* = argmax_θ L(θ|X)
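The following is a minimal sketch of these three steps for a Bernoulli sample (the data are simulated; the grid search is just for illustration, since the closed-form MLE is derived on the next slides):

```python
import numpy as np

# Sketch: evaluate the log likelihood L(theta|X) = sum_t log p(x^t|theta)
# on a grid of candidate thetas and take the argmax.
rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=100)       # sample from Bernoulli(p = 0.7)

thetas = np.linspace(0.01, 0.99, 99)     # candidate parameter values
log_lik = [np.sum(X * np.log(t) + (1 - X) * np.log(1 - t)) for t in thetas]

theta_star = thetas[np.argmax(log_lik)]  # theta* = argmax_theta L(theta|X)
print(theta_star, X.mean())              # grid argmax matches the sample mean
```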

Maximum Likelihood Estimation (cont.)
Let us now see some distributions that arise in the applications we
are interested in.
If we have a two-class problem, the distribution we use is the Bernoulli
(the single-trial case of the binomial).
When there are K > 2 classes, its generalization is the multinomial.
Gaussian (normal) density is the one most frequently used for
modeling class-conditional input densities with numeric input.
For these three distributions, we discuss the maximum likelihood
estimators (MLE) of their parameters.

Maximum Likelihood Estimation (cont.)
Example (figure)

Examples: Bernoulli/Multinomial
In a Bernoulli distribution, there are two outcomes:
An event occurs or it does not;
For example, an instance is a positive example of the class, or it is not.
If the event occurs, the Bernoulli random variable X takes the value 1 with
probability p; the nonoccurrence of the event has probability 1 − p, denoted
by X taking the value 0.
This is written as

P(X = x) = p^x (1 − p)^(1−x), x ∈ {0, 1}
Examples: Bernoulli
The expected value and variance can be calculated as
E[X] = p and Var(X) = p(1 − p)

Examples: Bernoulli/Multinomial
Bernoulli: Two states, failure/success, x ∈ {0, 1}
P(x) = p_0^x (1 − p_0)^(1−x)
L(p_0|X) = log ∏_t p_0^{x^t} (1 − p_0)^{1−x^t}
MLE: p_0 = ∑_t x^t / N

Practical Applications
The binomial distribution is applicable to most situations in which a specific
target result is known, by designating the target as "success" and anything
other than the target as "failure", for example coin tossing, dice rolling, etc.

Examples: Multinomial
Consider the generalization of Bernoulli where, instead of two states, the
outcome of a random event is one of K mutually exclusive and exhaustive
states.
Multinomial: K > 2 states, x_i ∈ {0, 1}

P(x_1, x_2, ..., x_K) = ∏_i p_i^{x_i}

L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^{x_i^t}

MLE: p_i = ∑_t x_i^t / N
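A minimal sketch of this estimator on simulated data (the true probabilities below are assumed values for illustration):

```python
import numpy as np

# Sketch: with one-hot outcomes x^t, the multinomial MLE p_i is simply
# the fraction of the N trials that landed in state i.
rng = np.random.default_rng(0)
K, N = 3, 1000
true_p = [0.2, 0.5, 0.3]                  # assumed ground truth

states = rng.choice(K, size=N, p=true_p)  # draw N outcomes in {0, ..., K-1}
X = np.eye(K)[states]                     # one-hot encode: x_i^t in {0, 1}
p_hat = X.sum(axis=0) / N                 # MLE: p_i = sum_t x_i^t / N
print(p_hat)                              # close to true_p
```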

Example: Gaussian (Normal) Distribution
The Gaussian distribution (also known as the normal distribution) is a
bell-shaped curve; it is assumed that measured values follow a normal
distribution, with an equal number of measurements above and below
the mean value.

Example: Gaussian (Normal) Distribution
In order to understand normal distribution, it is important to know
the definitions of “mean,” “median,” and “mode.”
The “mean” is the calculated average of all values, the “median” is
the value at the center point (mid-point) of the distribution, while
the “mode” is the value that was observed most frequently during
the measurement.
If a distribution is normal, then the values of the mean, median,
and mode are the same.
However, the values of the mean, median, and mode may differ if the
distribution is skewed (not a Gaussian distribution).
Gaussian (Normal) Distribution
p(x) = N(μ, σ²)

p(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))

MLE for μ and σ²:

m = ∑_t x^t / N

s² = ∑_t (x^t − m)² / N
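A minimal sketch of these two estimators on simulated data (the true μ and σ² are assumed values for illustration):

```python
import numpy as np

# Sketch: m is the sample mean and s^2 the divide-by-N sample variance,
# exactly as in the MLE formulas above.
rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, size=5000)   # sample from N(mu = 10, sigma^2 = 9)

m = x.sum() / len(x)                   # m = sum_t x^t / N
s2 = ((x - m) ** 2).sum() / len(x)     # s^2 = sum_t (x^t - m)^2 / N
print(m, s2)                           # ~10 and ~9; same as np.mean / np.var
```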
Bias and Variance
In supervised machine learning, training data is used by an algorithm for
learning. The goal of any such ML algorithm is to best estimate the
mapping function (f) for the output variable (Y) given the input data (X).
This f is often called the target function because it is the function that a
given supervised machine learning algorithm aims to approximate.
Bias is the set of simplifying assumptions made by a model to make the
target function easier to learn.
Generally, linear algorithms have a high bias, making them fast to learn
and easier to understand, but generally less flexible.
In turn, they have lower predictive performance on complex problems
that fail to meet the simplifying assumptions of the algorithm's bias.

Bias and Variance
Variance is the amount by which the estimate of the target function would
change if different training data were used.
The target function is estimated from the training data by a machine learning
algorithm, so we should expect the algorithm to have some variance.
Ideally, it should not change too much from one training dataset to the next,
meaning that the algorithm is good at picking out the hidden underlying
mapping between the input and output variables.
Machine learning algorithms that have a high variance are strongly influenced
by the specifics of the training data. This means that the specifics of the
training data influence the number and types of parameters used to
characterize the mapping function.

Bias and Variance Example

Image Courtesy: https://machinelearningmastery.com/

Bias and Variance Example

Image courtesy
https://medium.com/@ml.at.berkeley/machine-learning-crash-course-part-4-the-bias-variance-dilemma-a94e60ec1d3

Bias and Variance Example

Figure: underfitting vs. overfitting

Bias and Variance
Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i

Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]
Mean square error:
r(d, θ) = E[(d − θ)²]
        = (E[d] − θ)² + E[(d − E[d])²]
        = Bias² + Variance
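A small simulation of this decomposition (the distribution, sample size, and number of trials are assumptions for illustration):

```python
import numpy as np

# Sketch: draw many samples X_i from N(mu, sigma^2), compute the estimator
# d_i = d(X_i) (here, the sample mean) on each, and estimate bias, variance,
# and mean squared error empirically.
rng = np.random.default_rng(0)
mu, sigma, n, trials = 5.0, 2.0, 20, 10_000

d = np.array([rng.normal(mu, sigma, n).mean() for _ in range(trials)])
bias = d.mean() - mu                     # b_theta(d) = E[d] - theta
variance = ((d - d.mean()) ** 2).mean()  # E[(d - E[d])^2]
mse = ((d - mu) ** 2).mean()             # r(d, theta) = E[(d - theta)^2]
print(bias**2 + variance, mse)           # the two sides agree up to noise
```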
Bias and Variance
The prediction error of an ML algorithm can be broken down into three parts:
Bias Error
Variance Error
Irreducible Error

Bias Error
Low Bias: Suggests fewer assumptions about the form of the target function.
High Bias: Suggests more assumptions about the form of the target
function.

Bias and Variance
Examples of low-bias machine learning algorithms include: Decision Trees, k-
Nearest Neighbors and Support Vector Machines.
Examples of high-bias machine learning algorithms include: Linear
Regression, Linear Discriminant Analysis and Logistic Regression.
Variance Error
Low Variance: Suggests small changes to the estimate of the target function with
changes to the training dataset.
High Variance: Suggests large changes to the estimate of the target function with
changes to the training dataset.
Generally, nonlinear machine learning algorithms that have a lot of flexibility have a
high variance. For example, decision trees have a high variance, which is even higher
if the trees are not pruned before use.

Bias-Variance Trade-Off
The goal of any supervised machine learning algorithm is to achieve low bias and low
variance. In turn, the algorithm should achieve good prediction performance.
The parameterization of machine learning algorithms is often a battle to balance out bias
and variance.
Below are two examples of configuring the bias-variance trade-off for specific
algorithms (see the sketch after this list):
The k-nearest neighbors algorithm has low bias and high variance, but the trade-off can
be changed by increasing the value of k, which increases the number of neighbors that
contribute to the prediction and in turn increases the bias of the model.
The support vector machine algorithm has low bias and high variance, but the trade-off
can be changed by increasing the C parameter, which influences the number of violations
of the margin allowed in the training data; this increases the bias but decreases the
variance.
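A hedged sketch of the k-NN half of this trade-off (the dataset and cross-validation setup are illustrative assumptions, not from the lecture): as k grows, the fold-to-fold spread of scores tends to shrink (lower variance) while the average score can drop (higher bias).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Sketch: score k-NN for several values of k; the std of the fold scores
# is a rough proxy for variance, the mean for (inverse) bias.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
for k in (1, 5, 25, 101):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean().round(3), scores.std().round(3))
```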

Bias-Variance Trade-Off
To summarize bias and variance, we can say that there is no
escaping the relationship between bias and variance in ML:
Increasing the bias will decrease the variance.
Increasing the variance will decrease the bias.
There is a trade-off at play between these two concerns, and the
algorithms you choose and the way you choose to configure them
find different balances in this trade-off for your problem.

Summary
In statistical inference, we make a decision using the information
provided by a sample.
In parametric estimation, we assume that the sample is drawn from
some distribution that obeys a known model.
In a parametric model, all of the training instances affect the final
global estimate.
We saw how we can estimate these probabilities from a given training
set.
We started with the parametric approach for classification and regression.
We also took a look at the bias/variance dilemma and model selection
methods for trading off model complexity and empirical error.

References
This lecture is prepared from the following resources:
https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/gaussian-distribution
https://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/
Artificial Intelligence: A Modern Approach (4th Edition), Pearson Series in Artificial Intelligence
https://brilliant.org/wiki/maximum-likelihood-estimation-mle/
https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788830577/1/ch01lvl1sec15/bias-variance-trade-off
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning
https://study.com/academy/lesson/bayes-estimator-definition-examples.html
https://ch.mathworks.com/help/stats/introduction-to-parametric-classification.html

Next Lecture
Multivariate Methods
Introduction
 Parameter Estimation
 Estimation of Missing Values
 Multivariate Normal Distribution
 Multivariate Classification
 Tuning Complexity
 Discrete Features
 Multivariate Regression

Summary of Ch 1
 What Is Machine Learning?
 Examples of Machine Learning Applications
Learning Associations
Classification
Regression
Unsupervised Learning
Reinforcement Learning

Outline: Parametric Methods

Ch 2: Supervised Learning
 Supervised learning is the machine learning task of learning a function
that maps an input to an output based on example input-output pairs.
Therefore we need:
 Training data: a supervised learning algorithm analyzes the training
data and produces an inferred function, which can be used for mapping
new examples.
 This requires the learning algorithm to generalize from the training
data to unseen situations in a "reasonable" way.
Supervised Learning
Supervised [labelled data]
◦ Classification
◦ Regression

Unsupervised [unlabelled data]
◦ Clustering
◦ Association

Semi-supervised [some labelled data]

Reinforcement Learning [reward based]

Learning a Class from Examples
Class C of a “family car”
◦ Prediction: Is car x a family car?
◦ Knowledge extraction: What do people expect from a family
car?
Output:
Positive (+) and negative (–) examples
Input representation:
x1: price, x2: engine power

Training set X

X = {x^t, r^t}_{t=1}^N

r^t = 1 if x^t is positive
      0 if x^t is negative

x = [x1, x2]^T
Class C

(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)


Hypothesis class H

h(x) = 1 if h says x is positive
       0 if h says x is negative
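As a toy sketch of this hypothesis (the thresholds p1, p2, e1, e2 below are made-up values, not from the lecture):

```python
# Sketch: the axis-aligned rectangle hypothesis for the family-car class.
def h(price, engine_power, p1=15_000, p2=30_000, e1=80, e2=150):
    """Return 1 if h says x is positive (a family car), else 0."""
    return int(p1 <= price <= p2 and e1 <= engine_power <= e2)

print(h(20_000, 100))  # 1: inside the rectangle, so positive
print(h(40_000, 200))  # 0: outside, so negative
```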

Outcomes of Classification

Confusion Matrix
The confusion matrix visualizes the accuracy of a classifier by
comparing the actual and predicted classes. The binary confusion
matrix is composed of four squares:

https://www.guru99.com/confusion-matrix-machine-learning-example.html
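A minimal sketch of building one with scikit-learn (the labels below are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# Sketch: rows of the matrix are actual classes, columns are predicted
# classes, matching the convention described on the following slides.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()   # the binary case unpacks to TN, FP, FN, TP
print(cm)
print(tn, fp, fn, tp)         # 4 1 1 4
```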

Accuracy, Precision, Recall
Interpretation of Performance Measures
Accuracy tells you how often the ML model was correct overall.
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision is how good the model is at predicting a specific category.
Precision = TP / (TP + FP)
Recall tells you how many times the model was able to detect a
specific category.
Recall = TP / (TP + FN)
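A minimal sketch computing these three measures from the counts (using the made-up counts from the confusion-matrix sketch above):

```python
def metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

print(metrics(tp=4, tn=4, fp=1, fn=1))  # (0.8, 0.8, 0.8)
```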
Why do you need a confusion matrix?
•It shows how any classification model is confused when it makes
predictions.
•The confusion matrix not only gives you insight into the errors being
made by your classifier but also the types of errors that are being made.
•This breakdown helps you to overcome the limitation of using
classification accuracy alone.
•Every column of the confusion matrix represents the instances of the
predicted class.
•Each row of the confusion matrix represents the instances of the
actual class.

Definitions
A true positive is an outcome where the
model correctly predicts the positive class. Similarly,
A true negative is an outcome where the
model correctly predicts the negative class.
A false positive is an outcome where the
model incorrectly predicts the positive class. And
A false negative is an outcome where the
model incorrectly predicts the negative class.
