Week 5


CHAPTER 4: Parametric Methods

MACHINE LEARNING
Dr. SAEED UR REHMAN
Department of Computer Science,
COMSATS University Islamabad, Wah Campus

Courtesy: ETHEM ALPAYDIN © The MIT Press

Summary
 Supervised learning is an important type of ML application. It requires labelled training and testing data.
 Making decisions under uncertainty has a long history; reasoning from meaningful evidence using probability theory is only a few hundred years old.
 Association rules are successfully used in many data mining applications, and we see such rules on many Web sites that recommend books, movies, music, and so on.
 The algorithm is very simple, and its efficient implementation on very large databases is critical.

Outline
Introduction
 What are parametric methods/algorithms
 Benefits and limitations of Parametric Algorithms
 Maximum Likelihood Estimation
 Bernoulli Density
 Multinomial Density
 Gaussian (Normal) Density
 Evaluating an Estimator:
 Bias and Variance
 Bias and Variance Examples
 Bias and Variance tradeoff

Outline (cont.)
 The Bayes’ Estimator
 Parametric Classification
 Regression
 Tuning Model Complexity: Bias/Variance Dilemma
 Model Selection Procedures

Parametric Estimation…Intro
A statistic is any value that is calculated from a given sample.
In statistical inference, we make a decision using the information provided by a
sample.
In parametric estimation, we assume that the sample is drawn from some
distribution that obeys a known model, for example, Gaussian.
The advantage of the parametric approach is that the model is defined up to a
small number of parameters—for example, mean, variance—the sufficient
statistics of the distribution.
Once those parameters are estimated from the sample, the whole distribution
is known.

Parametric Estimation…Intro
X = {x^t}_{t=1}^N where x^t ~ p(x)
Parametric estimation:
Assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²) where θ = {μ, σ²}

Parametric Estimation…Intro
Assumptions can greatly simplify the learning process, but can
also limit what can be learned.
Algorithms that simplify the function to a known form are called
parametric machine learning algorithms.
“A learning model that summarizes data with a set of parameters
of fixed size (independent of the number of training examples) is
called a parametric model. No matter how much data you throw
at a parametric model, it won’t change its mind about how many
parameters it needs.”
— Artificial Intelligence: A Modern Approach, page 737
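As a quick illustrative sketch of this definition (the data and the straight-line model below are assumptions for illustration, not from the lecture): a linear model keeps exactly two parameters no matter how much data it sees.

```python
import numpy as np

# Sketch: fit a simple linear model y = w1*x + w0 to samples of different
# sizes. The number of fitted parameters stays fixed at 2, which is the
# defining property of a parametric model.
rng = np.random.default_rng(1)
for n in (10, 100, 10_000):
    x = rng.uniform(0, 1, n)
    y = 3.0 * x + 1.0 + rng.normal(0, 0.1, n)   # synthetic data (assumed)
    w = np.polyfit(x, y, deg=1)                 # returns [w1, w0]
    print(n, len(w))                            # parameter count is always 2
```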
Parametric ML Algos (Examples)
Logistic Regression
Linear Discriminant Analysis
Perceptron
Naive Bayes
Simple Neural Networks

Parametric ML Algorithms: Benefits and Limitations

Benefits:
Simpler: easier to understand and interpret results.
Speed: very fast to learn from data.
Less Data: require less training data and can work well even if the fit to the data is not perfect.

Limitations:
Constrained: these methods are highly constrained to the specified form.
Limited Complexity: not suitable for complex problems.
Poor Fit: in practice the methods are unlikely to match the underlying mapping function.

Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is a technique for estimating the parameters of a given distribution using some observed data.
For example, if a population is known to follow a normal distribution but
the mean and variance are unknown, MLE can be used to estimate them using a
limited sample of the population, by finding particular values of the mean and
variance so that the observation is the most likely result to have occurred.
MLE is useful in
Econometrics
 MRIs
Satellite imaging
Bayesian statistics

Maximum Likelihood Estimation (cont.)
Likelihood of θ given the sample X:

l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)

Log likelihood (the log converts the product into a sum, which simplifies calculation):

L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)

Maximum likelihood estimator (MLE):

θ* = argmax_θ L(θ|X)
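The following is a minimal sketch of these three steps for a Bernoulli sample (the data are simulated; the grid search is just for illustration, since the closed-form MLE is derived on the next slides):

```python
import numpy as np

# Sketch: evaluate the log likelihood L(theta|X) = sum_t log p(x^t|theta)
# on a grid of candidate thetas and take the argmax.
rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=100)       # sample from Bernoulli(p = 0.7)

thetas = np.linspace(0.01, 0.99, 99)     # candidate parameter values
log_lik = [np.sum(X * np.log(t) + (1 - X) * np.log(1 - t)) for t in thetas]

theta_star = thetas[np.argmax(log_lik)]  # theta* = argmax_theta L(theta|X)
print(theta_star, X.mean())              # grid argmax matches the sample mean
```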

Maximum Likelihood Estimation (cont.)
Let us now see some distributions that arise in the applications we
are interested in.
If we have a two-class problem, the distribution we use is the Bernoulli
(the single-trial case of the binomial).
When there are K > 2 classes, its generalization is the multinomial.
Gaussian (normal) density is the one most frequently used for
modeling class-conditional input densities with numeric input.
For these three distributions, we discuss the maximum likelihood
estimators (MLE) of their parameters.

Maximum Likelihood Estimation (cont.)
Example (figure)

Examples: Bernoulli/Multinomial
In a Bernoulli distribution, there are two outcomes:
An event occurs or it does not;
For example, an instance is a positive example of the class, or it is not.
If the event occurs, the Bernoulli random variable X takes the value 1 with
probability p; the nonoccurrence of the event has probability 1 − p, denoted
by X taking the value 0.
This is written as

P(X = x) = p^x (1 − p)^(1−x), x ∈ {0, 1}
Examples: Bernoulli
The expected value and variance can be calculated as
E[X] = p and Var(X) = p(1 − p)

Examples: Bernoulli/Multinomial
Bernoulli: Two states, failure/success, x ∈ {0, 1}
P(x) = p_0^x (1 − p_0)^(1−x)
L(p_0|X) = log ∏_t p_0^{x^t} (1 − p_0)^{1−x^t}
MLE: p_0 = ∑_t x^t / N

Practical Applications
The binomial distribution is applicable to most situations in which a specific
target result is known, by designating the target as "success" and anything
other than the target as "failure", for example coin tossing, dice rolling, etc.

Examples: Multinomial
Consider the generalization of Bernoulli where, instead of two states, the
outcome of a random event is one of K mutually exclusive and exhaustive
states.
Multinomial: K > 2 states, x_i ∈ {0, 1}

P(x_1, x_2, ..., x_K) = ∏_i p_i^{x_i}

L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^{x_i^t}

MLE: p_i = ∑_t x_i^t / N
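A minimal sketch of this estimator on simulated data (the true probabilities below are assumed values for illustration):

```python
import numpy as np

# Sketch: with one-hot outcomes x^t, the multinomial MLE p_i is simply
# the fraction of the N trials that landed in state i.
rng = np.random.default_rng(0)
K, N = 3, 1000
true_p = [0.2, 0.5, 0.3]                  # assumed ground truth

states = rng.choice(K, size=N, p=true_p)  # draw N outcomes in {0, ..., K-1}
X = np.eye(K)[states]                     # one-hot encode: x_i^t in {0, 1}
p_hat = X.sum(axis=0) / N                 # MLE: p_i = sum_t x_i^t / N
print(p_hat)                              # close to true_p
```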

Example: Gaussian (Normal) Distribution
The Gaussian distribution (also known as the normal distribution) is a
bell-shaped curve; it is assumed that measured values follow a normal
distribution, with an equal number of measurements above and below
the mean value.

Example: Gaussian (Normal) Distribution
In order to understand normal distribution, it is important to know
the definitions of “mean,” “median,” and “mode.”
The “mean” is the calculated average of all values, the “median” is
the value at the center point (mid-point) of the distribution, while
the “mode” is the value that was observed most frequently during
the measurement.
If a distribution is normal, then the values of the mean, median,
and mode are the same.
However, the values of the mean, median, and mode may differ if the
distribution is skewed (not a Gaussian distribution).
Gaussian (Normal) Distribution
p(x) = N(μ, σ²)

p(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))

MLE for μ and σ²:

m = ∑_t x^t / N

s² = ∑_t (x^t − m)² / N
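A minimal sketch of these two estimators on simulated data (the true μ and σ² are assumed values for illustration):

```python
import numpy as np

# Sketch: m is the sample mean and s^2 the divide-by-N sample variance,
# exactly as in the MLE formulas above.
rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, size=5000)   # sample from N(mu = 10, sigma^2 = 9)

m = x.sum() / len(x)                   # m = sum_t x^t / N
s2 = ((x - m) ** 2).sum() / len(x)     # s^2 = sum_t (x^t - m)^2 / N
print(m, s2)                           # ~10 and ~9; same as np.mean / np.var
```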
Bias and Variance
In supervised machine learning, training data is used by an algorithm for
learning. The goal of any such ML algorithm is to best estimate the
mapping function (f) for the output variable (Y) given the input data (X).
This f is often called the target function because it is the function that a
given supervised machine learning algorithm aims to approximate.
Bias is the set of simplifying assumptions made by a model to make the
target function easier to learn.
Generally, linear algorithms have a high bias, making them fast to learn
and easier to understand, but generally less flexible.
In turn, they have lower predictive performance on complex problems
that fail to meet the simplifying assumptions of the algorithm's bias.

Bias and Variance
Variance is the amount by which the estimate of the target function would
change if different training data were used.
The target function is estimated from the training data by a machine learning
algorithm, so we should expect the algorithm to have some variance.
Ideally, it should not change too much from one training dataset to the next,
meaning that the algorithm is good at picking out the hidden underlying
mapping between the input and output variables.
Machine learning algorithms that have a high variance are strongly influenced
by the specifics of the training data. This means that the specifics of the
training data influence the number and types of parameters used to
characterize the mapping function.

Bias and Variance Example

Image Courtesy: https://machinelearningmastery.com/

Bias and Variance Example

Image courtesy
https://medium.com/@ml.at.berkeley/machine-learning-crash-course-part-4-the-bias-variance-dilemma-a94e60ec1d3

Bias and Variance Example

Figure: underfitting vs. overfitting

Bias and Variance
Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i

Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]
Mean square error:
r(d, θ) = E[(d − θ)²]
        = (E[d] − θ)² + E[(d − E[d])²]
        = Bias² + Variance
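A small simulation of this decomposition (the distribution, sample size, and number of trials are assumptions for illustration):

```python
import numpy as np

# Sketch: draw many samples X_i from N(mu, sigma^2), compute the estimator
# d_i = d(X_i) (here, the sample mean) on each, and estimate bias, variance,
# and mean squared error empirically.
rng = np.random.default_rng(0)
mu, sigma, n, trials = 5.0, 2.0, 20, 10_000

d = np.array([rng.normal(mu, sigma, n).mean() for _ in range(trials)])
bias = d.mean() - mu                     # b_theta(d) = E[d] - theta
variance = ((d - d.mean()) ** 2).mean()  # E[(d - E[d])^2]
mse = ((d - mu) ** 2).mean()             # r(d, theta) = E[(d - theta)^2]
print(bias**2 + variance, mse)           # the two sides agree up to noise
```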
Bias and Variance
The prediction error of an ML algorithm can be broken down into three parts:
Bias Error
Variance Error
Irreducible Error

Bias Error
Low Bias: Suggests fewer assumptions about the form of the target function.
High Bias: Suggests more assumptions about the form of the target
function.

Bias and Variance
Examples of low-bias machine learning algorithms include: Decision Trees, k-
Nearest Neighbors and Support Vector Machines.
Examples of high-bias machine learning algorithms include: Linear
Regression, Linear Discriminant Analysis and Logistic Regression.
Variance Error
Low Variance: Suggests small changes to the estimate of the target function with
changes to the training dataset.
High Variance: Suggests large changes to the estimate of the target function with
changes to the training dataset.
Generally, nonlinear machine learning algorithms that have a lot of flexibility have a
high variance. For example, decision trees have a high variance, which is even higher
if the trees are not pruned before use.

Bias-Variance Trade-Off
The goal of any supervised machine learning algorithm is to achieve low bias and low
variance. In turn, the algorithm should achieve good prediction performance.
The parameterization of machine learning algorithms is often a battle to balance out bias
and variance.
Below are two examples of configuring the bias-variance trade-off for specific
algorithms (see the sketch after this list):
The k-nearest neighbors algorithm has low bias and high variance, but the trade-off can
be changed by increasing the value of k, which increases the number of neighbors that
contribute to the prediction and in turn increases the bias of the model.
The support vector machine algorithm has low bias and high variance, but the trade-off
can be changed by increasing the C parameter, which influences the number of violations
of the margin allowed in the training data; this increases the bias but decreases the
variance.
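A hedged sketch of the k-NN half of this trade-off (the dataset and cross-validation setup are illustrative assumptions, not from the lecture): as k grows, the fold-to-fold spread of scores tends to shrink (lower variance) while the average score can drop (higher bias).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Sketch: score k-NN for several values of k; the std of the fold scores
# is a rough proxy for variance, the mean for (inverse) bias.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
for k in (1, 5, 25, 101):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean().round(3), scores.std().round(3))
```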

Bias-Variance Trade-Off
To summarize bias and variance, we can say that there is no
escaping the relationship between bias and variance in ML:
Increasing the bias will decrease the variance.
Increasing the variance will decrease the bias.
There is a trade-off at play between these two concerns, and the
algorithms you choose and the way you choose to configure them
find different balances in this trade-off for your problem.

Summary
In statistical inference, we make a decision using the information
provided by a sample.
In parametric estimation, we assume that the sample is drawn from
some distribution that obeys a known model.
In a parametric model, all of the training instances affect the final
global estimate.
We saw how we can estimate these probabilities from a given training
set.
We started with the parametric approach for classification and regression.
We also took a look at the bias/variance dilemma and model selection
methods for trading off model complexity and empirical error.

References
This lecture is prepared from the following resources:
https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/gaussian-distribution
https://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/
Artificial Intelligence: A Modern Approach (4th Edition), Pearson Series in Artificial Intelligence
https://brilliant.org/wiki/maximum-likelihood-estimation-mle/
https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788830577/1/ch01lvl1sec15/bias-variance-trade-off
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning
https://study.com/academy/lesson/bayes-estimator-definition-examples.html
https://ch.mathworks.com/help/stats/introduction-to-parametric-classification.html

Next Lecture
Multivariate Methods
Introduction
 Parameter Estimation
 Estimation of Missing Values
 Multivariate Normal Distribution
 Multivariate Classification
 Tuning Complexity
 Discrete Features
 Multivariate Regression

Summary of Ch 1
 What Is Machine Learning?
 Examples of Machine Learning Applications
Learning Associations
Classification
Regression
Unsupervised Learning
Reinforcement Learning

Outline: Parametric Methods

Ch 2: Supervised Learning
 Supervised learning is the machine learning task of learning a function
that maps an input to an output based on example input-output pairs.
Therefore we need:
 Training data: a supervised learning algorithm analyzes the training
data and produces an inferred function, which can be used for mapping
new examples.
 This requires the learning algorithm to generalize from the training
data to unseen situations in a "reasonable" way.
Supervised Learning
Supervised [labelled data]
◦ Classification
◦ Regression

Unsupervised [unlabelled data]
◦ Clustering
◦ Association

Semi-supervised [some labelled data]

Reinforcement Learning [reward based]

Learning a Class from Examples
Class C of a “family car”
◦ Prediction: Is car x a family car?
◦ Knowledge extraction: What do people expect from a family
car?
Output:
Positive (+) and negative (–) examples
Input representation:
x1: price, x2: engine power

Training set X

X = {x^t, r^t}_{t=1}^N

r^t = 1 if x^t is positive
      0 if x^t is negative

x = [x1, x2]^T
Class C

(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)


Hypothesis class H

h(x) = 1 if h says x is positive
       0 if h says x is negative
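As a toy sketch of this hypothesis (the thresholds p1, p2, e1, e2 below are made-up values, not from the lecture):

```python
# Sketch: the axis-aligned rectangle hypothesis for the family-car class.
def h(price, engine_power, p1=15_000, p2=30_000, e1=80, e2=150):
    """Return 1 if h says x is positive (a family car), else 0."""
    return int(p1 <= price <= p2 and e1 <= engine_power <= e2)

print(h(20_000, 100))  # 1: inside the rectangle, so positive
print(h(40_000, 200))  # 0: outside, so negative
```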

Outcomes of Classification

Confusion Matrix
The confusion matrix visualizes the accuracy of a classifier by
comparing the actual and predicted classes. The binary confusion
matrix is composed of four squares:

https://www.guru99.com/confusion-matrix-machine-learning-example.html
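A minimal sketch of building one with scikit-learn (the labels below are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# Sketch: rows of the matrix are actual classes, columns are predicted
# classes, matching the convention described on the following slides.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()   # the binary case unpacks to TN, FP, FN, TP
print(cm)
print(tn, fp, fn, tp)         # 4 1 1 4
```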

Accuracy, Precision, Recall
Interpretation of Performance Measures
Accuracy tells you how often the ML model was correct overall.
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision is how good the model is at predicting a specific category.
Precision = TP / (TP + FP)
Recall tells you how many times the model was able to detect a
specific category.
Recall = TP / (TP + FN)
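A minimal sketch computing these three measures from the counts (using the made-up counts from the confusion-matrix sketch above):

```python
def metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

print(metrics(tp=4, tn=4, fp=1, fn=1))  # (0.8, 0.8, 0.8)
```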
Why do you need a confusion matrix?
•It shows how any classification model is confused when it makes
predictions.
•The confusion matrix not only gives you insight into the errors being
made by your classifier but also the types of errors that are being made.
•This breakdown helps you to overcome the limitation of using
classification accuracy alone.
•Every column of the confusion matrix represents the instances of the
predicted class.
•Each row of the confusion matrix represents the instances of the
actual class.

Definitions
A true positive is an outcome where the
model correctly predicts the positive class. Similarly,
A true negative is an outcome where the
model correctly predicts the negative class.
A false positive is an outcome where the
model incorrectly predicts the positive class. And
A false negative is an outcome where the
model incorrectly predicts the negative class.
