Date: 28-11-2023, Venue: Saveetha School of Engineering
1
Adversarial Machine Learning
Dr. Saravanan.M.S
Innovation Ambassador,
IIC of Saveetha Institute of Medical and Technical Sciences,
Professor, Institute of CSE, SSE, SIMATS.
Outline
3
Machine Learning Basics
[Figure: the supervised learning workflow — labeled data is used in a training phase to obtain a learned model, which is then used in a prediction phase on new data]
4
Machine Learning Types
Machine Learning Basics
[Figure: examples of machine learning task types — classification (separating class A from class B), regression, and clustering]
5
Supervised Learning
Machine Learning Basics
6
Unsupervised Learning
Machine Learning Basics
7
Nearest Neighbor Classifier
Machine Learning Basics
• Nearest Neighbor – for each test data point, assign the class label of the nearest training data point
Adopt a distance function to find the nearest neighbor
o Calculate the distance to each data point in the training set, and assign the class of the nearest data point (minimum distance)
It does not require learning a set of weights
[Figure: a test example in feature space is assigned the label of its nearest neighbor among the training examples from class 1 and class 2]
8
Nearest Neighbor Classifier
Machine Learning Basics
Distance function: e.g., the ℓ1 norm (Manhattan distance), d(p, q) = Σi |pi − qi| (see the sketch below)
[Figure: a 2-D feature space (x1, x2) with training examples from two classes; each test point is assigned to the class of its nearest neighbor]
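As a rough illustration of the nearest-neighbor rule (not taken from the slides), the following NumPy sketch classifies test points by the label of the closest training point under an ℓ1 or ℓ2 distance; the toy data is made up.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_test, ord=1):
    """Assign each test point the label of its nearest training point.
    ord=1 gives the Manhattan (L1) distance, ord=2 the Euclidean (L2) distance."""
    y_pred = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, ord=ord, axis=1)  # distance to every training point
        y_pred.append(y_train[np.argmin(dists)])              # label of the closest one
    return np.array(y_pred)

# Toy example: two classes in a 2-D feature space
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y_train = np.array([0, 0, 1, 1])
print(nearest_neighbor_predict(X_train, y_train, np.array([[1.1, 0.9], [3.9, 4.0]])))  # -> [0 1]
```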
• Linear classifier
Find a linear function f of the inputs xi that separates the classes
Use pairs of inputs and labels to find the weights matrix W and the bias vector b
o The weights and biases are the parameters of the function f
Several methods have been used to find the optimal set of parameters of a linear classifier
o A common choice is the Perceptron algorithm, where the parameters are updated until a minimal error is reached (single layer, does not use backpropagation); see the sketch below
The linear classifier is a simple approach, but it is a building block of more advanced classification algorithms, such as SVMs and neural networks
o Earlier multi-layer neural networks were referred to as multi-layer perceptrons (MLPs)
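A minimal sketch of the classical Perceptron update rule mentioned above; the toy linearly separable data and the learning-rate value are made up for illustration.

```python
import numpy as np

def perceptron_train(X, y, epochs=20, lr=1.0):
    """Single-layer perceptron: learn w and b so that sign(w·x + b) matches labels y in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # misclassified point -> update the parameters
                w += lr * yi * xi
                b += lr * yi
    return w, b

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])  # toy data
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))  # -> [ 1.  1. -1. -1.]
```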
11
Linear Classifier
Machine Learning Basics
13
Linear vs Non-linear Techniques
14
Linear vs Non-linear Techniques
15
Non-linear Techniques
Linear vs Non-linear Techniques
• Non-linear classification
Features are obtained as non-linear functions of the inputs
It results in non-linear decision boundaries
Can deal with non-linearly separable data
[Slide shows example equations for the inputs, the non-linear features of the inputs, and the outputs]
16
Non-linear Support Vector Machines
Linear vs Non-linear Techniques
• Non-linear SVM
The original input space is mapped to a higher-dimensional feature space where the training set is linearly separable
Define a non-linear kernel function to calculate a non-linear decision boundary in the original feature space (see the sketch below)
Φ : x ↦ φ(x)
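As a hedged sketch of the kernel idea, scikit-learn's SVC with an RBF kernel separates XOR-like data that is not linearly separable in the original 2-D space; the data and the gamma/C values are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: not linearly separable in the original 2-D input space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# The RBF kernel implicitly maps x -> phi(x) into a higher-dimensional space
# where the two classes become separable
clf = SVC(kernel="rbf", gamma=2.0, C=10.0)
clf.fit(X, y)
print(clf.predict(X))  # expected: [0 1 1 0]
```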
17
Binary vs Multi-class Classification
18
Binary vs Multi-class Classification
19
Computer Vision Tasks
Machine Learning Basics
Picture from: Fei-Fei Li, Andrej Karpathy, Justin Johnson – Understanding and Visualizing CNNs
20
Thank you
Picture from: Fei-Fei Li, Andrej Karpathy, Justin Johnson – Understanding and Visualizing CNNs
21
No-Free-Lunch Theorem
Machine Learning Basics
22
ML vs. Deep Learning
Introduction to Deep Learning
23
ML vs. Deep Learning
Introduction to Deep Learning
25
Why is DL Useful?
Introduction to Deep Learning
26
Representational Power
Introduction to Deep Learning
27
Introduction to Neural Networks
[Figure: a 16 × 16 handwritten-digit image (256 pixels; ink → 1, no ink → 0) is fed into the network as inputs x1, …, x256; each output y1, …, y10 represents the confidence of a digit, e.g., y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), y10 = 0.2 ("is 0"), so the image is classified as "2"]
28
Introduction to Neural Networks
[Figure: the classifier is a function f : ℝ²⁵⁶ → ℝ¹⁰ that maps the pixel inputs x1, …, x256 to the outputs y1, …, y10, so the machine outputs "2"]
The function is represented by a neural network
29
Elements of Neural Networks
Introduction to Neural Networks
z = a1·w1 + a2·w2 + ⋯ + aK·wK + b
a = σ(z)
[Figure: a single neuron — the inputs a1, …, aK are weighted by w1, …, wK, the bias b is added to give z, and the activation function σ produces the output a]
30
Elements of Neural Networks
Introduction to Neural Networks
Hidden layer: h = σ(W1 x + b1)
o W1 are the weights, b1 are the biases, and σ is the activation function
31
Elements of Neural Networks
Introduction to Neural Networks
32
Elements of Neural Networks
Introduction to Neural Networks
[Figure: a fully connected network from inputs x1, …, xN to outputs y1, …, yM; for one neuron with inputs 1 and −1, weights 1 and −2, and bias 1, the weighted sum is (1 · 1) + (−1) · (−2) + 1 = 4]
34
Elements of Neural Networks
Introduction to Neural Networks
f : ℝ² → ℝ², f([1, −1]) = [0.62, 0.83]
35
Matrix Operation
Introduction to Neural Networks
a = σ(W x + b), with W = [ 1 −2 ; −1 1 ], x = [ 1 ; −1 ], b = [ 1 ; 0 ]
σ( [ 1 −2 ; −1 1 ] [ 1 ; −1 ] + [ 1 ; 0 ] ) = σ( [ 4 ; −2 ] ) = [ 0.98 ; 0.12 ]
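A quick NumPy check of the numbers above, assuming σ is the logistic sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, -2.0],
              [-1.0, 1.0]])
x = np.array([1.0, -1.0])
b = np.array([1.0, 0.0])

a = sigmoid(W @ x + b)     # W x + b = [4, -2]
print(np.round(a, 2))      # -> [0.98 0.12]
```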
36
Matrix Operation
Introduction to Neural Networks
[Figure: the first layer maps the input vector x = (x1, …, xN) to the first hidden activation vector a1]
a1 = σ(W1 x + b1)
37
Matrix Operation
Introduction to Neural Networks
[Figure: a network with L layers and parameters W1, b1, W2, b2, …, WL, bL maps x = (x1, …, xN) to y = (y1, …, yM)]
a1 = σ(W1 x + b1)
a2 = σ(W2 a1 + b2)
…
y = σ(WL aL−1 + bL)
38
Matrix Operation
Introduction to Neural Networks
[Figure: the same network viewed as a single function from the input x to the output y]
y = f(x) = σ( WL ⋯ σ( W2 σ( W1 x + b1 ) + b2 ) ⋯ + bL )
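The composed function above can be written as a short loop; this is only a sketch with made-up layer sizes and a sigmoid activation at every layer, as on the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """y = f(x) = sigma(WL ... sigma(W2 sigma(W1 x + b1) + b2) ... + bL)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # one layer: a <- sigma(W a + b)
    return a

# Made-up shapes: 4 inputs -> 3 hidden units -> 2 outputs
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
biases = [np.zeros(3), np.zeros(2)]
print(forward(rng.standard_normal(4), weights, biases))
```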
39
Softmax Layer
Introduction to Neural Networks
[Figure: an output layer that applies the activation to each logit independently — for z = (3, 1, −3), y = (σ(3), σ(1), σ(−3)) = (0.95, 0.73, 0.05)]
40
Softmax Layer
Introduction to Neural Networks
[Figure: the softmax layer exponentiates each logit and normalizes by the sum]
yi = e^zi / Σj e^zj, j = 1, …, 3
e.g., for z = (3, 1, −3): e^3 ≈ 20, e^1 ≈ 2.7, e^−3 ≈ 0.05, so y ≈ (0.88, 0.12, ≈ 0)
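A small sketch that reproduces the softmax numbers above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(np.round(softmax(np.array([3.0, 1.0, -3.0])), 2))  # -> [0.88 0.12 0.  ]
```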
41
Activation Functions
Introduction to Neural Networks
f(x): ℝⁿ → [0, 1], applied element-wise (e.g., the sigmoid function)
[Figure: plot of the activation function f(x) vs x]
43
Activation: Tanh
Introduction to Neural Networks
f(x): ℝⁿ → [−1, 1], applied element-wise
[Figure: plot of the tanh activation f(x) vs x]
44
Activation: ReLU
Introduction to Neural Networks
45
Activation: Leaky ReLU
Introduction to Neural Networks
46
Activation: Linear Function
Introduction to Neural Networks
47
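The activation functions from the slides above (sigmoid, tanh, ReLU, leaky ReLU, linear) in one NumPy sketch; the leaky-ReLU slope of 0.01 is a common default, not a value from the slides.

```python
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))   # R -> (0, 1)
def tanh(x):               return np.tanh(x)                 # R -> (-1, 1)
def relu(x):               return np.maximum(0.0, x)         # max(0, x)
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)  # small slope for x < 0
def linear(x):             return x                          # identity

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu, linear):
    print(f.__name__, np.round(f(x), 3))
```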
Training NNs
Training Neural Networks
[Figure: the digit-recognition network — inputs x1, …, x256 from a 16 × 16 image, hidden layers, a softmax output layer, and outputs y1, …, y10 giving the confidence of each digit, e.g., y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), y10 = 0.2 ("is 0")]
48
Training NNs
Training Neural Networks
• To train a NN, set the parameters such that, for the images in the training subset, the output element corresponding to the correct class has the maximum value
50
Training NNs
Training Neural Networks
[Figure: for an input image with true label "1", the network outputs ŷ = (0.2, 0.3, …, 0.5) while the target is (1, 0, …, 0); the cost ℒ(θ) measures the difference between the predicted output and the target]
51
Training NNs
Training Neural Networks
• For a training set of images, calculate the total loss over all images: ℒ(θ) = Σn ℒn(θ)
• Find the optimal parameters θ* that minimize the total loss
[Figure: each training input xn is passed through the NN to produce a prediction ŷn, which is compared with the label yn to give a per-example loss ℒn(θ); the losses of the N examples are summed]
52
Loss Functions
Training Neural Networks
• Classification tasks
Cross-entropy loss function:
ℒ(θ) = −(1/N) Σ_{i=1..N} Σ_{k=1..K} [ y_k^(i) log ŷ_k^(i) + (1 − y_k^(i)) log(1 − ŷ_k^(i)) ]
y_k^(i) are the ground-truth class labels and ŷ_k^(i) are the model-predicted class labels (see the sketch below)
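A direct NumPy sketch of the cross-entropy formula above; the predictions and one-hot labels are made up.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy averaged over N examples; y_true, y_pred have shape (N, K)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    per_example = -np.sum(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred), axis=1)
    return per_example.mean()

y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)   # one-hot labels
y_pred = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])    # model predictions
print(cross_entropy(y_true, y_pred))
```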
53
Loss Functions
Training Neural Networks
• Regression tasks
Output layer: linear (identity) or sigmoid activation
Loss functions:
Mean Squared Error: ℒ(θ) = (1/n) Σ_{i=1..n} ( y^(i) − ŷ^(i) )²
Mean Absolute Error: ℒ(θ) = (1/n) Σ_{i=1..n} | y^(i) − ŷ^(i) |
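And the two regression losses above, with made-up targets and predictions:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.5])
print(mse(y_true, y_pred), mae(y_true, y_pred))  # approximately 0.1 and 0.267
```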
54
Training NNs
Training Neural Networks
[Figure: the loss ℒ(θ) plotted against a single parameter θi, with the gradient ∂ℒ/∂θi giving the slope]
55
Gradient Descent Algorithm
Training Neural Networks
Parameter update: θ ← θ − α ∇ℒ(θ), where α is the learning rate
[Figure: the loss surface over the parameters, with gradient-descent steps moving toward a minimum]
56
Gradient Descent Algorithm
Training Neural Networks
1. Start from an initial set of parameters θ⁰ = (w1, w2)
2. Compute the gradient at θ⁰, ∇ℒ(θ⁰)
3. Multiply by the learning rate and update the parameters: θ¹ = θ⁰ − α ∇ℒ(θ⁰)
4. Go to step 2, repeat
∇ℒ(θ⁰) = [ ∂ℒ(θ⁰)/∂w1 , ∂ℒ(θ⁰)/∂w2 ]
[Figure: contours of ℒ over (w1, w2); starting from θ⁰, steps in the direction −∇ℒ(θ⁰) move toward the minimum θ*]
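A sketch of the loop above on a made-up convex loss ℒ(w1, w2) = (w1 − 3)² + (w2 + 1)², whose minimum is known to be (3, −1):

```python
import numpy as np

def grad(theta):
    # Gradient of the made-up loss L(w1, w2) = (w1 - 3)^2 + (w2 + 1)^2
    return np.array([2 * (theta[0] - 3.0), 2 * (theta[1] + 1.0)])

theta = np.array([0.0, 0.0])   # step 1: initial parameters theta^0
lr = 0.1                       # learning rate
for _ in range(100):           # steps 2-4: compute gradient, update, repeat
    theta = theta - lr * grad(theta)
print(np.round(theta, 3))      # -> close to [ 3. -1.]
```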
57
Gradient Descent Algorithm
Training Neural Networks
• Example (contd.)
4. Go to step 2, repeat
[Figure: successive gradient-descent updates from θ⁰ shown on the (w1, w2) loss contours]
58
Gradient Descent Algorithm
Training Neural Networks
• For most tasks, the loss surface is highly complex (and non-convex)
• Random initialization in NNs results in different initial parameters every time the NN is trained
Gradient descent may reach different minima at every run
Therefore, the NN may produce different predicted outputs
• In addition, currently we don't have algorithms that guarantee reaching a global minimum for an arbitrary loss function
[Figure: a non-convex loss surface ℒ over (w1, w2) with multiple local minima]
60
Backpropagation
Training Neural Networks
62
Stochastic Gradient Descent
Training Neural Networks
63
Problems with Gradient Descent
Training Neural Networks
• Besides the local minima problem, the GD algorithm can be very slow at plateaus, and it can get stuck at saddle points
[Figure: a 1-D cost curve over θ with a plateau where ∇ℒ(θ) ≈ 0, a saddle point where ∇ℒ(θ) = 0, and a local minimum where ∇ℒ(θ) = 0]
64
Gradient Descent with Momentum
Training Neural Networks
Movement = Negative of Gradient + Momentum
[Figure: on the cost curve over θ, the real movement combines the negative gradient with the momentum term, which keeps the parameters moving past a point where the gradient = 0]
65
Gradient Descent with Momentum
Training Neural Networks
This term is analogous to the momentum of a heavy ball rolling down the hill
• The parameter is referred to as the coefficient of momentum
A typical value of the parameter is 0.9
• This method updates the parameters in the direction of the weighted average of the past gradients (see the sketch below)
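A common formulation of the momentum update (a sketch, not necessarily the exact form on the slides), applied to the same made-up quadratic loss:

```python
import numpy as np

def grad(theta):
    return np.array([2 * (theta[0] - 3.0), 2 * (theta[1] + 1.0)])  # made-up loss

theta = np.array([0.0, 0.0])
v = np.zeros_like(theta)       # running (weighted) average of past gradients
lr, beta = 0.05, 0.9           # beta = 0.9 is the typical momentum coefficient
for _ in range(200):
    v = beta * v - lr * grad(theta)   # momentum term + negative gradient
    theta = theta + v
print(np.round(theta, 3))      # -> approximately [ 3. -1.]
```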
66
Nesterov Accelerated Momentum
Training Neural Networks
[Figure: comparison of the update trajectories of GD with momentum and GD with Nesterov momentum]
68
Learning Rate
Training Neural Networks
• Learning rate
The gradient tells us the direction in which the loss has the steepest rate
of increase, but it does not tell us how far along the opposite direction we
should step
Choosing the learning rate (also called the step size) is one of the most
important hyper-parameter settings for NN training
[Figure: loss curves when the learning rate is too small (slow decrease) and when it is too large (the updates overshoot)]
69
Learning Rate
Training Neural Networks
71
Vanishing Gradient Problem
Training Neural Networks
• In some cases, during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients)
They result in very small or very large updates of the parameters
Solutions: change the learning rate, ReLU activations, regularization, LSTM units in RNNs
[Figure: a deep fully connected network from x1, …, xN to y1, …, yM]
72
Generalization
• Underfitting
The model is too “simple” to represent all the relevant class characteristics
E.g., a model with too few parameters
Produces high error on the training set and high error on the validation set
• Overfitting
The model is too “complex” and fits irrelevant characteristics (noise) in the data
E.g., a model with too many parameters
Produces low error on the training set but high error on the validation set
73
Overfitting
Generalization
• Overfitting – a model with high capacity fits the noise in the data instead of the underlying relationship
• Weight decay
A regularization term that penalizes large weights is added to the loss function: total loss = data loss + regularization loss
For every weight in the network, we add the regularization term to the loss value
o During the gradient descent parameter update, every weight is decayed linearly toward zero
The weight decay coefficient determines how dominant the regularization is during the gradient computation (see the sketch below)
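A sketch of ℓ2 weight decay: the regularization term λ‖w‖² adds 2λw to the gradient, so each update shrinks the weights toward zero; the data loss and the value of λ below are made up.

```python
import numpy as np

def data_loss_grad(w):
    return 2 * (w - np.array([1.0, -2.0]))   # gradient of a made-up data loss

w = np.zeros(2)
lr, lam = 0.1, 0.05                          # lam: weight decay coefficient
for _ in range(200):
    g = data_loss_grad(w) + 2 * lam * w      # data gradient + regularization gradient
    w = w - lr * g                           # each step decays w toward zero
print(np.round(w, 3))                        # slightly shrunk relative to [1, -2]
```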
75
Regularization: Weight Decay
Regularization
76
Regularization: Weight Decay
Regularization
• Weight decay
The regularization term is based on a norm of the weights, typically the ℓ2 norm
77
Regularization: Dropout
Regularization
• Dropout
Randomly drop units (along with their connections) during training
Each unit is retained with a fixed probability p, independent of the other units
The hyper-parameter p needs to be chosen (tuned); see the sketch below
o Often, between 20% and 50% of the units are dropped
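A sketch of (inverted) dropout: each unit is kept with probability p and the surviving activations are rescaled by 1/p so their expected value is unchanged; the rescaling convention is a common implementation choice, not something stated on the slides.

```python
import numpy as np

def dropout(a, p=0.8, training=True, rng=np.random.default_rng(0)):
    """Keep each activation with probability p; zero out the rest during training."""
    if not training:
        return a
    mask = rng.random(a.shape) < p   # 1 with probability p, 0 otherwise
    return a * mask / p              # rescale so the expectation stays the same

a = np.ones(10)                      # made-up layer activations
print(dropout(a, p=0.8))             # roughly 20% of the units are zeroed out
```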
78
Regularization: Dropout
Regularization
……
79
Regularization: Early Stopping
Regularization
• Early stopping
During model training, use a validation set
o E.g., a validation/train split of about 25% / 75%
Stop when the validation accuracy (or loss) has not improved for n epochs (see the sketch below)
o The parameter n is called the patience
[Figure: training and validation error curves over epochs — training stops when the validation error stops improving]
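A sketch of the patience rule with a made-up sequence of validation losses standing in for a real training loop:

```python
# Early stopping with patience: stop when the validation loss has not
# improved for `patience` consecutive epochs.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.50, 0.51, 0.52, 0.53]  # made up

patience = 3
best_loss, best_epoch, epochs_without_improvement = float("inf"), -1, 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, best_epoch = val_loss, epoch
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"stop at epoch {epoch}, best epoch was {best_epoch}")
            break
# -> stop at epoch 7, best epoch was 4
```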
80
Batch Normalization
Regularization
81
Hyper-parameter Tuning
82
Hyper-parameter Tuning
• Grid search
Check all values in a range with a step value
• Random search
Randomly sample values for each hyper-parameter (see the sketch below)
Often preferred to grid search
• Bayesian hyper-parameter optimization
Is an active area of research
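A sketch of random search over a small hyper-parameter space; the ranges and the score function are made up (in practice the score would come from training and validating the model).

```python
import random

random.seed(0)
search_space = {
    "learning_rate": lambda: 10 ** random.uniform(-4, -1),
    "batch_size":    lambda: random.choice([16, 32, 64, 128]),
    "dropout_p":     lambda: random.uniform(0.2, 0.5),
}

def score(config):
    # Made-up stand-in for "train the model and return the validation score"
    return -abs(config["learning_rate"] - 0.01) - 0.001 * abs(config["batch_size"] - 64)

best_score, best_config = float("-inf"), None
for _ in range(20):
    config = {name: sample() for name, sample in search_space.items()}
    s = score(config)
    if s > best_score:
        best_score, best_config = s, config
print(best_config)
```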
83
k-Fold Cross-Validation
84
k-Fold Cross-Validation
86
Deep vs Shallow Networks
[Figure: a shallow NN and a deep NN mapping the same inputs x1, x2, …, xN to the output]
87
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks
[Figure: a 3 × 3 convolutional filter is scanned over the input matrix]
• When the convolutional filters are scanned over the image, they
capture useful features
E.g., edge detection by convolutions
Filter (a 3 × 3 Laplacian-style edge detector): [ 0 1 0 ; 1 −4 1 ; 0 1 0 ]
[Figure: the grid of pixel values of the input image and of the convolved (edge-detected) output image]
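A sketch of edge detection by convolution using the Laplacian-style filter above (SciPy's convolve2d); the 8 × 8 test image is made up.

```python
import numpy as np
from scipy.signal import convolve2d

laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]])

image = np.zeros((8, 8))      # made-up binary image with a bright square
image[2:6, 2:6] = 1.0

edges = convolve2d(image, laplacian, mode="same")
print(edges)                  # non-zero responses only along the edges of the square
```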
89
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks
[Figure: Filter 1 (weights w1–w4) convolved with the input image produces the Layer 1 feature map; Filter 2 (weights w5–w8) convolved with that map produces the Layer 2 feature map]
90
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks
91
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks
[Figure: a deep CNN for scene classification — convolutional layers with 64, 64, 128, 128, 256, 256, 512, 512, … channels interleaved with max-pooling layers, followed by a fully connected layer that outputs classes such as Bedroom, Kitchen, Bathroom, Outdoor]
92
Residual CNNs
Convolutional Neural Networks
93
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks
• Recurrent NNs are used for modeling sequential data and data with varying input and output lengths
E.g., videos, text, speech, DNA sequences, human skeletal data
• RNNs introduce recurrent connections between the neurons
This allows processing sequential data one element at a time by selectively passing information across the sequence
Memory of the previous inputs is stored in the model's internal state and affects the model predictions
Can capture correlations in sequential data
• RNNs use backpropagation-through-time for training
• RNNs are more sensitive to the vanishing gradient problem than CNNs
94
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks
• RNNs use the same set of weights across all time steps
A sequence of hidden states is learned, which represents the memory of the network
The hidden state at step t, ht, is calculated from the previous hidden state ht−1 and the input at the current step xt, i.e., ht = f(wh ht−1 + wx xt)
The function f is a nonlinear activation function, e.g., ReLU or tanh
• RNN shown unrolled over time (see the sketch below)
[Figure: the input sequence x1, x2, x3 is processed with shared weights wx, wh, wy, producing the hidden states h0 → h1 → h2 → h3 and the output]
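A minimal single-unit RNN forward pass matching the unrolled picture above (scalar weights shared across time steps); the weight values and the input sequence are made up.

```python
import numpy as np

w_h, w_x, w_y = 0.5, 1.0, 2.0         # shared weights across all time steps
h = 0.0                               # initial hidden state h0
for x_t in [1.0, -0.5, 0.25]:         # input sequence x1, x2, x3
    h = np.tanh(w_h * h + w_x * x_t)  # h_t = f(w_h * h_{t-1} + w_x * x_t)
    print(round(h, 3))
y = w_y * h                           # read the output from the last hidden state
print("output:", round(y, 3))
```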
95
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks
• RNNs can have one or many inputs and one or many outputs
Image captioning (one input image, a sequence of output words): e.g., "A person riding a motorbike on dirt road"
Machine translation (an input word sequence, an output word sequence): e.g., "Happy Diwali" → "शुभ दीपावली"
96
Bidirectional RNNs
Recurrent Neural Networks
Forward hidden states: h⃗_t = σ( W⃗^(hh) h⃗_{t−1} + W⃗^(hx) x_t )
Backward hidden states: h⃖_t = σ( W⃖^(hh) h⃖_{t+1} + W⃖^(hx) x_t )
Output: y_t = f( [ h⃗_t ; h⃖_t ] )
97
LSTM Networks
Recurrent Neural Networks
98
LSTM Networks
Recurrent Neural Networks
• LSTM cell
Input gate, output gate, forget gate, memory cell
LSTM can learn long-term correlations within data sequences
99
References
100